Using Azure DevOps, Databricks Spark, Cosmos DB Gremlin API and Azure Data Factory

A. Introduction

Creating a data pipeline is one thing; bringing it into production is another. This is especially true for a modern data pipeline in which multiple services are used for advanced analytics. Examples are transforming unstructured data into structured data, training ML models and embedding OCR. Integrating multiple services can be complicated, and deployment to production has to be controlled. In this blog, an example project is provided as follows:

  • 1. Set up an Azure DevOps project for continuous deployment
  • 2. Deploy the Azure resources of the data pipeline using infrastructure as code
  • 3. Run and monitor the data pipeline

The code from the project can be found here; the steps of the modern data pipeline are depicted below. …

Manage access to your app with identities, roles and permissions


The Microsoft identity platform is key to securing access to your web app. Users can authenticate to your app using their Azure AD identity or social accounts. The authorization model can be used to grant permissions to your backend app or to standard APIs like Microsoft Graph. In this blog, a web application is discussed that does the following:

  • 1. Azure AD login with user role: basic or user role: premium
  • 2. Access to MS Graph using delegated permissions of signed-in user
  • 3. Access to backend using application permissions and app role

The code of the project can be found here; the architecture can be found below. …

Learn to use identities and tokens in web apps and Azure SQL

1. Introduction

Python Flask is a popular tool to create web applications. Using Azure AD, users can authenticate to the REST APIs and retrieve data from Azure SQL. In this blog, a sample Python web application is created as follows:

  • 1a: User logs in to web app and acquires a token
  • 1b: User calls a REST API to request a dataset
  • 2: Web app uses claims in token to verify user access to dataset
  • 3: Web app retrieves data from Azure SQL. Web app can be configured such that either a) the managed identity of the app or b) the signed-in user identity is used for authentication to the…
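
Step 2 above (using claims in the token to verify access to a dataset) could look roughly like the following. This is a minimal sketch, not the code of the blog's project: it assumes the token has already been validated (signature, issuer, audience, expiry) and decoded into a dictionary, and the `DATASET_ROLES` mapping and the function name `user_can_read` are hypothetical. Azure AD places assigned app roles in the `roles` claim.

```python
from typing import Any, Mapping

# Hypothetical mapping from dataset name to the app role required to read it.
DATASET_ROLES = {"customers": "premium", "products": "basic"}


def user_can_read(claims: Mapping[str, Any], dataset: str) -> bool:
    """Return True when the `roles` claim contains the role the dataset requires.

    `claims` is assumed to be the payload of an already validated token.
    """
    required = DATASET_ROLES.get(dataset)
    if required is None:
        return False  # unknown dataset: deny by default
    roles = claims.get("roles", [])
    return required in roles
```

Denying by default for unknown datasets and for tokens without a `roles` claim keeps the check safe when configuration is incomplete.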

Learn to write graph data in Cosmos DB using Gremlin and then to read/analyze data in Azure Databricks with GraphFrames.

0. Introduction

Azure Cosmos DB is a fully managed multi-model database service. It enables you to build highly responsive applications worldwide. As part of Cosmos DB, Gremlin is supported for graph databases. Since Cosmos DB is optimized for fast transactional processing (OLTP), traversal limits may apply for heavy analytic workloads (OLAP). In that case, Azure Databricks and GraphFrames can be used as an alternative for advanced analytics; see also the architecture below.

0. Architecture (image by Author)

In the remainder of this blog, the following is done:

  • OLTP: write graph data to Cosmos DB using the Gremlin API and Python
  • OLAP: read data from Cosmos DB, analyze data in Azure…

Learn to create an Azure Function using a custom docker image to run a Selenium web scraper in Python

A. Introduction

Selenium is the standard tool for automated web browser testing. On top of that, Selenium is a popular tool for web scraping. When creating a web scraper in Azure, Azure Functions is a logical candidate to run your code in. However, the default Azure Functions image does not contain the dependencies that Selenium requires. In this blog, a web scraper in Azure Functions is created as follows:

  • Create and deploy docker image as Azure Function with Selenium
  • Scrape websites periodically and store results

The architecture of the web scraper is depicted below.

A. Architecture to build a Selenium web scraper (image by Author)

In the remainder of this blog, the steps to deploy and run your web scraper in Azure Functions are discussed. For details on how to secure your Azure Functions, see this blog. For details on how to create a custom docker image with OpenCV in Azure Functions, see here and the Dockerfile here. …

Learn how to automatically back up your data lake using blob snapshots and Data Factory incremental backups

1. Azure Storage backup - Introduction

Azure Storage always stores multiple copies of your data. When Geo-redundant Storage (GRS) is used, the data is also replicated to the paired region. This way, GRS prevents data loss in case of a disaster. However, GRS cannot prevent data loss when application errors corrupt data. Corrupted data is then just replicated to other zones/regions. In that case, a backup is needed to restore your data. Two backup strategies are as follows:

  • Snapshot creation: When a blob is added or modified, a snapshot is created of the current situation. Because of the nature of blobs, this is an efficient O(1) operation. Snapshots can be restored quickly; however, restoring cannot always be done (e.g. …

Secure Azure Functions with Azure AD, Key Vault and VNETs. Then connect to Azure SQL using firewall rules and Managed Identity of Function.

A. Azure Functions Security - Introduction

Azure Functions is a popular tool to create small snippets of code that can execute simple tasks. Azure Functions can be triggered using queue triggers, HTTP triggers or timer triggers. A typical pattern of an Azure Function is as follows:

  • Init: Retrieve state from Storage Account
  • Request: Endpoint is called by another application/user
  • Processing: Data is processed using other Azure resources
  • Response: Result is replied to caller
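
The four stages above can be sketched in plain Python. This is an illustrative outline only and does not use the Azure Functions SDK: `load_state`, `handle_request` and the `fetch_rows` callback are hypothetical names, and the state and data source are stand-ins for a Storage Account and Azure SQL.

```python
import json
from typing import Callable, Dict, List


def load_state() -> Dict[str, str]:
    """Init: stand-in for retrieving state from a Storage Account."""
    return {"table": "customers"}  # hypothetical state


def handle_request(params: Dict[str, str],
                   state: Dict[str, str],
                   fetch_rows: Callable[[str, int], List[dict]]) -> Dict[str, object]:
    """Run the Request, Processing and Response stages for one call."""
    # Request: read the caller's input, falling back to a default limit.
    limit = int(params.get("limit", 10))
    # Processing: delegate to another resource (here a callback standing in for Azure SQL).
    rows = fetch_rows(state["table"], limit)
    # Response: reply the result to the caller as JSON.
    return {"status": 200, "body": json.dumps(rows)}


if __name__ == "__main__":
    def fake_fetch(table: str, limit: int) -> List[dict]:
        return [{"table": table, "row": i} for i in range(limit)]

    print(handle_request({"limit": "2"}, load_state(), fake_fetch))
```

Passing the data access in as a callback keeps the handler testable without a database connection.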

This pattern is depicted below, in which data is retrieved from Azure SQL and returned to the application/user. …

1. Introduction

Azure Data Factory (ADFv2) is a popular tool to orchestrate data ingestion from on-premises to cloud. In every ADFv2 pipeline, security is an important topic. Common security aspects are the following:

  • Azure Active Directory (AAD) access control to data and endpoints
  • Managed Identity (MI) to avoid key management processes
  • Virtual Network (VNET) isolation of data and endpoints

In the remainder of this blog, it is discussed how an ADFv2 pipeline can be secured using AAD, MI, VNETs and firewall rules. For more details on security of Azure Functions, see my other blog. …

1. Introduction

A lot of companies consider setting up an Enterprise Data Lake. The idea is to store data in a centralized repository. In this way, it becomes easier for teams to create business value with data. To prevent a Data Lake from becoming a Data Swamp with untrusted data, metadata is key. The following types of metadata are distinguished:

  • Business metadata: Data owner, data source, privacy level
  • Technical metadata: Schema name, table name, field name/type
  • Operational metadata: Timestamp, size of data, lineage
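
To make the three metadata types concrete, they could be modelled as a small record attached to each dataset. The sketch below is an assumption for illustration: the class and field names follow the bullets above but are not taken from the blog's solution, and the sample values are made up.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class BusinessMetadata:
    data_owner: str
    data_source: str
    privacy_level: str


@dataclass
class TechnicalMetadata:
    schema_name: str
    table_name: str
    fields: Dict[str, str]  # field name -> field type


@dataclass
class OperationalMetadata:
    timestamp: str        # e.g. ISO 8601 time of ingestion
    size_bytes: int
    lineage: List[str]    # upstream sources, oldest first


@dataclass
class DatasetMetadata:
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata

    def to_dict(self) -> dict:
        """Flatten to nested dictionaries, e.g. for storing as JSON next to the data."""
        return asdict(self)


if __name__ == "__main__":
    meta = DatasetMetadata(
        BusinessMetadata("sales team", "crm export", "confidential"),
        TechnicalMetadata("dbo", "customers", {"id": "int", "name": "string"}),
        OperationalMetadata("2020-01-01T00:00:00Z", 1024, ["crm", "raw zone"]),
    )
    print(meta.to_dict())
```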

In the remainder of this blog, it is discussed how an Azure Data Lake can be set up and how metadata is added. For more details on how to secure data orchestration in your Azure Data Lake, see my follow-up blog here. For a solution on how to prevent data loss in your Data Lake using snapshots and incremental backups, see this blog. …

1. Introduction

Edge Computing is a pattern in which part of the computation is done on decentralized edge devices, and it is a great way to extend cloud computing. Using this pattern, Artificial Intelligence (AI) models are trained in the cloud and deployed on the edge, which has the following advantages:

  • Speed when real-time decision making is needed and cloud compute would imply too much latency
  • Availability allowing the device to function offline in case of limited connectivity to the cloud
  • Reducing bandwidth when massive amounts of data are generated and filtering is done on the device to prevent that all bandwidth is…


René Bremer

Data Solution Architect @ Microsoft, working with Azure services such as ADFv2, ADLSgen2, Azure DevOps, Databricks, Function Apps and SQL. Opinions here are mine.
