Creating a data pipeline is one thing; bringing it into production is another. This is especially true for a modern data pipeline in which multiple services are used for advanced analytics. Examples include transforming unstructured data into structured data, training ML models and embedding OCR. Integrating multiple services can be complicated, and deployment to production has to be controlled. In this blog, an example project is provided as follows:
The code of the project can be found here; the steps of the modern data pipeline are depicted below. …
The Microsoft identity platform is key to securing access to your web app. Users can authenticate to your app using their Azure AD identity or social accounts. The authorization model can be used to grant permissions to your backend API or to standard APIs such as Microsoft Graph. In this blog, a web application is discussed that does the following:
The code of the project can be found here; the architecture is depicted below. …
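To make the authorization model concrete, below is a minimal sketch of inspecting the claims in an Azure AD bearer token. This is illustration only: it decodes the payload without verifying the signature, which a real app must do against the Azure AD signing keys (e.g. with MSAL or PyJWT). The audience value and role name are made-up examples.

```python
import base64
import json

def decode_claims(jwt_token: str) -> dict:
    """Decode the payload (claims) of a JWT WITHOUT verifying the signature.

    NOTE: for illustration only -- a real app must verify the signature
    against the Azure AD signing keys (e.g. with MSAL or PyJWT).
    """
    payload_b64 = jwt_token.split(".")[1]
    # Re-add the base64 padding that JWT encoding strips
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def is_authorized(claims: dict, expected_audience: str, required_role: str) -> bool:
    # 'aud' must match your API's app ID URI; 'roles' carries granted app permissions
    return claims.get("aud") == expected_audience and required_role in claims.get("roles", [])

# Build a fake token for demonstration (header.payload.signature)
fake_payload = base64.urlsafe_b64encode(json.dumps(
    {"aud": "api://my-backend-app", "roles": ["Data.Read"]}
).encode()).rstrip(b"=").decode()
token = f"eyJhbGciOiJSUzI1NiJ9.{fake_payload}.fake-signature"

claims = decode_claims(token)
print(is_authorized(claims, "api://my-backend-app", "Data.Read"))  # True
```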
Python Flask is a popular tool to create web applications. Using Azure AD, users can authenticate to the REST APIs and retrieve data from Azure SQL. In this blog, a sample Python web application is created as follows:
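A minimal sketch of such a Flask REST API is shown below. The route name and returned rows are made up, and the Azure SQL call is stubbed out since it needs a live database; in the real app the stub would use pyodbc with an Azure AD access token.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def fetch_customers():
    """Stub for the Azure SQL call. In the real app this would use pyodbc,
    roughly like:

        conn = pyodbc.connect(
            "Driver={ODBC Driver 18 for SQL Server};"
            "Server=tcp:<server>.database.windows.net,1433;Database=<db>;",
            attrs_before={1256: token_struct}  # Azure AD access token
        )
    """
    return [{"id": 1, "name": "Contoso"}, {"id": 2, "name": "Fabrikam"}]

@app.route("/api/customers")
def customers():
    # In production, validate the Azure AD bearer token here before serving data
    return jsonify(fetch_customers())

# Locally: app.run() would start the development server
```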
Azure Cosmos DB is a fully managed multi-model database service. It enables you to build highly responsive applications worldwide. As part of Cosmos DB, the Gremlin API is supported for graph databases. Since Cosmos DB is optimized for fast transactional processing (OLTP), traversal limits may apply for heavy analytical workloads (OLAP). In that case, Azure Databricks and GraphFrames can be used as an alternative to do advanced analytics; see also the architecture below.
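To illustrate the difference: GraphFrames itself runs on a Spark cluster, so as a conceptual stand-in (not the GraphFrames API) the snippet below runs connected components over a toy graph in plain Python. This is exactly the kind of unbounded, whole-graph traversal that exceeds OLTP traversal limits and belongs in a Databricks/GraphFrames job; the vertices and edges are made up.

```python
from collections import defaultdict, deque

# Toy graph as an edge list (in GraphFrames this would be a Spark DataFrame of edges)
edges = [("a", "b"), ("b", "c"), ("d", "e")]

adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)

def connected_components(vertices):
    """BFS over the whole graph -- the kind of unbounded traversal that is
    better done as an analytical (OLAP) job than against an OLTP store."""
    component, seen = {}, set()
    for v in vertices:
        if v in seen:
            continue
        queue = deque([v])
        seen.add(v)
        while queue:
            node = queue.popleft()
            component[node] = v  # label the component by its first-visited vertex
            for nbr in adjacency[node] - seen:
                seen.add(nbr)
                queue.append(nbr)
    return component

print(connected_components(["a", "b", "c", "d", "e"]))
# {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd', 'e': 'd'}
```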
In the remainder of this blog, the following is done:
Selenium is the standard tool for automated web browser testing. On top of that, Selenium is a popular tool for web scraping. When creating a web scraper in Azure, Azure Functions is a logical candidate to run your code in. However, the default Azure Functions image does not contain the dependencies that Selenium requires. In this blog, a web scraper in Azure Functions is created as follows:
The architecture of the web scraper is depicted below.
In the remainder, the steps are discussed to deploy and run your web scraper in Azure Functions. For details on how to secure your Azure Functions, see this blog. For details on how to create a custom Docker image with OpenCV in Azure Functions, see here and the Dockerfile here. …
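As a rough idea of what such a custom image looks like, below is a sketch of a Dockerfile that adds a headless browser and driver on top of the Azure Functions Python base image. The base image tag and the apt package names are assumptions; adjust them to your Functions runtime and browser of choice.

```dockerfile
# Sketch of a custom image for running Selenium in Azure Functions.
# Base image tag and package names are assumptions -- adjust to your runtime.
FROM mcr.microsoft.com/azure-functions/python:4-python3.11

# Install a headless browser and a matching driver for Selenium
RUN apt-get update && \
    apt-get install -y --no-install-recommends chromium chromium-driver && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /
RUN pip install -r /requirements.txt   # requirements.txt includes selenium

COPY . /home/site/wwwroot
```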
Azure Storage always stores multiple copies of your data. When Geo-redundant Storage (GRS) is used, the data is also replicated to the paired region. This way, GRS prevents data loss in case of a disaster. However, GRS cannot prevent data loss when application errors corrupt the data: corrupted data is then simply replicated to the other zones/regions. In that case, a backup is needed to restore your data. Two backup strategies are as follows:
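The incremental side of such a strategy can be sketched in a few lines: only copy blobs whose content hash changed since the last run. The snippet below is a conceptual stand-in using plain dicts for containers; in Azure the copy step would be a blob copy or snapshot (e.g. via the azure-storage-blob SDK or azcopy), and the file names are made up.

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def incremental_backup(source: dict, backup: dict, manifest: dict) -> list:
    """Copy only blobs whose content changed since the previous run.

    source/backup are {blob_name: bytes} stand-ins for storage containers;
    manifest records the content hash seen at the previous backup.
    """
    copied = []
    for name, data in source.items():
        digest = content_hash(data)
        if manifest.get(name) != digest:        # new or modified blob
            backup[name] = data                 # in Azure: copy blob / create snapshot
            manifest[name] = digest
            copied.append(name)
    return copied

source = {"a.csv": b"1,2,3", "b.csv": b"x,y"}
backup, manifest = {}, {}
print(incremental_backup(source, backup, manifest))  # ['a.csv', 'b.csv']  (first full run)
source["a.csv"] = b"1,2,3,4"                         # simulate a change/corruption
print(incremental_backup(source, backup, manifest))  # ['a.csv']           (incremental run)
```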
Secure Azure Functions with Azure AD, Key Vault and VNETs. Then connect to Azure SQL using firewall rules and the Managed Identity of the Function.
Azure Functions is a popular tool to create small snippets of code that can execute simple tasks. Azure Functions can be triggered using queue triggers, HTTP triggers or time triggers. A typical pattern of an Azure Function is as follows:
The pattern is depicted below, in which data is retrieved from Azure SQL and returned to the application/user. …
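The pattern can be sketched framework-agnostically as: parse the request, query Azure SQL, return JSON. In a real Function the handler would use the azure.functions signature (`main(req: func.HttpRequest) -> func.HttpResponse`) and the query would run over pyodbc with the Function's Managed Identity; both are stubbed here, and the table contents are made up.

```python
import json

def query_azure_sql(customer_id: int) -> dict:
    """Stub for the Azure SQL lookup. In a real Function this would acquire a
    token via the Managed Identity endpoint and query over pyodbc."""
    fake_table = {1: {"id": 1, "name": "Contoso"}}
    return fake_table.get(customer_id, {})

def main(req: dict) -> dict:
    """HTTP-trigger pattern: parse request -> query Azure SQL -> return JSON.
    (With azure.functions: main(req: func.HttpRequest) -> func.HttpResponse.)"""
    try:
        customer_id = int(req.get("params", {}).get("id", ""))
    except ValueError:
        return {"status": 400, "body": json.dumps({"error": "invalid id"})}
    row = query_azure_sql(customer_id)
    if not row:
        return {"status": 404, "body": json.dumps({"error": "not found"})}
    return {"status": 200, "body": json.dumps(row)}

print(main({"params": {"id": "1"}}))
```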
Azure Data Factory (ADFv2) is a popular tool to orchestrate data ingestion from on-premises sources to the cloud. In every ADFv2 pipeline, security is an important topic. Common security aspects are the following:
In the remainder of this blog, it is discussed how an ADFv2 pipeline can be secured using AAD, MI, VNETs and firewall rules. For more details on security of Azure Functions, see my other blog. …
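One common step in this setup is granting the data factory's Managed Identity access inside Azure SQL. A sketch is shown below; `my-adfv2-factory` is a hypothetical name (the database user must match the name of the resource whose Managed Identity is granted), and the statements are run in the target database by an Azure AD admin.

```sql
-- Run in the target database as an Azure AD admin.
-- 'my-adfv2-factory' is a hypothetical data factory name; the user name
-- must match the name of the resource whose Managed Identity you grant.
CREATE USER [my-adfv2-factory] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [my-adfv2-factory];
ALTER ROLE db_datawriter ADD MEMBER [my-adfv2-factory];
```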
A lot of companies consider setting up an Enterprise Data Lake. The idea is to store data in a centralized repository. In this way, it becomes easier for teams to create business value with data. To prevent a Data Lake from becoming a Data Swamp with untrusted data, metadata is key. In this blog, the following types of metadata are distinguished:
In the remainder of this blog, it is discussed how an Azure Data Lake can be set up and how metadata is added. For more details on how to secure data orchestration in your Azure Data Lake, see my follow-up blog here. For a solution on how to prevent data loss in your Data Lake using snapshots and incremental backups, see this blog. …
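Whatever metadata taxonomy is chosen, attaching it at ingest time might look like the sketch below. The field names are illustrative assumptions, not the blog's own metadata model; align them with your own taxonomy.

```python
import datetime
import hashlib

def build_metadata(path: str, data: bytes, owner: str, source_system: str) -> dict:
    """Attach metadata when a file lands in the lake, so datasets stay
    discoverable and trusted. Field names here are illustrative only."""
    return {
        "path": path,
        "owner": owner,                      # business metadata: who is accountable
        "source_system": source_system,      # lineage: where the data came from
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "size_bytes": len(data),             # technical metadata
        "content_md5": hashlib.md5(data).hexdigest(),  # integrity check
    }

meta = build_metadata("raw/sales/2024/01.csv", b"id,amount\n1,10\n", "sales-team", "erp")
print(meta["owner"], meta["size_bytes"])
```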
Edge Computing is a pattern in which part of the computation is done on decentralized edge devices, and it is a great way to extend cloud computing. Using this pattern, Artificial Intelligence (AI) models are trained in the cloud and deployed on the edge, which has the following advantages:
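The train-in-cloud/score-on-edge split can be sketched very simply: the cloud training job is reduced to a set of weights that is shipped to the device, where scoring happens locally without a network round-trip. The weights below are made-up values; in practice the model would be exported (e.g. to ONNX) and deployed via an edge runtime.

```python
# Weights as they might arrive from cloud training (made-up values);
# in practice the model is exported, e.g. to ONNX, and shipped to the device.
model = {"weights": [0.8, -0.5], "bias": 0.1, "threshold": 0.0}

def predict_on_edge(features: list) -> int:
    """Score locally on the edge device: no network round-trip, works offline,
    and raw sensor data never has to leave the device."""
    score = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1 if score > model["threshold"] else 0

print(predict_on_edge([1.0, 0.2]))  # 1  (0.1 + 0.8 - 0.1 = 0.8 > 0)
```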