- Introduction
- Prerequisites
- Azure-Storage-Solutions
- Data-Orchestration
- Databricks
- DeltaLake
- mlflow
- Data Sources
- Technology
- Contributors
Project Outline:
A redo of the Perth City Properties project using Azure data engineering technologies such as Azure Data Factory (ADF), Azure Data Lake Storage Gen2, Azure Blob Storage, and Azure Databricks.
In this project I'd like to:
- Add data orchestration using Azure Data Factory
- Perform data ingestion and transformation on the dataset using Databricks
- Implement ML models on Databricks Machine Learning, track changes to ML notebooks and models using MLflow, and then register the best model with the MLflow Model Registry
An Azure subscription
Creating an Azure Data Lake Storage Gen2 account and containers
Using Azure Storage Explorer to interact with the storage account
Uploading data into the raw folder
Access Control (IAM) role assignment
Integrating data from Azure Data Lake Storage Gen2 using Azure Data Factory
Creating dependencies between pipelines to orchestrate the data flow
I've created 3 pipelines.
- One runs the ingestion Databricks notebook, which picks up the raw data, creates the bronze table, and then ingests it into the silver table.
- One creates the gold table.
- One orchestrates the previous two pipelines, making sure the ingestion runs first and the transformation runs after it.
Branching and chaining activities in Azure Data Factory (ADF) pipelines using control flow activities such as Get Metadata, If Condition, ForEach, Delete, and Validation
Using parameters and variables in pipelines, datasets, and linked services to create metadata-driven pipelines in Azure Data Factory (ADF)
Debugging the data pipelines and resolving issues.
Scheduling pipelines using triggers, specifically the Tumbling Window trigger (for past-period datasets), in Azure Data Factory (ADF)
Creating ADF pipelines to execute Databricks Notebook activities to carry out transformations.
Enabling ADF Git integration
Creating Azure Databricks Workspace
Creating Databricks cluster
Mounting storage accounts using Azure Key Vault and Databricks secret scopes (see the mount sketch after this list)
Creating Databricks notebooks
Performing transformations using Databricks notebooks
Enabling Databricks Git integration
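A minimal sketch of that mount step, assuming a Key Vault-backed secret scope and a service principal; the scope name, secret keys, and mount point below are illustrative assumptions:

```python
# Databricks notebook cell: mount the raw container of the perthpropdl
# ADLS Gen2 account via OAuth. Scope and secret names are illustrative.
client_id = dbutils.secrets.get(scope="perth-kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="perth-kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="perth-kv-scope", key="sp-tenant-id")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the 'raw' container so notebooks can read it as /mnt/perthpropdl/raw.
dbutils.fs.mount(
    source="abfss://raw@perthpropdl.dfs.core.windows.net/",
    mount_point="/mnt/perthpropdl/raw",
    extra_configs=configs,
)
```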
I've built a pipeline that runs Databricks notebooks to read data into Delta tables. Using the function below, it checks whether the data needs to be merged or inserted.
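The original helper isn't reproduced here, but a minimal sketch of such a merge-or-insert function, assuming PySpark with the Delta Lake API and the notebook's default `spark` session (the table path, name, and merge condition are illustrative), could look like this:

```python
from delta.tables import DeltaTable

def upsert_to_delta(input_df, table_name, table_path, merge_condition):
    """Merge into the Delta table if it already exists, otherwise create it."""
    if DeltaTable.isDeltaTable(spark, table_path):
        target = DeltaTable.forPath(spark, table_path)
        (target.alias("tgt")
               .merge(input_df.alias("src"), merge_condition)
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())
    else:
        (input_df.write
                 .format("delta")
                 .option("path", table_path)
                 .saveAsTable(table_name))

# Illustrative call for the silver layer:
# upsert_to_delta(cleaned_df, "perth_silver",
#                 "/mnt/perthpropdl/silver/perth_silver",
#                 "tgt.property_id = src.property_id")
```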
In this project, the pipeline reads the data from the raw folder in perthpropdl (Azure Data Lake Storage Gen2) and creates the perth_bronze Delta table. The bronze table is then used in the ingestion notebook to create the perth_silver table, and finally the gold tables are created from perth_silver. In this project I didn't focus on the gold table or the data visualisation after it; I just wanted to show how it can be created in the pipeline.
Source -> Bronze
Bronze -> Silver
Silver -> Gold
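As a sketch of the first hop (Source -> Bronze), assuming the raw data lands as a CSV in the mounted raw folder (the file name and read options are illustrative):

```python
# Read the raw CSV from the mounted raw folder and persist it as the bronze Delta table.
raw_df = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("/mnt/perthpropdl/raw/perth_properties.csv"))

(raw_df.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("perth_bronze"))
```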
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. In this project I've used the following models to predict Perth property price ranges (a minimal MLflow tracking sketch follows the list):
- Linear Regression
- Lasso
- Ridge
- ElasticNet
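A minimal sketch of how one of these runs could be tracked with MLflow, assuming scikit-learn and an already prepared train/test split (the hyperparameters and run name are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# Assumes X_train, X_test, y_train, y_test have already been prepared.
with mlflow.start_run(run_name="elasticnet"):
    alpha, l1_ratio = 0.5, 0.5
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("r2", r2_score(y_test, preds))

    # Log the fitted model so it can later be registered from the experiment UI.
    mlflow.sklearn.log_model(model, "model")
```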
Again, the main purpose of this project was to show how to use MLflow, so there is room to improve the models.
The MLflow Model Registry is a centralized model repository, and I used it to register the best model based on the current R² in the experiment UI.
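The same registration can also be done programmatically; a small sketch, where the run ID placeholder and the registered model name are illustrative:

```python
import mlflow

# Run ID of the experiment run with the best R² (placeholder value).
best_run_id = "<best-run-id>"

# Register the model logged under that run's "model" artifact path.
mlflow.register_model(model_uri=f"runs:/{best_run_id}/model",
                      name="perth-property-price-model")
```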
http://house.speakingsame.com/
https://www.onthehouse.com.au/