feat: Initial feature-engineering-on-fabric single-tech sample check-in (#652)

* feat: initial feature-eng-on-fabric single-tech sample check-in

* doc: resize screenshots with minor contents updates

* update for data source landing

* code: update model training notebook

* update for data loading base url and relative path

* Apply suggestions from code review (Nov. 13th)

Apply current suggestions from code review by Anuj on Nov. 13th

Co-authored-by: Anuj Parashar <promisinganuj@gmail.com>

* remove App service code and update images

* for simplicity remove option 1 and send it as a footnote info no details

* fix: use Fabric workspace & lakehouse id in Purview qualified names, with minor README updates

* Fixing some linking errors

* Updated introduction and architecture description

* Updated environment setup details

* Updated 'Source Dataset' section

* Updated 'Data Activity' section

* Updated 'Data Activity' section

* Updated 'Data Activity' section

* Updated 'Data Activity' section

* doc: add contents on verifying lineage in Purview

* doc: add missed bullet to Contents table

* Updating Lineage section

* Updating Lineage section

* Updating Lineage section

* Updating Lineage section

* Updating Lineage section

* Updating 'Required resources' header

* Fixing URL checks

* Fixing URL checks

---------

Co-authored-by: Bo Wang <sonic1abc@hotmail.com>
Co-authored-by: Chenshu Cai <cchenshu5@gmail.com>
Co-authored-by: Anuj Parashar <promisinganuj@gmail.com>
4 people authored Nov 17, 2023
1 parent e150709 commit c45d186
Showing 81 changed files with 570 additions and 83 deletions.
15 changes: 15 additions & 0 deletions .markdownlinkcheck.json
@@ -8,6 +8,21 @@
},
{
"pattern": "^https://sigdelta.com/blog/how-to-install-pyspark-locally/"
},
{
"pattern": "^https://stmdwpublic.blob.core.windows.net/"
},
{
"pattern": "^https://azure.microsoft.com/en-us/free/"
},
{
"pattern": "^https://azure.microsoft.com/en-us/products/data-factory/"
},
{
"pattern": "^https://dev.azure.com"
},
{
"pattern": "^https://azure.microsoft.com/en-us/free/databricks/"
}
]
}
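The `ignorePatterns` entries added above are anchored regular expressions; markdown-link-check tests each link against them before issuing an HTTP request and skips any link that matches. A minimal sketch of that filtering logic follows — the pattern list is copied from the config above, while the `should_check` helper is an illustrative assumption, not the tool's actual implementation:

```python
import re

# Patterns copied from the ignorePatterns additions in .markdownlinkcheck.json
IGNORE_PATTERNS = [
    r"^https://stmdwpublic.blob.core.windows.net/",
    r"^https://azure.microsoft.com/en-us/free/",
    r"^https://azure.microsoft.com/en-us/products/data-factory/",
    r"^https://dev.azure.com",
    r"^https://azure.microsoft.com/en-us/free/databricks/",
]

def should_check(url: str) -> bool:
    """Return True if the link checker should issue a request for this URL."""
    return not any(re.search(p, url) for p in IGNORE_PATTERNS)

print(should_check("https://dev.azure.com/my-org/my-project"))  # False (ignored)
print(should_check("https://github.com/microsoft/presidio"))    # True (checked)
```

In the real tool the list is read from the JSON file passed via `markdown-link-check -c .markdownlinkcheck.json`. Note that the unescaped dots in the patterns match any character; that is harmless here since the intent is literal prefix matching.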
2 changes: 2 additions & 0 deletions .markdownlint.yml
@@ -1 +1,3 @@
MD013: false
MD024: false
MD037: false
4 changes: 2 additions & 2 deletions README.md
@@ -18,7 +18,7 @@ description: "Code samples showcasing how to apply DevOps concepts to the Modern

# DataOps for the Modern Data Warehouse

This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on [Microsoft Azure](https://azure.microsoft.com/en-au/).
This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on Microsoft Azure.

The samples are either focused on a single Azure service (**Single Tech Samples**) or showcase an end-to-end data pipeline solution as a reference implementation (**End to End Samples**). Each sample contains code and artifacts relating to one or more of the following

@@ -54,7 +54,7 @@ The samples are either focused on a single azure service (**Single Tech Samples*
- [**Temperature Events Solution**](e2e_samples/temperature_events) - This demonstrates a high-scale event-driven data pipeline with a focus on how to implement Observability and Load Testing.
![Architecture](e2e_samples/temperature_events/images/temperature-events-architecture.png?raw=true "Architecture")
- [**Dataset Versioning Solution**](e2e_samples/dataset_versioning) - This demonstrates how to use DataFactory to Orchestrate DataFlow, to do DeltaLoads into DeltaLake On DataLake(DoDDDoD).
- [**MDW Data Governance and PII data detection**](e2e_samples/mdw_governance) - This sample demonstrates how to deploy the Infrastructure of an end-to-end MDW Pipeline using [Azure DevOps pipelines](https://azure.microsoft.com/en-au/services/devops/pipelines/) along with a focus around Data Governance and PII data detection.
- [**MDW Data Governance and PII data detection**](e2e_samples/mdw_governance) - This sample demonstrates how to deploy the Infrastructure of an end-to-end MDW Pipeline using [Azure DevOps pipelines](https://azure.microsoft.com/en-us/products/devops/pipelines/) along with a focus around Data Governance and PII data detection.
- *Technology stack*: Azure DevOps, Azure Data Factory, Azure Databricks, Azure Purview, [Presidio](https://github.com/microsoft/presidio)

## Contributing
2 changes: 1 addition & 1 deletion e2e_samples/dataset_versioning/databricks/README.md
@@ -37,7 +37,7 @@ The [official doc](https://docs.microsoft.com/en-us/azure/databricks/security/da
1. Navigating to your Storage account in the Azure Portal then clicking on `containers` -> `container(datalake)` -> `Manage ACL`
2. Add your READ and EXECUTE permission and click save.
3. [**Optional**] In case you have any existing files in the Data Lake container, you may need to propagate ACL permissions.
1. Open up [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/)
1. Open up [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/products/storage/storage-explorer/)
2. Navigate to the storage account and right click on container to select **propagate access control list**.
> Propagate access control list cannot be found: Try updating azure storage explorer to the latest version.
2 changes: 1 addition & 1 deletion e2e_samples/deployment_stamps/README.md
@@ -66,7 +66,7 @@ The same follow applies to Stamp2 and Stamp3 too. In this current version of sam

### Prerequisites

1. [Azure DevOps account](https://dev.azure.com/)
1. [Azure DevOps account](https://azure.microsoft.com/en-us/products/devops/)
2. [Azure Account](https://azure.microsoft.com/en-us/free/)
*Permissions needed*: ability to create and deploy to an azure resource group, a service principal, and grant the collaborator role to the service principal over the resource group; ability to manage Azure AD to create App registration, Users, Groups and Enterprise App Registration.

28 changes: 14 additions & 14 deletions e2e_samples/mdw_governance/README.md
@@ -33,18 +33,18 @@ The following shows the architecture of the solution.

### Technologies used

- [Azure Purview](https://azure.microsoft.com/en-au/services/devops/)
- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/)
- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
- [Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/)
- [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/)
- [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview)
- [Office365 API Connection](https://docs.microsoft.com/en-us/azure/connectors/connectors-create-api-office365-outlook)
- [Azure Virtual Network](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview)
- [Private Endpoint](https://docs.microsoft.com/en-us/azure/private-link/private-endpoint-overview)
- [Azure Function](https://docs.microsoft.com/en-us/azure/azure-functions/)
- [Azure Logic App](https://azure.microsoft.com/en-us/services/logic-apps/)
- [Azure Private DNS](https://docs.microsoft.com/en-us/azure/dns/private-dns-overview)
- [Azure Purview](https://azure.microsoft.com/en-us/products/purview/)
- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/)
- [Azure Data Lake Gen2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction)
- [Azure Databricks](https://docs.microsoft.com/azure/databricks/)
- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/)
- [Application Insights](https://docs.microsoft.com/azure/azure-monitor/app/app-insights-overview)
- [Office365 API Connection](https://docs.microsoft.com/azure/connectors/connectors-create-api-office365-outlook)
- [Azure Virtual Network](https://docs.microsoft.com/azure/virtual-network/virtual-networks-overview)
- [Private Endpoint](https://docs.microsoft.com/azure/private-link/private-endpoint-overview)
- [Azure Function](https://docs.microsoft.com/azure/azure-functions/)
- [Azure Logic App](https://azure.microsoft.com/products/logic-apps/)
- [Azure Private DNS](https://docs.microsoft.com/azure/dns/private-dns-overview)

## Key Learnings

@@ -80,8 +80,8 @@ Each environment has an identical set of resources

### Prerequisites

1. [Azure DevOps account](https://dev.azure.com/)
2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA)
1. [Azure DevOps account](https://azure.microsoft.com/en-us/products/devops/)
2. [Azure Account](https://azure.microsoft.com/en-us/free/)
- *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group.

### Setup and Deployment
16 changes: 8 additions & 8 deletions e2e_samples/parking_sensors/README.md
@@ -58,7 +58,7 @@ The sample demonstrate how DevOps principles can be applied end to end Data Pipe

## Solution Overview

The solution pulls near realtime [Melbourne Parking Sensor data](https://www.melbourne.vic.gov.au/about-council/governance-transparency/open-data/Pages/on-street-parking-data.aspx) from a publicly available REST api endpoint and saves this to [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction). It then validates, cleanses, and transforms the data to a known schema using [Azure Databricks](https://azure.microsoft.com/en-au/services/databricks/). A second Azure Databricks job then transforms these into a [Star Schema](https://en.wikipedia.org/wiki/Star_schema) which are then loaded into [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/en-au/services/synapse-analytics/) using [Polybase](https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15). The entire pipeline is orchestrated with [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/).
The solution pulls near realtime [Melbourne Parking Sensor data](https://www.melbourne.vic.gov.au/about-council/governance-transparency/open-data/Pages/on-street-parking-data.aspx) from a publicly available REST api endpoint and saves this to [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction). It then validates, cleanses, and transforms the data to a known schema using [Azure Databricks](https://azure.microsoft.com/en-us/products/databricks/). A second Azure Databricks job then transforms these into a [Star Schema](https://en.wikipedia.org/wiki/Star_schema) which are then loaded into [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/products/synapse-analytics/) using [Polybase](https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15). The entire pipeline is orchestrated with [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/).
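The validate, cleanse, and transform steps described above can be sketched in plain Python. The field names, types, and rules below are illustrative assumptions for the Melbourne parking-sensor feed, not the sample's actual Databricks notebook code:

```python
# Hypothetical raw records as pulled from the REST endpoint (all strings);
# the third row is invalid because it is missing the mandatory key.
RAW = [
    {"bay_id": "101", "status": "Present",    "lat": "-37.81", "lon": "144.96"},
    {"bay_id": "102", "status": "Unoccupied", "lat": "-37.82", "lon": "144.97"},
    {"bay_id": None,  "status": "Present",    "lat": "-37.80", "lon": "144.95"},
]

def validate(rows):
    """Keep only rows that carry the mandatory key (an assumed rule)."""
    return [r for r in rows if r["bay_id"] is not None]

def cleanse(rows):
    """Cast string fields into the known schema's types."""
    return [
        {"bay_id": int(r["bay_id"]),
         "occupied": r["status"] == "Present",
         "lat": float(r["lat"]),
         "lon": float(r["lon"])}
        for r in rows
    ]

clean = cleanse(validate(RAW))
print(len(clean))  # 2 valid rows survive
```

In the actual sample these steps run as Azure Databricks jobs over Data Lake files rather than in-memory lists, but the validate-then-cast shape is the same.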

### Architecture

@@ -82,13 +82,13 @@ See [here](#build-and-release-pipeline) for details.

It makes use of the following azure services:

- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/)
- [Azure Databricks](https://azure.microsoft.com/en-au/services/databricks/)
- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/)
- [Azure Databricks](https://azure.microsoft.com/en-us/products/databricks/)
- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
- [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/en-au/services/synapse-analytics/)
- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/)
- [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/products/synapse-analytics/)
- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/)
- [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview)
- [PowerBI](https://powerbi.microsoft.com/en-us/)
- [PowerBI](https://www.microsoft.com/en-us/power-platform/products/power-bi/)

For a detailed walk-through of the solution and key concepts, watch the following video recording:

@@ -212,9 +212,9 @@ More resources:
### Prerequisites

1. [Github account](https://github.com/)
2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA)
2. [Azure Account](https://azure.microsoft.com/en-us/free/)
- *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group.
3. [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/)
3. [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/)
- *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml).

#### Software pre-requisites if you don't use dev container<!-- omit in toc -->
10 changes: 5 additions & 5 deletions e2e_samples/parking_sensors_synapse/README.md
@@ -59,10 +59,10 @@ See [here](#build-and-release-pipeline) for details.

It makes use of the following azure services:

- [Azure Synapse Analytics](https://azure.microsoft.com/en-au/services/synapse-analytics/)
- [Azure Synapse Analytics](https://azure.microsoft.com/products/synapse-analytics/)
- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/)
- [PowerBI](https://powerbi.microsoft.com/en-us/)
- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/)
- [PowerBI](https://www.microsoft.com/en-us/power-platform/products/power-bi/)
- [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview)
- [Log Analytics](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-overview)

@@ -166,9 +166,9 @@ Please check the details [here](docs/observability.md).
### Prerequisites

1. [Github account](https://github.com/)
2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA)
2. [Azure Account](https://azure.microsoft.com/en-us/free/)
- *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group.
3. [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/)
3. [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/)
- *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). Ability to install Azure DevOps extensions (unless the required [Synapse extension](https://marketplace.visualstudio.com/items?itemName=AzureSynapseWorkspace.synapsecicd-deploy) is already installed).

#### Software pre-requisites if you don't use dev container<!-- omit in toc -->
12 changes: 6 additions & 6 deletions e2e_samples/temperature_events/README.md
@@ -28,7 +28,7 @@ The sample demonstrates how Events can be processed in a streaming serverless pi
- [Azure Function logic](#azure-function-logic)
- [Device Filter](#device-filter)
- [Temperature Filter](#temperature-filter)
- [EventHub & Azure Function scaling](#eventhub--azure-function-scaling)
- [EventHub \& Azure Function scaling](#eventhub--azure-function-scaling)
- [Infrastructure as Code](#infrastructure-as-code)
- [Modularize Terraform](#modularize-terraform)
- [Isolation of Environment](#isolation-of-environment)
@@ -65,10 +65,10 @@ time | timestamp of value

It makes use of the following azure services:

- [Azure Event Hubs](https://azure.microsoft.com/en-us/services/event-hubs/)
- [Azure Event Hubs](https://azure.microsoft.com/en-us/products/event-hubs/)
- [Azure Functions](https://azure.microsoft.com/en-us/services/functions/)
- [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/)
- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/)
- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/)
- [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview)
- [Terraform](https://www.terraform.io/)

@@ -90,7 +90,7 @@ There are 3 major steps to running the sample. Follow each sub-page in order:
- [Github account](https://github.com/) [Optional]
- [Azure Account](https://azure.microsoft.com/en-au/free/)
- *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group.
- [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/)
- [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/)
- *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml).
- Software
- [Azure CLI 2.18+](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
@@ -344,7 +344,7 @@ The load testing script allows you to quickly generate load against your infrast
Getting started resources:

- [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/)
- [Azure Well Architected Framework - Performance testing](https://docs.microsoft.com/en-us/azure/architecture/framework/scalability/performance-test)
- [Azure Well Architected Framework - Performance testing](https://learn.microsoft.com/en-us/azure/well-architected/performance-efficiency/performance-test)
- [Load test for real time data processing | by Masayuki Ota](https://masayukiota.medium.com/load-test-for-real-time-data-processing-30a256a994ce)

### Load testing architecture
@@ -354,7 +354,7 @@ The load testing can be invoked by the [IoTSimulator.ps1](./loadtesting/IoTSimul
The script will orchestrate the test by automating these steps:

- Create or use an existing Resource Group
- Create an [Azure Container Instances](https://azure.microsoft.com/en-us/services/container-instances/) resource.
- Create an [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/) resource.
- Deploy the [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/) container, and scale it to the number of instances passed in as a parameter.
- Coordinate them to simulate the specified number of devices, and send a certain number of messages.
- Wait for it to complete.
