diff --git a/.markdownlinkcheck.json b/.markdownlinkcheck.json index db0d8a9b9..7c87bce9e 100644 --- a/.markdownlinkcheck.json +++ b/.markdownlinkcheck.json @@ -8,6 +8,21 @@ }, { "pattern": "^https://sigdelta.com/blog/how-to-install-pyspark-locally/" + }, + { + "pattern": "^https://stmdwpublic.blob.core.windows.net/" + }, + { + "pattern": "^https://azure.microsoft.com/en-us/free/" + }, + { + "pattern": "^https://azure.microsoft.com/en-us/products/data-factory/" + }, + { + "pattern": "^https://dev.azure.com" + }, + { + "pattern": "^https://azure.microsoft.com/en-us/free/databricks/" } ] } diff --git a/.markdownlint.yml b/.markdownlint.yml index ff7d7cc82..30af73a58 100644 --- a/.markdownlint.yml +++ b/.markdownlint.yml @@ -1 +1,3 @@ MD013: false +MD024: false +MD037: false diff --git a/README.md b/README.md index f87e3dd8b..7b1ea81a2 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ description: "Code samples showcasing how to apply DevOps concepts to the Modern # DataOps for the Modern Data Warehouse -This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on [Microsoft Azure](https://azure.microsoft.com/en-au/). +This repository contains numerous code samples and artifacts on how to apply DevOps principles to data pipelines built according to the [Modern Data Warehouse (MDW)](https://azure.microsoft.com/en-au/solutions/architecture/modern-data-warehouse/) architectural pattern on Microsoft Azure. The samples are either focused on a single azure service (**Single Tech Samples**) or showcases an end to end data pipeline solution as a reference implementation (**End to End Samples**). 
Each sample contains code and artifacts relating one or more of the following @@ -54,7 +54,7 @@ The samples are either focused on a single azure service (**Single Tech Samples* - [**Temperature Events Solution**](e2e_samples/temperature_events) - This demonstrate a high-scale event-driven data pipeline with a focus on how to implement Observability and Load Testing. ![Architecture](e2e_samples/temperature_events/images/temperature-events-architecture.png?raw=true "Architecture") - [**Dataset Versioning Solution**](e2e_samples/dataset_versioning) - This demonstrates how to use DataFactory to Orchestrate DataFlow, to do DeltaLoads into DeltaLake On DataLake(DoDDDoD). -- [**MDW Data Governance and PII data detection**](e2e_samples/mdw_governance) - This sample demonstrates how to deploy the Infrastructure of an end-to-end MDW Pipeline using [Azure DevOps pipelines](https://azure.microsoft.com/en-au/services/devops/pipelines/) along with a focus around Data Governance and PII data detection. +- [**MDW Data Governance and PII data detection**](e2e_samples/mdw_governance) - This sample demonstrates how to deploy the Infrastructure of an end-to-end MDW Pipeline using [Azure DevOps pipelines](https://azure.microsoft.com/en-us/products/devops/pipelines/) along with a focus around Data Governance and PII data detection. - *Technology stack*: Azure DevOps, Azure Data Factory, Azure Databricks, Azure Purview, [Presidio](https://github.com/microsoft/presidio) ## Contributing diff --git a/e2e_samples/dataset_versioning/databricks/README.md b/e2e_samples/dataset_versioning/databricks/README.md index 2f297c8a2..b7fd97c03 100644 --- a/e2e_samples/dataset_versioning/databricks/README.md +++ b/e2e_samples/dataset_versioning/databricks/README.md @@ -37,7 +37,7 @@ The [official doc](https://docs.microsoft.com/en-us/azure/databricks/security/da 1. Navigating to your Storage account in the Azure Portal then clicking on `containers` -> `container(datalake)` -> `Manage ACL` 2. 
Add your READ and EXECUTE permission and click save. 3. [**Optional**] In case you have any existing files in the Data Lake container, you may need to propogate ACL permissions. - 1. Open up [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/) + 1. Open up [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/products/storage/storage-explorer/) 2. Navigate to the storage account and right click on container to select **propagate access control list**. > Propagate access control list cannot be found: Try updating azure storage explorer to the latest version. diff --git a/e2e_samples/deployment_stamps/README.md b/e2e_samples/deployment_stamps/README.md index 8b7356389..bafad0a8c 100644 --- a/e2e_samples/deployment_stamps/README.md +++ b/e2e_samples/deployment_stamps/README.md @@ -66,7 +66,7 @@ The same follow applies to Stamp2 and Stamp3 too. In this current version of sam ### Prerequisites -1. [Azure DevOps account](https://dev.azure.com/) +1. [Azure DevOps account](https://azure.microsoft.com/en-us/products/devops/) 2. [Azure Account](https://azure.microsoft.com/en-us/free/) *Permissions needed*: ability to create and deploy to an azure resource group, a service principal, and grant the collaborator role to the service principal over the resource group; ability to manage Azure AD to create App registration, Users, Groups and Enterprise App Registration. diff --git a/e2e_samples/mdw_governance/README.md b/e2e_samples/mdw_governance/README.md index bf4b1e834..403011d73 100644 --- a/e2e_samples/mdw_governance/README.md +++ b/e2e_samples/mdw_governance/README.md @@ -33,18 +33,18 @@ The following shows the architecture of the solution. 
### Technologies used -- [Azure Purview](https://azure.microsoft.com/en-au/services/devops/) -- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/) -- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) -- [Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/) -- [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) -- [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) -- [Office365 API Connection](https://docs.microsoft.com/en-us/azure/connectors/connectors-create-api-office365-outlook) -- [Azure Virtual Network](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview) -- [Private Endpoint](https://docs.microsoft.com/en-us/azure/private-link/private-endpoint-overview) -- [Azure Function](https://docs.microsoft.com/en-us/azure/azure-functions/) -- [Azure Logic App](https://azure.microsoft.com/en-us/services/logic-apps/) -- [Azure Private DNS](https://docs.microsoft.com/en-us/azure/dns/private-dns-overview) +- [Azure Purview](https://azure.microsoft.com/en-us/products/purview/) +- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/) +- [Azure Data Lake Gen2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) +- [Azure Databricks](https://docs.microsoft.com/azure/databricks/) +- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) +- [Application Insights](https://docs.microsoft.com/azure/azure-monitor/app/app-insights-overview) +- [Office365 API Connection](https://docs.microsoft.com/azure/connectors/connectors-create-api-office365-outlook) +- [Azure Virtual Network](https://docs.microsoft.com/azure/virtual-network/virtual-networks-overview) +- [Private Endpoint](https://docs.microsoft.com/azure/private-link/private-endpoint-overview) +- [Azure 
Function](https://docs.microsoft.com/azure/azure-functions/) +- [Azure Logic App](https://azure.microsoft.com/products/logic-apps/) +- [Azure Private DNS](https://docs.microsoft.com/azure/dns/private-dns-overview) ## Key Learnings @@ -80,8 +80,8 @@ Each environment has an identical set of resources ### Prerequisites -1. [Azure DevOps account](https://dev.azure.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +1. [Azure DevOps account](https://azure.microsoft.com/en-us/products/devops/) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. 
### Setup and Deployment diff --git a/e2e_samples/parking_sensors/README.md b/e2e_samples/parking_sensors/README.md index 4c8d52d11..3a2dae57e 100644 --- a/e2e_samples/parking_sensors/README.md +++ b/e2e_samples/parking_sensors/README.md @@ -58,7 +58,7 @@ The sample demonstrate how DevOps principles can be applied end to end Data Pipe ## Solution Overview -The solution pulls near realtime [Melbourne Parking Sensor data](https://www.melbourne.vic.gov.au/about-council/governance-transparency/open-data/Pages/on-street-parking-data.aspx) from a publicly available REST api endpoint and saves this to [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction). It then validates, cleanses, and transforms the data to a known schema using [Azure Databricks](https://azure.microsoft.com/en-au/services/databricks/). A second Azure Databricks job then transforms these into a [Star Schema](https://en.wikipedia.org/wiki/Star_schema) which are then loaded into [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/en-au/services/synapse-analytics/) using [Polybase](https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15). The entire pipeline is orchestrated with [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/). +The solution pulls near realtime [Melbourne Parking Sensor data](https://www.melbourne.vic.gov.au/about-council/governance-transparency/open-data/Pages/on-street-parking-data.aspx) from a publicly available REST api endpoint and saves this to [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction). It then validates, cleanses, and transforms the data to a known schema using [Azure Databricks](https://azure.microsoft.com/en-us/products/databricks/). 
A second Azure Databricks job then transforms these into a [Star Schema](https://en.wikipedia.org/wiki/Star_schema) which are then loaded into [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/products/synapse-analytics/) using [Polybase](https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15). The entire pipeline is orchestrated with [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/). ### Architecture @@ -82,13 +82,13 @@ See [here](#build-and-release-pipeline) for details. It makes use of the following azure services: -- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/) -- [Azure Databricks](https://azure.microsoft.com/en-au/services/databricks/) +- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/) +- [Azure Databricks](https://azure.microsoft.com/en-us/products/databricks/) - [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) -- [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/en-au/services/synapse-analytics/) -- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/) +- [Azure Synapse Analytics (formerly SQLDW)](https://azure.microsoft.com/products/synapse-analytics/) +- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/) - [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) -- [PowerBI](https://powerbi.microsoft.com/en-us/) +- [PowerBI](https://www.microsoft.com/en-us/power-platform/products/power-bi/) For a detailed walk-through of the solution and key concepts, watch the following video recording: @@ -212,9 +212,9 @@ More resources: ### Prerequisites 1. [Github account](https://github.com/) -2. 
[Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. -3. [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/) +3. [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/) - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). #### Software pre-requisites if you don't use dev container diff --git a/e2e_samples/parking_sensors_synapse/README.md b/e2e_samples/parking_sensors_synapse/README.md index 001ab5980..5032efdde 100644 --- a/e2e_samples/parking_sensors_synapse/README.md +++ b/e2e_samples/parking_sensors_synapse/README.md @@ -59,10 +59,10 @@ See [here](#build-and-release-pipeline) for details. 
It makes use of the following azure services: -- [Azure Synapse Analytics](https://azure.microsoft.com/en-au/services/synapse-analytics/) +- [Azure Synapse Analytics](https://azure.microsoft.com/products/synapse-analytics/) - [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) -- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/) -- [PowerBI](https://powerbi.microsoft.com/en-us/) +- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/) +- [PowerBI](https://www.microsoft.com/en-us/power-platform/products/power-bi/) - [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) - [Log Analytics](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-overview) @@ -166,9 +166,9 @@ Please check the details [here](docs/observability.md). ### Prerequisites 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. -3. [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/) +3. 
[Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/) - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). Ability to install Azure DevOps extensions (unless the required [Synapse extension](https://marketplace.visualstudio.com/items?itemName=AzureSynapseWorkspace.synapsecicd-deploy) is already installed). #### Software pre-requisites if you don't use dev container diff --git a/e2e_samples/temperature_events/README.md b/e2e_samples/temperature_events/README.md index ba39b4065..4a7a5c06e 100644 --- a/e2e_samples/temperature_events/README.md +++ b/e2e_samples/temperature_events/README.md @@ -28,7 +28,7 @@ The sample demonstrates how Events can be processed in a streaming serverless pi - [Azure Function logic](#azure-function-logic) - [Device Filter](#device-filter) - [Temperature Filter](#temperature-filter) -- [EventHub & Azure Function scaling](#eventhub--azure-function-scaling) +- [EventHub \& Azure Function scaling](#eventhub--azure-function-scaling) - [Infrastructure as Code](#infrastructure-as-code) - [Modularize Terraform](#modularize-terraform) - [Isolation of Environment](#isolation-of-environment) @@ -65,10 +65,10 @@ time | timestamp of value It makes use of the following azure services: -- [Azure Event Hubs](https://azure.microsoft.com/en-us/services/event-hubs/) +- [Azure Event Hubs](https://azure.microsoft.com/en-us/products/event-hubs/) - [Azure Functions](https://azure.microsoft.com/en-us/services/functions/) - [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/) -- [Azure 
DevOps](https://azure.microsoft.com/en-au/services/devops/) +- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/) - [Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) - [Terraform](https://www.terraform.io/) @@ -90,7 +90,7 @@ There are 3 major steps to running the sample. Follow each sub-page in order: - [Github account](https://github.com/) [Optional] - [Azure Account](https://azure.microsoft.com/en-au/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. - - [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/) + - [Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/) - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). 
- Software - [Azure CLI 2.18+](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) @@ -344,7 +344,7 @@ The load testing script allows you to quickly generate load against your infrast Getting started resources: - [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/) -- [Azure Well Architected Framework - Performance testing](https://docs.microsoft.com/en-us/azure/architecture/framework/scalability/performance-test) +- [Azure Well Architected Framework - Performance testing](https://learn.microsoft.com/en-us/azure/well-architected/performance-efficiency/performance-test) - [Load test for real time data processing | by Masayuki Ota](https://masayukiota.medium.com/load-test-for-real-time-data-processing-30a256a994ce) ### Load testing architecture @@ -354,7 +354,7 @@ The load testing can be invoked by the [IoTSimulator.ps1](./loadtesting/IoTSimul The script will orchestrate the test by automating these steps: - Create or use an existing Resource Group -- Create an [Azure Container Instances](https://azure.microsoft.com/en-us/services/container-instances/) resource. +- Create an [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances/) resource. - Deploy the [Azure IoT Device Telemetry Simulator](https://github.com/Azure-Samples/Iot-Telemetry-Simulator/) container, and scale it to the number of instances passed in as a parameter. - Coordinate them to simulate the specified number of devices, and send a certain number of messages. - Wait for it to complete. 
diff --git a/single_tech_samples/azuresql/README.md b/single_tech_samples/azuresql/README.md index 4e8a2854e..54f5f48c0 100644 --- a/single_tech_samples/azuresql/README.md +++ b/single_tech_samples/azuresql/README.md @@ -1,6 +1,6 @@ # Azure SQL Database -[Azure SQL Database](https://azure.microsoft.com/en-au/services/sql-database/) is a relational database commonly used in the MDW architecture, typically in the serving layer. The following samples demonstrates how you might build CI/CD pipelines to deploy changes to +[Azure SQL Database](https://azure.microsoft.com/en-us/products/azure-sql/database/) is a relational database commonly used in the MDW architecture, typically in the serving layer. The following samples demonstrates how you might build CI/CD pipelines to deploy changes to Azure SQL Database. ## Contents @@ -76,9 +76,9 @@ The following are some sample [Azure DevOps](https://docs.microsoft.com/en-us/az ### Prerequisites 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/free) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. -3. [Azure DevOps Account](https://azure.microsoft.com/en-us/services/devops/) +3. 
[Azure DevOps Account](https://azure.microsoft.com/en-us/products/devops/) - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml) and [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml). #### Software Prerequisites diff --git a/single_tech_samples/databricks/sample1_basic_azure_databricks_environment/README.md b/single_tech_samples/databricks/sample1_basic_azure_databricks_environment/README.md index d4f8e6ff7..6f0ec6b05 100644 --- a/single_tech_samples/databricks/sample1_basic_azure_databricks_environment/README.md +++ b/single_tech_samples/databricks/sample1_basic_azure_databricks_environment/README.md @@ -55,9 +55,9 @@ Following are the cloud design patterns being used by this sample: The following technologies are used to build this sample: -- [Azure Databricks](https://azure.microsoft.com/en-au/free/databricks/) -- [Azure Storage](https://azure.microsoft.com/en-au/services/storage/data-lake-storage/) -- [Azure Key Vault](https://azure.microsoft.com/en-au/services/key-vault/) +- [Azure Databricks](https://azure.microsoft.com/en-us/free/databricks/) +- [Azure Storage](https://azure.microsoft.com/en-us/products/storage/data-lake-storage/) +- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) - [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/) - [Azure Resource Manager](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview) @@ -70,7 +70,7 @@ This section holds the information about usage instructions of this sample. The following are the prerequisites for deploying this sample : 1. [Github account](https://github.com/) -2. 
[Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: The ability to create and deploy to an Azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. - Active subscription with the following [resource providers](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-services-resource-providers) enabled: diff --git a/single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/README.md b/single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/README.md index 9b0629ed5..0db6bde8e 100644 --- a/single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/README.md +++ b/single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/README.md @@ -74,9 +74,9 @@ Following are the cloud design patterns being used by this sample: The following technologies are used to build this sample: -- [Azure Databricks](https://azure.microsoft.com/en-au/free/databricks/) -- [Azure Storage](https://azure.microsoft.com/en-au/services/storage/data-lake-storage/) -- [Azure Key Vault](https://azure.microsoft.com/en-au/services/key-vault/) +- [Azure Databricks](https://azure.microsoft.com/en-us/free/databricks/) +- [Azure 
Storage](https://azure.microsoft.com/en-us/products/storage/data-lake-storage/) +- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) - [Azure Virtual networks](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview) - [Azure Firewall](https://docs.microsoft.com/en-us/azure/firewall/overview) - [Azure Route tables](https://docs.microsoft.com/en-us/azure/virtual-network/manage-route-table) @@ -146,7 +146,7 @@ This section holds the information about usage instructions of this sample. The following are the prerequisites for deploying this sample : 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: The ability to create and deploy to an Azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. 
- Active subscription with the following [resource providers](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-services-resource-providers) enabled: diff --git a/single_tech_samples/databricks/sample4_ci_cd/README.md b/single_tech_samples/databricks/sample4_ci_cd/README.md index 18134d35b..7ec71342b 100644 --- a/single_tech_samples/databricks/sample4_ci_cd/README.md +++ b/single_tech_samples/databricks/sample4_ci_cd/README.md @@ -62,8 +62,8 @@ The below diagram illustrates the deployment process flow followed in this templ The following technologies are used to build this template: -- [Azure DevOps](https://azure.microsoft.com/en-us/services/devops/) -- [Azure Databricks](https://azure.microsoft.com/en-au/free/databricks/) +- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/) +- [Azure Databricks](https://azure.microsoft.com/en-us/free/databricks/) - [Azure Resource Manager](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview) - [nutter](https://github.com/microsoft/nutter) - [databricks cli](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/) diff --git a/single_tech_samples/dataexplorer/README.md b/single_tech_samples/dataexplorer/README.md index 497112cfc..2f31afc51 100644 --- a/single_tech_samples/dataexplorer/README.md +++ b/single_tech_samples/dataexplorer/README.md @@ -1,6 +1,6 @@ # Azure Data Explorer -[Azure Data Explorer](https://azure.microsoft.com/en-us/services/data-explorer/#features) is a fast, fully managed data analytics service for large volumes of data ingesting from applications, websites, IoT devices, and more. The following sample demonstrates how you can create CI/CD pipelines to provision Azure Data Explorer services and deploy database changes to it. 
+[Azure Data Explorer](https://azure.microsoft.com/en-us/products/data-explorer/#features) is a fast, fully managed data analytics service for large volumes of data ingesting from applications, websites, IoT devices, and more. The following sample demonstrates how you can create CI/CD pipelines to provision Azure Data Explorer services and deploy database changes to it. ## Contents diff --git a/single_tech_samples/datafactory/sample1_cicd/README.md b/single_tech_samples/datafactory/sample1_cicd/README.md index cd0049349..3041dc7c6 100644 --- a/single_tech_samples/datafactory/sample1_cicd/README.md +++ b/single_tech_samples/datafactory/sample1_cicd/README.md @@ -28,7 +28,7 @@ This sample demonstrates how to apply DevOps with Azure Data Factory (ADF) by so ## Solution Overview -This solution sets up an [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) storage account, with a folder structure that enables data tiering (bronze, silver, gold), and an [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/)(ADF) instance with linked services connecting to the data lake, to a separate file share and a [key vault](https://azure.microsoft.com/en-us/services/key-vault/) for secrets. +This solution sets up an [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) storage account, with a folder structure that enables data tiering (bronze, silver, gold), and an [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/)(ADF) instance with linked services connecting to the data lake, to a separate file share and a [key vault](https://azure.microsoft.com/en-us/services/key-vault/) for secrets. The Azure Data Factory contains a simple pipeline taking data from the file share and ingesting it to the bronze folder. 
@@ -51,10 +51,10 @@ The following shows the overall CI/CD process as built with Azure DevOps Pipelin ### Technologies used -- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/) -- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) -- [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) -- [Azure DevOps](https://azure.microsoft.com/en-au/services/devops/) +- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/) +- [Azure Data Lake Gen2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) +- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) +- [Azure DevOps](https://azure.microsoft.com/en-us/products/devops/) - [pytest-adf](https://github.com/devlace/pytest-adf) ## Key Learnings @@ -152,9 +152,9 @@ This kicks off the PROD CI/CD pipeline. ### Prerequisites 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. -3. [Azure DevOps Project](https://azure.microsoft.com/en-us/services/devops/) +3. 
[Azure DevOps Project](https://azure.microsoft.com/en-us/products/devops/) - *Permissions needed*: ability to create [service connections](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops&tabs=yaml), [pipelines](https://docs.microsoft.com/en-us/azure/devops/pipelines/get-started/pipelines-get-started?view=azure-devops&tabs=yaml) and [variable groups](https://docs.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=yaml). #### Software pre-requisites diff --git a/single_tech_samples/datafactory/sample3_data_pre_processing_with_azure_batch/README.md b/single_tech_samples/datafactory/sample3_data_pre_processing_with_azure_batch/README.md index 01c36ed35..8b1613928 100644 --- a/single_tech_samples/datafactory/sample3_data_pre_processing_with_azure_batch/README.md +++ b/single_tech_samples/datafactory/sample3_data_pre_processing_with_azure_batch/README.md @@ -6,11 +6,12 @@ - [Scope](#scope) - [Use Case](#use-case) - [Architecture](#architecture) + - [**Architecture Design Components**](#architecture-design-components) - [Technologies used](#technologies-used) - [How to use this sample](#how-to-use-this-sample) - [Prerequisites](#prerequisites) - [Software Prerequisites](#software-prerequisites) - - [Setup and deployment](#setup-and-deployment) + - [Setup and deployment](#setup-and-deployment) - [Deployment validation and Execution](#deployment-validation-and-execution) - [Clean-up](#clean-up) - [Resources](#resources) @@ -99,10 +100,10 @@ The high-level design is depicted in the diagram below, showcasing the integrati The following technologies are used to build this sample: -- [Azure Data Factory](https://azure.microsoft.com/products/data-factory/) +- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/) - [Azure Batch](https://azure.microsoft.com/en-us/products/batch) -- [Azure 
Storage(ADLS)](https://azure.microsoft.com/en-au/services/storage/data-lake-storage/) -- [NFS Mounts](https://learn.microsoft.com/en-us/azure/storage/blobs/network-file-system-protocol-support-how-to) +- [Azure Storage(ADLS)](https://azure.microsoft.com/services/storage/data-lake-storage/) +- [NFS Mounts](https://learn.microsoft.com/azure/storage/blobs/network-file-system-protocol-support-how-to) ## How to use this sample @@ -187,10 +188,10 @@ Please follow the steps in the [clean-up section](deploy/terraform/README.md#cle ### Resources -- [Data operations for autonomous vehicle operations](https://learn.microsoft.com/en-us/azure/architecture/example-scenario/automotive/autonomous-vehicle-operations-dataops) +- [Data operations for autonomous vehicle operations](https://learn.microsoft.com/azure/architecture/example-scenario/automotive/autonomous-vehicle-operations-dataops) -- [Parallel processing with Azure Batch.](https://learn.microsoft.com/en-us/azure/batch/batch-technical-overview#run-parallel-workloads) +- [Parallel processing with Azure Batch.](https://learn.microsoft.com/azure/batch/batch-technical-overview#run-parallel-workloads) -- [Autoscaling with Azure Batch](https://learn.microsoft.com/en-us/azure/batch/batch-automatic-scaling) +- [Autoscaling with Azure Batch](https://learn.microsoft.com/azure/batch/batch-automatic-scaling) -- [Running Azure Batch from Azure Data Factory(ADF)](https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory) +- [Running Azure Batch from Azure Data Factory(ADF)](https://learn.microsoft.com/azure/batch/tutorial-run-python-batch-azure-data-factory) diff --git a/single_tech_samples/fabric/README.md b/single_tech_samples/fabric/README.md new file mode 100644 index 000000000..6c2be9712 --- /dev/null +++ b/single_tech_samples/fabric/README.md @@ -0,0 +1,9 @@ +# Microsoft Fabric + +[Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview) is an all-in-one 
analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place. + +## Samples + +- [Feature Engineering on Microsoft Fabric](./feature_engineering_on_fabric/README.md) + + This sample demonstrates how to use Azure ML managed feature store and Microsoft Fabric to build a feature engineering system. It also shows how to track and monitor the data lineage of the features and the model training process using Microsoft Purview. The provided sample also encompasses data validation and exploratory data analysis (EDA) within Fabric notebooks. diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/README.md b/single_tech_samples/fabric/feature_engineering_on_fabric/README.md new file mode 100644 index 000000000..074e6407a --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/README.md @@ -0,0 +1,435 @@ +# Introduction + +[Feature engineering](https://learn.microsoft.com/azure/architecture/data-science-process/create-features#what-is-feature-engineering) is the process of selecting, transforming, or creating relevant features (variables) from raw data to improve the performance of a machine learning model. It involves extracting meaningful information, handling missing values, scaling, encoding categorical variables, and creating new features. Effective feature engineering enhances a model's ability to understand patterns, leading to better predictions and improved overall performance in data science and machine learning tasks. + +With the recent launch of Azure ML managed feature store, machine learning professionals are now empowered to develop and productionize features independently. 
+You simply provide a feature set specification and let the system handle serving, securing, and monitoring of your features, freeing you from the overhead of setting up and managing the underlying feature engineering pipelines. + +To learn more, read the [launch blog](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/announcing-managed-feature-store-in-azure-machine-learning/ba-p/3823043) and [Azure ML managed feature store documentation](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2). + +[Microsoft Fabric](https://learn.microsoft.com/fabric/get-started/microsoft-fabric-overview) is an all-in-one analytics solution for enterprises, and is an excellent choice for machine learning and data science tasks on Azure. The Data Science experience of Microsoft Fabric enables you to build, deploy, and operationalize machine learning models seamlessly within your Fabric experience. It integrates with Azure Machine Learning to provide built-in experiment tracking and model registry. + +To learn more, read [Data Science documentation in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-science/). + +This sample focuses on constructing a feature engineering system using Azure ML managed feature store and Microsoft Fabric. It delves into the tracking and monitoring of data lineage for these features through Microsoft Purview. The content includes a step-by-step guide for environment setup and running the demo, aiming to expedite the feature engineering process within the Azure ecosystem. 
+ +## Contents + +- [Architecture](#architecture) +- [Source dataset](#source-dataset) +- [Environment setup](#environment-setup) + - [Prerequisites](#prerequisites) + - [Required resources](#required-resources) + - [Microsoft Fabric setup](#microsoft-fabric-setup) +- [Data pipeline setup](#data-pipeline-setup) + - [Data landing activity](#data-landing-activity) + - [Notebook activities](#notebook-activities) +- [Execute and verify](#execute-and-verify) + - [Trigger the data pipeline](#trigger-the-data-pipeline) + - [Verify lineage in Purview](#verify-lineage-in-purview) + - [Data lineage](#data-lineage) + - [Feature lineage](#feature-lineage) + - [Verify the features in Feature Store](#verify-the-features-in-feature-store) +- [Model training and inferencing](#model-training-and-inferencing) + - [Model training](#model-training) + - [Model inferencing](#model-inferencing) + - [Verify lineage in Purview](#verify-lineage-in-purview-1) + - [Model training lineage](#model-training-lineage) +- [Conclusion](#conclusion) +- [References](#references) + +## Architecture + +Here is the high-level architecture diagram: + +![Architecture](./images/fs_architecture.png) + +This architecture utilizes Microsoft Fabric as the data analytics platform. A [data pipeline](https://learn.microsoft.com/training/modules/use-data-factory-pipelines-fabric/) lands, ingests, and transforms the incoming data. The transformed data is registered as [features](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2#what-are-features) in Azure ML managed feature store. These features are used for model training and inferencing. Concurrently, the data lineage of both the data pipelines and the features is tracked and monitored using Microsoft Purview.
+ +The sample follows a medallion architecture with `landing`, `staging` and `standard` zones created in the __File Section__ of a [lakehouse](https://learn.microsoft.com/fabric/data-engineering/lakehouse-overview) in [Fabric OneLake](https://learn.microsoft.com/fabric/onelake/onelake-overview). For data landing, 'ForEach' activity is used to download multiple files from a public blob storage. The rest of the processing (ingestion, transformation, feature registration, model training, and model inferencing) is done using Fabric 'data notebooks'. + +In addition to the main flow, there are optional steps for performing 'exploratory data analysis' and 'data validations' (illustrated by dotted lines in the diagram). These features are currently not covered as part of the step-by-step guide, but the notebooks are available in the repo for reference. + +## Source dataset + +The sample uses the public yellow taxi trip dataset from [New York City Taxi & Limousine Commission](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). However, we have hosted a subset of this dataset on our own public blob storage at the following location: + +Base URL: [https://stmdwpublic.blob.core.windows.net/](https://stmdwpublic.blob.core.windows.net/) + +This subset contains data for the year 2022, and each month is available as a separate parquet file. The data includes anonymized travel details like departure points, destinations, times, distances, and costs. The data, in conjunction with taxi zone maps and lookup tables, aids in various research fields such as identifying frequent pickup and drop-off zones in the city. + +## Environment setup + +### Prerequisites + +- Access to a [Microsoft Fabric enabled tenant](https://learn.microsoft.com/fabric/admin/fabric-switch#enable-for-your-tenant). You can sign up for Fabric (Preview) trials for a period of 60 days by following [this link](https://learn.microsoft.com/fabric/get-started/fabric-trial). +- Access to an Azure subscription. 
You can sign up for a free trial by following [this link](https://azure.microsoft.com/en-us/free/). +- Permissions to [create a service principal](https://learn.microsoft.com/entra/identity-platform/howto-create-service-principal-portal#register-an-application-with-microsoft-entra-id-and-create-a-service-principal) in Microsoft Entra ID. + +### Required resources + +As described above, the sample uses Microsoft Fabric as the data analytics platform, which provides SaaS-based access to the required storage and compute resources. However, the following Microsoft/Azure resources, which are external to Microsoft Fabric, are still required: + +- Microsoft Purview Account + + You can [create a new account](https://learn.microsoft.com/purview/create-microsoft-purview-portal) or use an existing one. Note down the Purview account name. + +- Azure ML Managed Feature Store + + You can create an [Azure ML managed feature store](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2) by running the following Azure CLI command, replacing each placeholder with your own value: + + ```bash + az ml feature-store create \ + --subscription <subscription-id> \ + --resource-group <resource-group-name> \ + --location <location> \ + --name <feature-store-name> + ``` + + To explore alternative methods for building a feature store, check out the [Microsoft documentation](https://learn.microsoft.com/azure/machine-learning/tutorial-get-started-with-feature-store?view=azureml-api-2&tabs=SDK-track#create-a-minimal-feature-store). + + Note down the subscription id, resource group, and feature store name. + +- Service Principal + + This service principal is required to access and interact with the Azure ML managed feature store and with Microsoft Purview from Fabric notebooks.
Create the service principal in Microsoft Entra ID by [registering an application](https://learn.microsoft.com/purview/create-service-principal-azure#app-registration) and [adding a secret to the client credentials](https://learn.microsoft.com/purview/create-service-principal-azure#adding-a-secret-to-the-client-credentials). Note down the client id, client secret, and tenant id. + +- Grant access on Azure ML managed feature store and Purview to the service principal + + To grant managed feature store access to this service principal, go to the feature store in Azure Portal, and assign the `AzureML Data Scientist` role to this SP. + + ![roles](./images/featurestore_1.png) + + To grant access to Microsoft Purview, assign the [Data curators](https://learn.microsoft.com/en-us/purview/how-to-create-and-manage-collections#roles) role in the Purview root collection to this SP. For details, refer to [Microsoft documentation](https://learn.microsoft.com/purview/how-to-create-and-manage-collections#add-role-assignments). + +### Microsoft Fabric setup + +1. Create a Microsoft Fabric workspace + + After signing into Microsoft Fabric, the very first step is to create a workspace. Workspaces contain the Fabric items such as lakehouses, warehouses, and reports. For detailed instructions about creating a Fabric workspace, refer to [this documentation](https://learn.microsoft.com/fabric/data-warehouse/tutorial-create-workspace). + + Please note that you need to select 'trial' under the 'license mode' in the advanced options, as currently only this type supports all the Microsoft Fabric features required for this sample. + +2. Create a Fabric lakehouse + + After the lakehouse is created, go back to the workspace. Click the 'New' button from the **Data Engineering** or the **Data Science** homepage and select 'Import notebook' from the menu. Navigate to the `src/notebooks` folder under the repo, and select all the notebooks to import: + +3.
Import the sample notebooks + + After the lakehouse is created, go back to the workspace. Click the 'New' button again and select 'Import notebook' from the menu. Navigate to `src/notebooks` folder under the repo, and select all the notebooks to import: + + ![import all notebooks](./images/import_all_notebooks.png) + +4. Add the created lakehouse to the imported notebooks + + After importing, open the following notebooks and add the lakehouse created in the previous step to them: + + - [data_cleansing](./src/notebooks/data_cleansing.ipynb) + - [data_ingestion](./src/notebooks/data_ingestion.ipynb) + - [data_transformation](./src/notebooks/data_transformation.ipynb) + - [data_validation](./src/notebooks/data_validation.ipynb) + - [exploratory_data_analysis](./src/notebooks/exploratory_data_analysis.ipynb) + - [feature_set_registration](./src/notebooks/feature_set_registration.ipynb) + + ![add lakehouse](./images/add_lh_to_notebook.png) + +5. Create and setup a Fabric 'Environment' + + Go to Fabric workspace homepage and create a new __Environment__ by clicking '+ New' button and selecting `Environment (Preview)` under 'Data Engineering' or 'Data Science' experience. + + ![new](./images/featurestore_5.png) + ![env](./images/featurestore_4.png) + + In this environment, you will setup the python environment under 'Public Libraries'. This can be done by adding `azureml-featurestore` package using PyPI, or by clicking 'Add from yml' and selecting `./src/environment/Publiclibrary.yml` from this repo. + + ![pip](./images/featurestore_3.png) + + You will also need to edit the Spark properties. The yaml template with the required properties is available at `./src/environment/sparkProperties.yaml`. Replace the value with the resources that you created. 
+ + ```yaml + runtime_version: '1.1' + spark_conf: + - spark.fsd.client_id: <client-id> + - spark.fsd.tenant_id: <tenant-id> + - spark.fsd.subscription_id: <subscription-id> + - spark.fsd.rg_name: <resource-group-name> + - spark.fsd.name: <feature-store-name> + - spark.fsd.fabric.tenant: <fabric-tenant-name> # Fetch from Fabric base URL, like https://<fabric-tenant-name>.powerbi.com/ + - spark.fsd.purview.account: <purview-account-name> + ``` + + After this is done, click 'Save' to save the environment, and then 'Publish' to publish it. This may take a few minutes to finish. + + Finally, to apply the environment, you can set the newly created environment as 'default' in the Fabric workspace settings page. + + ![pip](./images/fabric_env_2.png) + + Or you can apply it to each specific notebook in the notebook editing page. + + ![pip](./images/fabric_env_1.png) + +## Data pipeline setup + +This is the main data pipeline which contains all the activities required for data landing, ingestion, cleansing, transformation, and feature registration. Start by creating a new data pipeline in the workspace. + +![data_pipeline_01](./images/data_pipeline/data_pipeline_01.png) + +### Data landing activity + +As mentioned in the introduction, a 'ForEach' activity is used for data landing. This activity downloads multiple files from a public blob storage. Add the 'ForEach' activity to the data pipeline [^1]. + +![data_pipeline_02](./images/data_pipeline/data_pipeline_02.png) + +The configuration of this 'ForEach' activity follows the series of steps below. + +- Add copy data activity + + Select the 'ForEach' activity that was just created and click '+' to add a 'Copy data' activity. + + ![data_pipeline_03](./images/data_pipeline/data_pipeline_03.png) + +- Define source and destination for 'Copy data' activity + + Select the 'Copy data' activity and switch to the 'Source' tab. Choose 'external' as the 'Data store type'. For 'Connection', click '+' to create a new HTTP link. On the creation page, enter the base URL from the source dataset section, `https://stmdwpublic.blob.core.windows.net/`, as the server URL. Click the 'Create' button to complete the setup.
+ + ![data_pipeline_04](./images/data_pipeline/data_pipeline_04.png) + +- Define pipeline parameters + + Return to the pipeline design panel and click anywhere on the blank area. Then, select 'Parameters' and add the following three parameters: + + - __URLs__ + + Type: Array + + Default Value: + + ``` json + ["datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-01.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-02.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-03.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-04.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-05.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-06.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-07.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-08.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-09.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-10.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-11.parquet", + "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-12.parquet", + "datasets/nyc-yellow-tripdata-2022/taxi_zone_lookup.csv"] + ``` + + - __landing_path__ + + Type: String + + Default Value: `01_landing` + + - __client_secret__ + + Type: SecureString + + Default Value: _The client secret of the service principal that you created earlier._ + + ![data_pipeline_05](./images/data_pipeline/data_pipeline_05.png) + +- Configure 'Settings' for 'ForEach' activity + + Return to the Settings tab of the 'ForEach' activity and set the Batch count value to 20. This limits the maximum number of files to be downloaded concurrently. For 'Items', select the parameter 'URLs', or directly enter `@pipeline().parameters.URLs`. 
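
The effect of the pipeline parameters and 'ForEach' settings above can be sketched in plain Python. This is only an illustration, not part of the sample: the function name `plan_copies` is hypothetical, no files are downloaded, and it assumes that each copied file keeps the last segment of its relative URL as its file name under the landing path.

```python
from posixpath import join as path_join
from urllib.parse import urljoin

# Values taken from this README; the constant names are illustrative.
BASE_URL = "https://stmdwpublic.blob.core.windows.net/"
LANDING_PATH = "01_landing"
BATCH_COUNT = 20  # upper bound on concurrent copies in the 'ForEach' activity


def plan_copies(relative_urls, base_url=BASE_URL, landing_path=LANDING_PATH):
    """Return (source URL, destination path) pairs, one per 'ForEach' item."""
    plan = []
    for item in relative_urls:
        # The HTTP connection supplies the base URL; the item is the relative URL.
        source = urljoin(base_url, item)
        # Assumption: the destination file name is the last path segment.
        file_name = item.split("/")[-1]
        plan.append((source, path_join(landing_path, file_name)))
    return plan


if __name__ == "__main__":
    urls = [
        "datasets/nyc-yellow-tripdata-2022/yellow_tripdata_2022-01.parquet",
        "datasets/nyc-yellow-tripdata-2022/taxi_zone_lookup.csv",
    ]
    for source, destination in plan_copies(urls):
        print(source, "->", destination)
```

With the 13 default entries in 'URLs', a batch count of 20 means all items can be processed in a single parallel batch.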
+ + ![data_pipeline_06](./images/data_pipeline/data_pipeline_06.png) + +- Review 'Source' settings for 'Copy data' activity + + Return to the 'Source' tab of the 'Copy data' activity within the 'ForEach' activity. In the connection, select the HTTP link that was created before. The 'Connection type' should be 'HTTP'. For 'Relative URL', enter '@item()' and select 'Binary' as the 'File format'. This completes the configuration of the source in the 'Copy data' activity. + + ![data_pipeline_07](./images/data_pipeline/data_pipeline_07.png) + +- Review 'Destination' settings for 'Copy data' activity + + Switch to the 'Destination' tab and enter the configuration as shown in the image. Note that if you have not already created a lakehouse, you need to create a new one in the workspace first and select it here. If you have already created the lakehouse, select that. For the 'File path', use the following values: + + ```text + File Path: @pipeline().parameters.landing_path + File name: @last(array(split(item(),'/'))) + ``` + + ![data_pipeline_08](./images/data_pipeline/data_pipeline_08.png) + +- Test the pipeline + + Run this pipeline with the single 'Data Landing' activity. If everything is properly configured, you should be able to see the downloaded data in the lakehouse as shown below. + + ![data_pipeline_09](./images/data_pipeline/data_pipeline_09.png) + + ![data_pipeline_10](./images/data_pipeline/data_pipeline_10.png) + +### Notebook activities + +Once the 'Data Landing' activity is configured, we can start to add more activities to perform the data ingestion, cleansing, transformation, and feature set registration tasks.
Each of these tasks is performed by a 'notebook' activity which invokes the corresponding notebook: + +- [data_ingestion](./src/notebooks/data_ingestion.ipynb) +- [data_cleansing](./src/notebooks/data_cleansing.ipynb) +- [data_transformation](./src/notebooks/data_transformation.ipynb) +- [feature_set_registration](./src/notebooks/feature_set_registration.ipynb) + +When we finish configuring the whole pipeline, it should look like this. + +![data_pipeline_11](./images/data_pipeline/data_pipeline_11.png) + +Here are the steps involved in configuring these notebook activities. + +- Add four 'Notebook' activities + + Select the 'Activities' tab and add four 'Notebook' activities. + + ![data_pipeline_12](./images/data_pipeline/data_pipeline_12.png) + +- Link the activities + + Link the activities within the pipeline. Make sure to drag the 'On success' check mark located on the right side of each activity to the next one. This ensures that the subsequent activity is executed only when the current one has completed successfully. + + ![data_pipeline_13](./images/data_pipeline/data_pipeline_13.png) + +- Configure the 'Notebook' activities + + Rename each notebook activity by selecting it, clicking the 'General' tab and updating the name of the 'Notebook' activity. Name them 'Data Ingestion', 'Data Cleansing', 'Data Transformation', and 'Feature Registration' respectively. + + For each notebook activity, add the following 'Base parameters' in the 'Settings' tab: + + ```text + Name: client_secret + Type: String + Value: `@pipeline().parameters.client_secret` + ``` + +- Associate actual notebooks with the activities + + Within the 'Settings' tab, select our current workspace and then choose the right Notebook file.
+ + ```text + Data Ingestion -> data_ingestion + Data Cleansing -> data_cleansing + Data Transformation -> data_transformation + Feature Registration -> feature_set_registration + ``` + + ![data_pipeline_15](./images/data_pipeline/data_pipeline_15.png) + +## Execute and verify + +Once the setup is complete, you can trigger the pipeline and verify the results. + +### Trigger the data pipeline + +Access the Fabric workspace and locate the item categorized as a `Data pipeline`. + +![data pipeline](./images/pipeline_image.png) + +In order to trigger the data pipeline, choose this data pipeline, and click `Run`. + +![data pipeline](./images/data_pipeline.png) + +In the pane, you can check the input and output of each activity, which give more detailed execution results. + +![pipeline run](./images/pipeline_run.png) + +### Verify lineage in Purview + +Once the Fabric data pipeline has executed successfully, the data pipeline and feature lineage can be checked in Microsoft Purview. + +#### Data lineage + +- Open the [Microsoft Purview Governance Portal](https://learn.microsoft.com/en-us/purview/use-microsoft-purview-governance-portal) of your Purview account. + +- Go to _Data Catalog -> Home_ and type _Fabric notebook_ in the search field. In the _Asset suggestions_ list, you should be able to see some Fabric notebook items that have been executed via the Fabric data pipeline. Click on one of the items such as _data_ingestion (Fabric notebook)_. To know more about searching the Purview catalog, refer to [this documentation](https://learn.microsoft.com/purview/how-to-search-catalog). + +- Go to the _Properties_ tab of the asset page. There you can view the input and output assets of this notebook asset, as well as the qualified name of this asset. Via this link, the Fabric notebook can be opened directly in a new browser tab. + +- In the _Lineage_ tab, you can see the lineage view of the whole data processing lifecycle that was executed as part of this demo.
+ +- Click any node in the lineage view and then click the _Switch to asset_ link in the lower-left to navigate to another asset to check more details. + + ![process_asset_lineage_view](./images/data_lineage/data_lineage_in_purview.gif) + +#### Feature lineage + +- In the lineage view of the transformed data asset, you can see three downstream assets, which are features registered in Azure ML managed feature store. + +- Switch to one of the features. In the _Properties_ tab, you can see the data type of the feature as well as the feature set the feature belongs to. + +- Clicking the feature set link takes you to the related feature set asset. In the _Properties_ tab, you can tell what features it contains, which feature store it belongs to, as well as the qualified name. This link can be clicked to open the feature set view of Azure ML managed feature store in a new browser tab. + +- Go to the _Related_ tab of the feature set asset. It shows the asset type hierarchy view of feature store relevant asset types. Click the _features_ node to get all available features displayed in the upper-left of the canvas, or click the _featurestore_ node to get the feature store details. Then click the link to navigate to another asset of interest. + + ![feature_lineage](./images/data_lineage/feature_lineage.gif) + +### Verify the features in Feature Store + +If the pipeline executes successfully, you can verify the features registered in the Azure ML Managed Feature Store. + +- Log in to Azure ML Studio, and select the _Feature stores_ tab from the left menu. +- Find the feature store under _Feature stores_ that you created and click on the link. +- Select the _Feature sets_ tab from the left menu, and you can see the feature set that you registered. Click on the link. +- You can see the features that you registered under the _Features_ tab. You can also see the feature set specification under the _Specification_ tab. +- Similarly, you can also check the entities from the _Entities_ tab.
+ + ![feature_lineage](./images/managed_feature_store.gif) + +## Model training and inferencing + +So far in this sample, the source data has been ingested, cleansed, transformed, and registered as features in Azure ML managed feature store. The data lineage of each processing step has also been registered in Microsoft Purview. Now, we can use these features to train a machine learning model. + +Even though these can be included in the same data pipeline, the training and inferencing notebooks are run separately as distinct operations for the purpose of this sample. + +### Model training + +The model training notebook is available at [model_training](./src/notebooks/model_training.ipynb). This notebook uses the features registered in the previous step to train a machine learning model. Like previous notebooks, it also registers the lineage in Microsoft Purview. But this time, it's the model training lineage. + +> Note: the model_training notebook needs to retrieve data from the feature store, which requires credentials to access the feature store. Make sure the `client_secret` parameter is set in the notebook `feature_set_retrieval`. +> + +To run the notebook, open it and click `Run all`. The model will be trained and registered as an `ML model` in the Fabric workspace. + +![machine learning models](./images/model_type.png) + +For each ML model, different versions can be tracked with different parameters and metrics. + +![ml model list](./images/model_list.png) + +### Model inferencing + +The model inferencing notebook is available at [model_inferencing](./src/notebooks/model_inferencing.ipynb). This notebook uses the 'ML Model' that was trained and registered during the model training phase (previous step) to perform inferencing. + +Go to the model_inferencing notebook, and click `Run all`. Once the execution is complete, the prediction results are generated.
+ +![inferencing result](./images/inferencing_result.png) + +### Verify lineage in Purview + +Once the training and inferencing notebooks have been executed successfully, the model training lineage can be checked in Microsoft Purview. + +#### Model training lineage + +- Go to the _Data Catalog -> Home_ page, type _Fabric notebook_ in the search field, and click the 'model_training (Fabric notebook)' item in the _Asset suggestions_ list. + +- In the _Lineage_ tab, you can see several extra assets appended to the end of the previous lineage path. They show the lineage of the trained model, including which features were used in the model training. You can navigate to the actual model training notebook via the qualified name to check more details, or go to the model training experiment page in Fabric to get more details on the experiment runs. + +- Switch to the _demand_prediction_model_ asset. Its _Properties_ tab shows the model version and related experiment run name. + + ![ml_model_training_lineage](./images/data_lineage/model_training_lineage.gif) + +## Conclusion + +Congratulations on completing this sample! By now, you have learned how to use Azure ML managed feature store and Microsoft Fabric to build a feature engineering system. You have also learned how to track and monitor the data lineage of the features and the model training process using Microsoft Purview. + +Feel free to check out the product documentation links in the references section below to learn more about the features and services used in this sample.
+ +## References + +- [Microsoft Fabric](https://learn.microsoft.com/fabric/get-started/microsoft-fabric-overview) +- [Azure Machine Learning Managed Feature Store](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2) +- [Microsoft Purview](https://azure.microsoft.com/en-us/products/purview/) + +[^1]: For Data Landing, another approach is to use ADLS (Azure Data Lake Storage) or Blob storage. First, download the files to your local machine and upload them to ADLS. Then use a Copy Data activity with a linked service to the storage account to complete the same task. diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/add_lh_to_notebook.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/add_lh_to_notebook.png new file mode 100644 index 000000000..108f8e519 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/add_lh_to_notebook.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/data_lineage_in_purview.gif b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/data_lineage_in_purview.gif new file mode 100644 index 000000000..b972bd0a6 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/data_lineage_in_purview.gif differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_lineage.gif b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_lineage.gif new file mode 100644 index 000000000..78ca8469a Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_lineage.gif differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_properties_view.png
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_properties_view.png new file mode 100644 index 000000000..d69eb303d Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_properties_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_store_lineage_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_store_lineage_view.png new file mode 100644 index 000000000..54ddb7334 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/feature_store_lineage_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_properties_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_properties_view.png new file mode 100644 index 000000000..6ac3edfa5 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_properties_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_related_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_related_view.png new file mode 100644 index 000000000..20bcc72c1 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/featureset_related_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_lineage_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_lineage_view.png new file mode 100644 index 000000000..3859de04b Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_lineage_view.png differ diff --git 
a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_properties_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_properties_view.png new file mode 100644 index 000000000..4376aedca Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/ml_model_properties_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/model_training_lineage.gif b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/model_training_lineage.gif new file mode 100644 index 000000000..8a289800a Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/model_training_lineage.gif differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_lineage_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_lineage_view.png new file mode 100644 index 000000000..95a2f163b Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_lineage_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_properties_view.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_properties_view.png new file mode 100644 index 000000000..41135ce60 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_lineage/process_asset_properties_view.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline.png new file mode 100644 index 000000000..bd369ec66 Binary files /dev/null and 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/ADLS_source_file.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/ADLS_source_file.png new file mode 100644 index 000000000..8838ec860 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/ADLS_source_file.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_01.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_01.png new file mode 100644 index 000000000..245fff8cd Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_01.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_02.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_02.png new file mode 100644 index 000000000..f2e56d63f Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_02.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_03.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_03.png new file mode 100644 index 000000000..6eee98d7d Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_03.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_04.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_04.png new file mode 100644 index 000000000..689086af5 Binary files /dev/null and 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_04.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_05.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_05.png new file mode 100644 index 000000000..1072e7b01 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_05.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_06.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_06.png new file mode 100644 index 000000000..20b34040e Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_06.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_07.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_07.png new file mode 100644 index 000000000..19b87f58f Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_07.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_08.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_08.png new file mode 100644 index 000000000..a0f9df896 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_08.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_09.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_09.png new file mode 100644 index 000000000..eb8496ac3 Binary files /dev/null and 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_09.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_10.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_10.png new file mode 100644 index 000000000..b03b55ee6 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_10.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_11.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_11.png new file mode 100644 index 000000000..321a6b992 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_11.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_12.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_12.png new file mode 100644 index 000000000..f11443c8b Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_12.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_13.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_13.png new file mode 100644 index 000000000..b2d14b06a Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_13.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_15.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_15.png new file mode 100644 index 000000000..99ff9ee8a Binary files /dev/null and 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/data_pipeline_15.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_01.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_01.png new file mode 100644 index 000000000..288742238 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_01.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_02.jpg b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_02.jpg new file mode 100644 index 000000000..9785ffe52 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/data_pipeline/workspace_02.jpg differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_1.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_1.png new file mode 100644 index 000000000..b1241db97 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_1.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_2.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_2.png new file mode 100644 index 000000000..77dfb156e Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fabric_env_2.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_1.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_1.png new file mode 100644 index 000000000..f620f40c3 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_1.png differ diff --git 
a/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_2.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_2.png new file mode 100644 index 000000000..555ffeb63 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_2.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_3.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_3.png new file mode 100644 index 000000000..5af851393 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_3.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_4.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_4.png new file mode 100644 index 000000000..80df3a99d Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_4.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_5.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_5.png new file mode 100644 index 000000000..d3ae8884b Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/featurestore_5.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/fs_architecture.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fs_architecture.png new file mode 100644 index 000000000..7361139af Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/fs_architecture.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/import_all_notebooks.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/import_all_notebooks.png new file mode 100644 index 000000000..2a0e841b1 Binary files /dev/null and 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/import_all_notebooks.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/inferencing_result.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/inferencing_result.png new file mode 100644 index 000000000..fb5a44d1f Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/inferencing_result.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/managed_feature_store.gif b/single_tech_samples/fabric/feature_engineering_on_fabric/images/managed_feature_store.gif new file mode 100644 index 000000000..a4e498b4c Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/managed_feature_store.gif differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_list.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_list.png new file mode 100644 index 000000000..4f99b8613 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_list.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_type.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_type.png new file mode 100644 index 000000000..f83c505ef Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/model_type.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_image.png b/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_image.png new file mode 100644 index 000000000..97c515c44 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_image.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_run.png 
b/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_run.png new file mode 100644 index 000000000..e1d1c0f60 Binary files /dev/null and b/single_tech_samples/fabric/feature_engineering_on_fabric/images/pipeline_run.png differ diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/Publiclibrary.yml b/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/Publiclibrary.yml new file mode 100644 index 000000000..19f2c3b15 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/Publiclibrary.yml @@ -0,0 +1,3 @@ +dependencies: + - pip: + - azureml-featurestore==0.1.0b5 diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/sparkProperties.yml b/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/sparkProperties.yml new file mode 100644 index 000000000..2c341ef45 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/environment/sparkProperties.yml @@ -0,0 +1,9 @@ +runtime_version: '1.1' +spark_conf: + - spark.fsd.client_id: + - spark.fsd.tenant_id: + - spark.fsd.subscription_id: + - spark.fsd.rg_name: + - spark.fsd.name: + - spark.fsd.fabric.tenant: # Fetch from Fabric base URL, like https://.powerbi.com/ + - spark.fsd.purview.account: diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_catalog_and_lineage.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_catalog_and_lineage.ipynb new file mode 100644 index 000000000..57e34dcca --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_catalog_and_lineage.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"code","execution_count":null,"id":"ce9d5633-c1d6-4dbe-9478-a5fe2e1c4a4d","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["class SPCredentials():\n"," 
CLIENT_ID = spark.conf.get(\"spark.fsd.client_id\")\n"," CLIENT_SECRET = client_secret\n"," TENANT_ID = spark.conf.get(\"spark.fsd.tenant_id\")\n"]},{"cell_type":"code","execution_count":null,"id":"fa4bc977-04f0-4c78-9922-d09946ff4380","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["PURVIEW_ACCOUNT = spark.conf.get(\"spark.fsd.purview.account\")"]},{"cell_type":"code","execution_count":null,"id":"e633f5aa-a174-4bd5-a2e8-fc61e9686c8e","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["import json\n","import requests\n","from typing import Dict, List\n","\n","class DataAsset:\n"," \"\"\" DataAsset class to describe data asset\n"," \"\"\"\n","\n"," def __init__(self,\n"," name: str,\n"," asset_type: str,\n"," fully_qualified_name: str,\n"," custom_properties: Dict = None,\n"," relationship_attributes: Dict = None):\n"," self.name = name\n"," self.type = asset_type\n"," self.fully_qualified_name = fully_qualified_name\n"," self.custom_properties = custom_properties\n"," self.relationship_attributes = relationship_attributes\n"]},{"cell_type":"code","execution_count":null,"id":"ddf42aad-23cd-4abc-beb9-1b913839cfd6","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["class DataLineage:\n"," \"\"\" DataLineage class to describe data assets and their relationships in a data pipeline\n"," \"\"\"\n","\n"," def __init__(self,\n"," input_data_assets: List[DataAsset],\n"," output_data_assets: List[DataAsset],\n"," process_asset: DataAsset = None):\n"," self.input_data_assets = input_data_assets\n"," self.process_asset = process_asset\n"," self.output_data_assets = 
output_data_assets\n"]},{"cell_type":"code","execution_count":null,"id":"7519f5ce-8381-47d8-bd26-e7d5148cabba","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["class CustomTypes():\n"," AZURE_ML_MANAGED_FEATURE_STORE = {\n"," \"entityDefs\": [\n"," {\n"," \"category\": \"ENTITY\",\n"," \"name\": \"Azure_ML_Managed_Feature_Store\",\n"," \"description\": \"Azure ML Managed Feature Store\",\n"," \"typeVersion\": \"1.0\",\n"," \"superTypes\": [\n"," \"DataSet\"\n"," ],\n"," \"attributeDefs\": [\n"," {\n"," \"name\": \"tags\",\n"," \"typeName\": \"map\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," }\n"," ]\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_MANAGED_FEATURE_STORE_FEATURESET = {\n"," \"entityDefs\": [\n"," {\n"," \"category\": \"ENTITY\",\n"," \"name\": \"Azure_ML_Managed_Feature_Store_Featureset\",\n"," \"description\": \"Azure ML Managed Feature Store Featureset\",\n"," \"typeVersion\": \"1.0\",\n"," \"superTypes\": [\n"," \"DataSet\"\n"," ],\n"," \"attributeDefs\": [\n"," {\n"," \"name\": \"version\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"entities\",\n"," \"typeName\": \"array\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SET\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 2147483647,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"stage\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 
1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"materialization\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"tags\",\n"," \"typeName\": \"map\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," }\n"," ]\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_MANAGED_FEATURE_STORE_FEATURE = {\n"," \"entityDefs\": [\n"," {\n"," \"category\": \"ENTITY\",\n"," \"name\": \"Azure_ML_Managed_Feature_Store_Feature\",\n"," \"description\": \"Azure ML Managed Feature Store Feature\",\n"," \"typeVersion\": \"1.0\",\n"," \"superTypes\": [\n"," \"DataSet\"\n"," ],\n"," \"attributeDefs\": [\n"," {\n"," \"name\": \"data_type\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"tags\",\n"," \"typeName\": \"map\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," }\n"," ]\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_EXPERIMENT = {\n"," \"entityDefs\": [\n"," {\n"," \"category\": \"ENTITY\",\n"," \"name\": \"Azure_ML_Experiment\",\n"," \"description\": \"Azure ML Experiment\",\n"," \"typeVersion\": \"1.0\",\n"," \"superTypes\": [\n"," \"DataSet\"\n"," ]\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_MODEL = {\n"," \"entityDefs\": [\n"," {\n"," \"category\": 
\"ENTITY\",\n"," \"name\": \"Azure_ML_Model\",\n"," \"description\": \"Azure ML Model\",\n"," \"typeVersion\": \"1.0\",\n"," \"superTypes\": [\n"," \"DataSet\"\n"," ],\n"," \"attributeDefs\": [\n"," {\n"," \"name\": \"version\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"experimentRunID\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"experimentRunName\",\n"," \"typeName\": \"string\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," },\n"," {\n"," \"name\": \"tags\",\n"," \"typeName\": \"map\",\n"," \"isOptional\": True,\n"," \"cardinality\": \"SINGLE\",\n"," \"valuesMinCount\": 0,\n"," \"valuesMaxCount\": 1,\n"," \"isUnique\": False,\n"," \"isIndexable\": False,\n"," \"includeInNotification\": False\n"," }\n"," ]\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_MANAGED_FEATURE_STORE_FEATURESETS = {\n"," \"relationshipDefs\": [\n"," {\n"," \"category\": \"RELATIONSHIP\",\n"," \"name\": \"azure_ml_managed_feature_store_featuresets\",\n"," \"description\": \"Azure ML Managed Feature Store contains Featuresets\",\n"," \"typeVersion\": \"1.0\",\n"," \"serviceType\": \"atlas_core\",\n"," \"lastModifiedTS\": \"1\",\n"," \"attributeDefs\": [],\n"," \"relationshipCategory\": \"COMPOSITION\",\n"," \"propagateTags\": \"NONE\",\n"," \"endDef1\": {\n"," \"type\": \"Azure_ML_Managed_Feature_Store\",\n"," \"name\": \"featuresets\",\n"," \"isContainer\": True,\n"," 
\"cardinality\": \"SET\",\n"," \"isLegacyAttribute\": False\n"," },\n"," \"endDef2\": {\n"," \"type\": \"Azure_ML_Managed_Feature_Store_Featureset\",\n"," \"name\": \"featurestore\",\n"," \"isContainer\": False,\n"," \"cardinality\": \"SINGLE\",\n"," \"isLegacyAttribute\": False\n"," }\n"," }\n"," ]\n"," }\n","\n"," AZURE_ML_MANAGED_FEATURE_STORE_FEATURESET_FEATURES = {\n"," \"relationshipDefs\": [\n"," {\n"," \"name\": \"azure_ml_managed_feature_store_featureset_features\",\n"," \"description\": \"Azure MFS featureset contains features\",\n"," \"serviceType\": \"atlas_core\",\n"," \"relationshipCategory\": \"COMPOSITION\",\n"," \"endDef1\": {\n"," \"type\": \"Azure_ML_Managed_Feature_Store_Featureset\",\n"," \"name\": \"features\",\n"," \"isContainer\": True,\n"," \"cardinality\": \"SET\"\n"," },\n"," \"endDef2\": {\n"," \"type\": \"Azure_ML_Managed_Feature_Store_Feature\",\n"," \"name\": \"featureset\",\n"," \"isContainer\": False,\n"," \"cardinality\": \"SINGLE\"\n"," }\n"," }\n"," ]\n"," }\n"]},{"cell_type":"code","execution_count":null,"id":"098f8d4b-cc0c-4ae7-9865-5e8fb8508cf7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["class PurviewClient():\n"," \"\"\"Calling Purview REST APIs to complete below tasks:\n"," 1. Create data entities\n"," 2. Create/update lineage for Purview entities/assets\n"," 3. 
Create custom entity or relationship types\n"," \"\"\"\n"," FAILED_FETCHING_ACCESS_TOKEN_MSG = \"[Purview Exception] Failed fetching access token.\"\n"," FAILED_CREATING_ENTITY_TYPE_MSG = \"[Purview Exception] Failed creating custom entity type.\" \n"," FAILED_CREATING_ENTITY_MSG = \"[Purview Exception] Failed creating entity.\"\n"," FAILED_UPDATING_ENTITY_LINEAGE_MSG = \"[Purview Exception] Failed updating lineage of entity.\"\n"," FAILED_GETTING_ENTITY_DETAILS_MSG = \"[Purview Exception] Failed getting entity details.\"\n","\n"," def __init__(self, tenant_id, client_id, client_secret, purview_account):\n"," self._tenant_id = tenant_id\n"," self._client_id = client_id\n"," self._client_secret = client_secret\n"," self._purview_account = purview_account\n"," self._purview_rest_api_base_url = f\"https://{purview_account}.purview.azure.com\"\n"," self._access_token = self._get_access_token()\n","\n","\n"," def _get_access_token(self):\n"," \"\"\"Get access token by service principals\n"," \"\"\"\n"," fetch_token_url = f\"https://login.microsoftonline.com/{self._tenant_id}/oauth2/v2.0/token\"\n"," headers = {\"Content-Type\": \"application/x-www-form-urlencoded\"}\n"," body = {\n"," \"grant_type\": \"client_credentials\",\n"," \"client_id\": self._client_id,\n"," \"client_secret\": self._client_secret,\n"," \"scope\": \"https://purview.azure.net/.default\"\n"," }\n","\n"," response = requests.post(\n"," url=fetch_token_url,\n"," headers=headers,\n"," data=body)\n","\n"," if response.status_code == 200:\n"," access_token = json.loads(response.text)[\"access_token\"]\n"," return access_token\n"," else:\n"," errors = response.content\n"," raise Exception(f\"{self.FAILED_FETCHING_ACCESS_TOKEN_MSG} error_message: {errors}\")\n","\n","\n"," def create_entity(self, data_asset: DataAsset):\n"," \"\"\"Create data entity\n","\n"," Parameters\n"," ----------\n"," data_asset: DataAsset\n"," Data asset object\n"," \"\"\"\n"," create_entity_url = 
(f\"{self._purview_rest_api_base_url}\"\n"," f\"/catalog/api/collections/{self._purview_account}/entity?api-version=2022-03-01-preview\")\n"," headers = {\"Authorization\": f\"Bearer {self._access_token}\", \"Content-Type\": \"application/json\"}\n"," entity_qualified_name = data_asset.fully_qualified_name\n"," custom_properties = data_asset.custom_properties\n"," relationship_attributes = data_asset.relationship_attributes\n","\n"," entity_type_name = \"\"\n"," if data_asset.type.lower() in [\"csv\", \"txt\", \"tsv\", \"parquet\"]:\n"," entity_type_name = \"azure_datalake_gen2_path\"\n"," elif data_asset.type.lower() == \"model\":\n"," entity_type_name = \"machine_learning_models\"\n"," elif data_asset.type.lower() == \"delta\":\n"," entity_type_name = \"azure_datalake_gen2_resource_set\"\n"," elif data_asset.type.lower() == \"feature\":\n"," entity_type_name = \"Azure_ML_Managed_Feature_Store_Feature\"\n"," elif data_asset.type.lower() == \"featureset\":\n"," entity_type_name = \"Azure_ML_Managed_Feature_Store_Featureset\"\n"," elif data_asset.type.lower() == \"featurestore\":\n"," entity_type_name = \"Azure_ML_Managed_Feature_Store\"\n"," elif data_asset.type.lower() == \"ml_experiment\":\n"," entity_type_name = \"Azure_ML_Experiment\"\n"," elif data_asset.type.lower() == \"ml_model\":\n"," entity_type_name = \"Azure_ML_Model\"\n","\n"," body = {\n"," \"entity\": {\n"," \"typeName\": entity_type_name,\n"," \"attributes\": {\n"," \"qualifiedName\": entity_qualified_name,\n"," \"name\": data_asset.name\n"," },\n"," \"source\": \"Fabric\",\n"," \"relationshipAttributes\": {\n"," },\n"," \"labels\": []\n"," }\n"," }\n","\n"," # Add more attributes for file type data assets\n"," if entity_type_name in [\"azure_datalake_gen2_path\", \"machine_learning_models\"]:\n"," body[\"entity\"][\"attributes\"][\"isFile\"] = True\n"," body[\"entity\"][\"attributes\"][\"path\"] = f\"Files/{entity_qualified_name.split('/Files/')[1]}\"\n"," body[\"entity\"][\"attributes\"][\"size\"] 
= self._get_file_size(entity_qualified_name)\n","\n"," # Add custom properties if they are not null\n"," if custom_properties:\n"," for property_key in custom_properties.keys():\n"," body[\"entity\"][\"attributes\"][property_key] = custom_properties[property_key]\n","\n"," # Configure relationship attributes if they are not null\n"," if relationship_attributes:\n"," for relationship_attribute in relationship_attributes:\n"," relationship_attribute_type = relationship_attribute[\"type\"]\n"," type_name = None\n"," if relationship_attribute_type == \"featureset\":\n"," type_name = \"Azure_ML_Managed_Feature_Store_Featureset\"\n"," elif relationship_attribute_type == \"featurestore\":\n"," type_name = \"Azure_ML_Managed_Feature_Store\"\n","\n"," # Configure relationship attribute of cardinality type 'SINGLE'\n"," if type_name:\n"," body[\"entity\"][\"relationshipAttributes\"][relationship_attribute_type] = {\n"," \"typeName\": type_name,\n"," \"uniqueAttributes\": {\n"," \"qualifiedName\": relationship_attribute[\"qualified_name\"]\n"," }\n"," }\n","\n"," # Configure relationship attribute of cardinality type 'SET'\n"," if relationship_attribute_type == \"sources\":\n"," type_name = \"DataSet\"\n"," body[\"entity\"][\"relationshipAttributes\"][relationship_attribute_type] = [{\n"," \"typeName\": type_name,\n"," \"uniqueAttributes\": {\n"," \"qualifiedName\": relationship_attribute[\"qualified_name\"]\n"," }\n"," }]\n","\n"," response = requests.post(\n"," url=create_entity_url,\n"," headers=headers,\n"," data=json.dumps(body, default=str))\n","\n"," if response.status_code == 200:\n"," entity_guid = None\n"," response_json = json.loads(response.text)\n"," mutated_entities = response_json.get(\"mutatedEntities\", None)\n"," if mutated_entities and \"CREATE\" in mutated_entities:\n"," entity_guid = mutated_entities[\"CREATE\"][0][\"guid\"]\n"," print(f\"Created data asset with guid: {entity_guid}\")\n"," elif \"guidAssignments\" in response_json:\n"," for key, value in 
response_json[\"guidAssignments\"].items():\n"," entity_guid = value\n"," print(f\"Target data asset already exists with guid: {entity_guid}\")\n","\n"," return entity_guid\n"," else:\n"," errors = response.content\n"," raise Exception(f\"{self.FAILED_CREATING_ENTITY_MSG} error_message: {errors}\")\n","\n","\n"," def create_lineage(self,\n"," source_entity_guids: List[str],\n"," sink_entity_guids: List[str],\n"," process_data_asset: DataAsset = None):\n"," \"\"\"Add/update lineage of a Purview entity\n","\n"," Parameters\n"," ----------\n"," process_data_asset: DataAsset\n"," Process data asset\n","\n"," source_entity_guids: List[str]\n"," List of Purview entity guids of source/input entities\n","\n"," sink_entity_guids: List[str]\n"," List of Purview entity guids of sink/output entities\n"," \"\"\"\n"," update_entity_url = (f\"{self._purview_rest_api_base_url}\"\n"," \"/catalog/api/atlas/v2/entity\")\n"," headers = {\"Authorization\": f\"Bearer {self._access_token}\", \"Content-Type\": \"application/json\"}\n"," source_entities = []\n"," for source_guid in source_entity_guids:\n"," source_entities.append({\n"," \"guid\": source_guid\n"," })\n","\n"," sink_entities = []\n"," for sink_guid in sink_entity_guids:\n"," sink_entities.append({\n"," \"guid\": sink_guid\n"," })\n","\n"," body = {\n"," \"entity\": {\n"," \"version\": 0,\n"," \"typeName\": \"Process\",\n"," \"attributes\": {\n"," \"inputs\": source_entities,\n"," \"outputs\": sink_entities,\n"," \"qualifiedName\": process_data_asset.fully_qualified_name,\n"," \"name\": process_data_asset.name\n"," }\n"," }\n"," }\n","\n"," response = requests.post(\n"," url=update_entity_url,\n"," headers=headers,\n"," data=json.dumps(body, default=str))\n","\n"," if response.status_code == 200:\n"," entity_guid = None\n"," mutated_entities = json.loads(response.text).get(\"mutatedEntities\", None)\n"," if mutated_entities and \"CREATE\" in mutated_entities:\n"," updated_details = 
mutated_entities[\"CREATE\"][0]\n"," entity_guid = updated_details[\"guid\"]\n"," \n"," print(f\"Created data lineage with process entity guid: {entity_guid}\")\n"," return entity_guid\n"," else:\n"," errors = response.content\n"," error_message = f\"{self.FAILED_UPDATING_ENTITY_LINEAGE_MSG} error_message: {errors}\"\n"," print(error_message)\n"," raise Exception(error_message)\n","\n"," def create_custom_type(self, custom_type_dict: Dict, custom_type: str):\n"," \"\"\"Add/update Purview custom entity or relationship type\n","\n"," Parameters\n"," ----------\n"," custom_type_dict: Dict\n"," Custom type definition\n"," custom_type: str\n"," entityDefs or relationshipDefs type\n"," \"\"\"\n"," custom_type_url = (f\"{self._purview_rest_api_base_url}\"\n"," \"/catalog/api/atlas/v2/types/typedefs\")\n"," headers = {\"Authorization\": f\"Bearer {self._access_token}\", \"Content-Type\": \"application/json\"}\n","\n"," response = requests.post(\n"," url=custom_type_url,\n"," headers=headers,\n"," data=json.dumps(custom_type_dict, default=str))\n","\n"," if response.status_code == 200:\n"," custom_type_guid = None\n"," custom_type_defs = json.loads(response.text).get(custom_type, None)\n"," if custom_type_defs:\n"," for custom_type_def in custom_type_defs:\n"," print(f\"Created custom type {custom_type_def['name']} with guid: {custom_type_def['guid']}\")\n"," elif response.status_code == 409:\n"," print(json.loads(response.text).get(\"errorMessage\", \"\"))\n"," else:\n"," errors = response.content\n"," error_message = f\"{self.FAILED_CREATING_ENTITY_TYPE_MSG} error_message: {errors}\"\n"," print(error_message)\n"," raise Exception(error_message)\n","\n"," def _get_file_size(self, file_path):\n"," \"\"\" Get file size for the given Fabric OneLake path\n","\n"," Parameters\n"," ----------\n"," file_path: str\n"," OneLake file path in\n"," \"\"\"\n"," file_info = mssparkutils.fs.ls(file_path)\n"," file_size = file_info[0].size\n","\n"," return 
file_size\n"]},{"cell_type":"code","execution_count":null,"id":"56de4750-5620-4cb4-8f5e-d2e3145f3f95","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from azure.ai.ml.entities import FeatureSet\n","\n","class PurviewDataCatalog:\n"," \"\"\" Purview Data Catalog class, with operation methods on Purview data catalog objects and their lineages\n"," \"\"\"\n"," def __init__(self):\n"," tenant_id = SPCredentials.TENANT_ID\n"," sp_client_id = SPCredentials.CLIENT_ID\n"," sp_client_secret = SPCredentials.CLIENT_SECRET\n"," purview_account = PURVIEW_ACCOUNT\n","\n"," self._purview_client = PurviewClient(tenant_id, sp_client_id, sp_client_secret, purview_account)\n","\n"," def register_entity(self, data_asset: DataAsset):\n"," \"\"\"Register data asset to Purview\n","\n"," Args:\n"," data_asset: (DataAsset): data asset object\n"," \"\"\"\n"," self._purview_client.create_entity(data_asset)\n","\n"," def register_lineage(self, data_lineage: DataLineage):\n"," \"\"\"Register lineage of data pipeline to Purview\n","\n"," Args:\n"," data_lineage: (DataLineage): data lineage object\n"," \"\"\"\n"," input_data_assets = data_lineage.input_data_assets\n"," output_data_assets = data_lineage.output_data_assets\n"," process_asset = data_lineage.process_asset\n","\n"," input_guids = []\n"," for data_asset in input_data_assets:\n"," print(f\"Creating data asset {data_asset.name}\")\n"," data_asset_guid = self._purview_client.create_entity(data_asset)\n"," input_guids.append(data_asset_guid)\n","\n"," output_guids = []\n"," for data_asset in output_data_assets:\n"," print(f\"Creating data asset {data_asset.name}\")\n"," data_asset_guid = self._purview_client.create_entity(data_asset)\n"," output_guids.append(data_asset_guid)\n","\n"," # Call create_lineage method if process asset is not null, otherwise the connection between source and sink assets has been maintained in sink assets\n"," if 
process_asset:\n"," self._purview_client.create_lineage(input_guids, output_guids, process_asset)\n","\n"," def prepare_feature_assets(self,\n"," featurestore_name: str,\n"," featureset: FeatureSet,\n"," target_features: List,\n"," **kwargs):\n"," \"\"\"Prepare feature assets from features list\n","\n"," Args:\n"," featurestore_name: (str): featurestore name\n"," featureset: (FeatureSet): MFS features list\n"," target_features: (List): target features for registration\n"," \"\"\" \n"," tenant_id = kwargs.get(\"tenant_id\", \"\")\n"," subscription_id = kwargs.get(\"subscription_id\", \"\")\n"," resource_group = kwargs.get(\"resource_group\", \"\")\n","\n"," # Create custom types for Azure ML feature store, feature set and feature\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MANAGED_FEATURE_STORE, \"entityDefs\")\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MANAGED_FEATURE_STORE_FEATURESET, \"entityDefs\")\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MANAGED_FEATURE_STORE_FEATURE, \"entityDefs\")\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MANAGED_FEATURE_STORE_FEATURESETS, \"relationshipDefs\")\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MANAGED_FEATURE_STORE_FEATURESET_FEATURES, \"relationshipDefs\")\n","\n"," # Register feature store entity\n"," featureset_name = featureset.name\n"," featurestore_qualified_name = f\"https://ml.azure.com/featureStore/{featurestore_name}?tid={tenant_id}&wsid=/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}\"\n"," featurestore_entity = DataAsset(featurestore_name,\n"," \"featurestore\",\n"," featurestore_qualified_name)\n"," self._purview_client.create_entity(featurestore_entity)\n","\n"," # Register feature set entity\n"," entity_list = [entity.name for entity in featureset.entities]\n"," featureset_qualified_name = 
f\"https://ml.azure.com/featureStore/{featurestore_name}/featureSets/{featureset_name}/{featureset.version}/details?wsid=/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}&tid={tenant_id}\"\n"," featureset_entity = DataAsset(featureset_name,\n"," \"featureset\",\n"," featureset_qualified_name,\n"," custom_properties={\"entities\": entity_list},\n"," relationship_attributes=[{\n"," \"type\": \"featurestore\",\n"," \"qualified_name\": featurestore_qualified_name\n"," }])\n"," self._purview_client.create_entity(featureset_entity)\n","\n"," feature_transformation_code = featureset.feature_transformation_code\n"," source_qualified_name = featureset.source.path\n"," features_list = [{\"feature_name\": feature.name, \"type\": feature.type} for feature in featureset.features if feature.name in target_features]\n","\n"," feature_assets = []\n"," for feature in features_list:\n"," feature_name = feature[\"feature_name\"]\n"," data_type = getattr(feature[\"type\"], \"value\")\n","\n"," if feature_transformation_code:\n"," feature_entity = DataAsset(feature_name,\n"," \"feature\",\n"," f\"{featureset_name}#{feature_name}\",\n"," custom_properties={\"data_type\": data_type},\n"," relationship_attributes=[{\n"," \"type\": \"featureset\",\n"," \"qualified_name\": featureset_qualified_name\n"," }])\n"," else: # Use sources relationship attribute to connect to source entity/asset for those passthrough features \n"," feature_entity = DataAsset(feature_name,\n"," \"feature\",\n"," f\"{featureset_name}#{feature_name}\",\n"," custom_properties={\"data_type\": data_type},\n"," relationship_attributes=[{\n"," \"type\": \"featureset\",\n"," \"qualified_name\": featureset_qualified_name\n"," },\n"," {\n"," \"type\": \"sources\",\n"," \"qualified_name\": source_qualified_name\n"," }])\n","\n"," feature_assets.append(feature_entity)\n","\n"," return feature_assets\n","\n"," def 
prepare_aml_custom_types(self):\n"," \"\"\"Prepare custom types for Azure machine learning experiments and models\n"," \"\"\" \n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_EXPERIMENT, \"entityDefs\")\n"," self._purview_client.create_custom_type(CustomTypes.AZURE_ML_MODEL, \"entityDefs\")\n"]}],"metadata":{"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python"},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_cleansing.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_cleansing.ipynb new file mode 100644 index 000000000..73effcab1 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_cleansing.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"f6cc8419-b0c9-448a-a510-901e14519b7c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Parameter Setup"]},{"cell_type":"code","execution_count":null,"id":"70870927-0f1d-486a-873a-f4c1d3cceeae","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":["parameters"]},"outputs":[],"source":["fabric_tenant = spark.conf.get(\"spark.fsd.fabric.tenant\")\n"]},{"cell_type":"markdown","id":"a98f3aac-4908-4f7f-b54e-eb28a5e11a38","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Load ingested data from staging zone"]},{"cell_type":"code","execution_count":null,"id":"0b57611a-575c-484e-b668-51e5fdadc824","metadata":{},"outputs":[],"source":["import pandas as pd\n","\n","# Load Yellow Taxi Trip Records parquet file from staging zone to pandas dataframe\n","year = 
\"2022\"\n","staging_path = \"02_staging\"\n","\n","pd_df = pd.read_parquet(f\"/lakehouse/default/Files/{staging_path}/yellow_taxi_tripdata_{year}.parquet\", engine=\"pyarrow\")\n","pd_df.head()\n"]},{"cell_type":"code","execution_count":null,"id":"f38d4394-076c-46c5-86fe-3c54eca92080","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Load location zones data from landing zone\n","landing_path = \"01_landing\"\n","zones_df = pd.read_csv(f\"/lakehouse/default/Files/{landing_path}/taxi_zone_lookup.csv\")\n","zones_df.head()\n"]},{"cell_type":"markdown","id":"1a167bbe-f0b1-4351-b18c-93fe3bbbd480","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Data cleansing"]},{"cell_type":"code","execution_count":null,"id":"f7dc8421-e1b5-4ede-8869-c661ee03b2a7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Remove rows with null passenger_count\n","pd_df = pd_df.dropna(subset=[\"passenger_count\"])\n","pd_df.isnull().sum()\n"]},{"cell_type":"code","execution_count":null,"id":"7a2304ea-4269-4e77-95f1-af079cafc61b","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Remove rows with unknown location ID (264 and 265) for PULocationID and DOLocationID columns\n","pd_df = pd_df.drop(pd_df[\"PULocationID\"].loc[(pd_df[\"PULocationID\"] == 264) | (pd_df[\"PULocationID\"] == 265)].index)\n","pd_df = pd_df.drop(pd_df[\"DOLocationID\"].loc[(pd_df[\"DOLocationID\"] == 264) | (pd_df[\"DOLocationID\"] == 265)].index)\n"]},{"cell_type":"code","execution_count":null,"id":"ba4bdfcb-6f55-4e92-addf-fb34095186ee","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Remove rows with null values for service_zone column of zones data 
\n","zones_df = zones_df.dropna(subset=[\"service_zone\"])\n","zones_df.isnull().sum()\n"]},{"cell_type":"markdown","id":"12bf196f-5096-4850-bb87-5c34ad685838","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Sink cleaned data to standardization zone"]},{"cell_type":"code","execution_count":null,"id":"3d354eb6-44b5-482c-b9da-03c0516529d5","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Sink cleaned trip records and zones data to 03_standard path of Fabric OneLake\n","standard_path = \"03_standard\"\n","mssparkutils.fs.mkdirs(f\"Files/{standard_path}\")\n","\n","pd_df.to_parquet(f\"/lakehouse/default/Files/{standard_path}/cleaned_yellow_taxi_tripdata_{year}.parquet\")\n","zones_df.to_parquet(f\"/lakehouse/default/Files/{standard_path}/nyc_zones.parquet\")\n"]},{"cell_type":"markdown","id":"0d1c6ab9-fef3-46de-b394-a2994855e71c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register data assets and lineage of data pipeline to Purview"]},{"cell_type":"code","execution_count":null,"id":"2f7cb196-1b42-4f17-9d5f-9a05cc9b3d57","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run data_catalog_and_lineage"]},{"cell_type":"code","execution_count":null,"id":"1f8f353c-4b41-4c53-b276-f68338408399","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run utils"]},{"cell_type":"code","execution_count":null,"id":"898ba087-d7af-4da2-aab3-e1236b8fefa7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["purview_data_catalog = PurviewDataCatalog()\n","\n","fabric_onelake_tenant, fabric_workspace_id, fabric_lakehouse_id = get_onelake_info()\n","onelake_base_path = 
f\"abfss://{fabric_workspace_id}@{fabric_onelake_tenant}.dfs.fabric.microsoft.com/{fabric_lakehouse_id}/Files\"\n","\n","# Create source data assets list\n","source_data_assets = []\n","trip_data_source_file = f\"yellow_taxi_tripdata_{year}.parquet\"\n","source_data_asset_1 = DataAsset(trip_data_source_file,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{staging_path}/{trip_data_source_file}\")\n","\n","zones_data_source_file = f\"taxi_zone_lookup.csv\" \n","source_data_asset_2 = DataAsset(zones_data_source_file,\n"," \"csv\",\n"," f\"{onelake_base_path}/{landing_path}/{zones_data_source_file}\")\n","source_data_assets.append(source_data_asset_1)\n","source_data_assets.append(source_data_asset_2)\n","\n","# Create sink data assets list\n","sink_data_assets = []\n","cleaned_trip_data_file = f\"cleaned_yellow_taxi_tripdata_{year}.parquet\"\n","sink_data_asset_1 = DataAsset(cleaned_trip_data_file,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{standard_path}/{cleaned_trip_data_file}\")\n","\n","cleaned_zones_data = \"nyc_zones.parquet\"\n","sink_data_asset_2 = DataAsset(cleaned_zones_data,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{standard_path}/{cleaned_zones_data}\")\n","\n","sink_data_assets.append(sink_data_asset_1)\n","sink_data_assets.append(sink_data_asset_2)\n","\n","# Create process data asset\n","current_notebook_context = mssparkutils.notebook.nb.context\n","notebook_id = current_notebook_context[\"currentNotebookId\"]\n","# notebook_name = current_notebook_context[\"currentNotebookName\"]\n","process_data_asset = DataAsset(\"data_cleansing (Fabric notebook)\",\n"," \"process\",\n"," f\"https://{fabric_tenant}.powerbi.com/groups/{fabric_workspace_id}/synapsenotebooks/{notebook_id}\")\n","\n","# Create lineage for data pipeline\n","data_pipeline_lineage = DataLineage(source_data_assets, sink_data_assets, process_data_asset)\n","\n","# Register lineage of data pipeline to 
Purview\n","purview_data_catalog.register_lineage(data_pipeline_lineage)\n"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python"},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_ingestion.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_ingestion.ipynb new file mode 100644 index 000000000..84b0f6c2d --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_ingestion.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"87da1331-3fb3-47c3-b5d4-ee563b743fd2","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Parameters Setup"]},{"cell_type":"code","execution_count":null,"id":"a726f1c0-3576-483c-82c8-4cfbcfd3d373","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":["parameters"]},"outputs":[],"source":["fabric_tenant = spark.conf.get(\"spark.fsd.fabric.tenant\")\n"]},{"cell_type":"markdown","id":"c4b83221-142c-49c4-ad3c-75c8fc6ae84b","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Load landed Yellow Taxi Trip Records data from landing zone"]},{"cell_type":"code","execution_count":null,"id":"63969bb3-0fc5-4270-954c-ce982718004a","metadata":{},"outputs":[],"source":["import pandas as pd\n","\n","# Load Yellow Taxi Trip Records source parquet files to pandas dataframe\n","year = \"2022\"\n","landing_path = \"01_landing\"\n","pd_df = pd.DataFrame()\n","\n","for index in range(1, 13):\n"," 
month = str(index).zfill(2)\n"," monthly_df = pd.DataFrame(pd.read_parquet(f\"/lakehouse/default/Files/{landing_path}/yellow_tripdata_{year}-{month}.parquet\", engine=\"pyarrow\"))\n","\n"," pd_df = pd.concat([pd_df, monthly_df], ignore_index=True)\n","\n","pd_df.count()\n"]},{"cell_type":"code","execution_count":null,"id":"92b764db-c232-4196-9c92-f86248604b33","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Sink concatenated trip records to 02_staging path of Fabric OneLake\n","staging_path = \"02_staging\"\n","mssparkutils.fs.mkdirs(f\"Files/{staging_path}\")\n","\n","pd_df.to_parquet(f\"/lakehouse/default/Files/{staging_path}/yellow_taxi_tripdata_{year}.parquet\")\n"]},{"cell_type":"markdown","id":"e16b2477-ee2d-43cd-8ba7-613685f0871b","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register data assets and lineage of data pipeline to Purview"]},{"cell_type":"code","execution_count":null,"id":"c0888428-ecaf-4e0d-98c2-e37aab4f2cab","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run data_catalog_and_lineage"]},{"cell_type":"code","execution_count":null,"id":"18808285-9a94-4a9f-8187-68b4bb7b7e70","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run utils"]},{"cell_type":"code","execution_count":null,"id":"ad6cbcac-9a4f-4ea5-928e-6c35b19bc87f","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["purview_data_catalog = PurviewDataCatalog()\n","\n","fabric_onelake_tenant, fabric_workspace_id, fabric_lakehouse_id = get_onelake_info()\n","onelake_base_path = f\"abfss://{fabric_workspace_id}@{fabric_onelake_tenant}.dfs.fabric.microsoft.com/{fabric_lakehouse_id}/Files\"\n","\n","# Create source data 
assets list\n","source_data_assets = []\n","for index in range(1, 13):\n"," month = str(index).zfill(2)\n"," source_file_name = f\"yellow_tripdata_{year}-{month}.parquet\"\n"," source_data_asset = DataAsset(source_file_name,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{landing_path}/{source_file_name}\")\n","\n"," source_data_assets.append(source_data_asset)\n","\n","# Create sink data assets list\n","sink_file_name = f\"yellow_taxi_tripdata_{year}.parquet\"\n","sink_data_asset = DataAsset(sink_file_name,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{staging_path}/{sink_file_name}\")\n","\n","sink_data_assets = [sink_data_asset]\n","\n","# Create process data asset\n","current_notebook_context = mssparkutils.notebook.nb.context\n","notebook_id = current_notebook_context[\"currentNotebookId\"]\n","# notebook_name = current_notebook_context[\"currentNotebookName\"]\n","process_data_asset = DataAsset(\"data_ingestion (Fabric notebook)\",\n"," \"process\",\n"," f\"https://{fabric_tenant}.powerbi.com/groups/{fabric_workspace_id}/synapsenotebooks/{notebook_id}\")\n","\n","# Create lineage for data pipeline\n","data_pipeline_lineage = DataLineage(source_data_assets, sink_data_assets, process_data_asset)\n","\n","# Register lineage of data pipeline to Purview\n","purview_data_catalog.register_lineage(data_pipeline_lineage)\n"]},{"cell_type":"markdown","id":"ef102f69-9f6b-4429-831e-0dd4877a46ec","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## "]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse 
PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_transformation.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_transformation.ipynb new file mode 100644 index 000000000..968fe0393 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_transformation.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"cc86929d-e563-43f4-8a9f-9567100f3381","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Parameter Setup"]},{"cell_type":"code","execution_count":null,"id":"d60b6aa9-2f23-497c-a2df-a82ca1b0cea9","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":["parameters"]},"outputs":[],"source":["fabric_tenant = spark.conf.get(\"spark.fsd.fabric.tenant\")\n"]},{"cell_type":"markdown","id":"2525717a-a04d-43eb-9a95-20c10abfd29c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Load cleaned data from standardization zone"]},{"cell_type":"code","execution_count":null,"id":"0e3b7f39-343d-4898-9896-7357dc6466b7","metadata":{},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","\n","# Load cleaned Yellow Taxi Trip Records parquet file from standardization zone to pandas dataframe\n","year = \"2022\"\n","standard_path = \"03_standard\"\n","\n","pd_df = pd.read_parquet(f\"/lakehouse/default/Files/{standard_path}/cleaned_yellow_taxi_tripdata_{year}.parquet\", 
engine=\"pyarrow\")\n","pd_df.head()\n"]},{"cell_type":"code","execution_count":null,"id":"29a12eaf-098f-46b8-93ed-fbb82789a1c1","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Load location zones data from standardization zone\n","zones_df = pd.read_parquet(f\"/lakehouse/default/Files/{standard_path}/nyc_zones.parquet\")\n","zones_df.head()\n"]},{"cell_type":"markdown","id":"2f99d28d-ca94-423b-853c-a7ce0c867d96","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Data transformation"]},{"cell_type":"code","execution_count":null,"id":"12a0b630-7ef2-469b-9989-2f50a1d21d71","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Derive month columns from pickup and dropoff datetime columns\n","pd_df[\"month_pickup\"] = pd_df[\"tpep_pickup_datetime\"].dt.month.astype(np.uint8)\n","pd_df[\"month_dropoff\"] = pd_df[\"tpep_dropoff_datetime\"].dt.month.astype(np.uint8)\n","\n","# Derive weekday columns\n","pd_df[\"weekday_pickup\"] = pd_df[\"tpep_pickup_datetime\"].dt.weekday.astype(np.uint8)\n","pd_df[\"weekday_dropoff\"] = pd_df[\"tpep_dropoff_datetime\"].dt.weekday.astype(np.uint8)\n","\n","# Derive hour of day columns\n","pd_df[\"hour_pickup\"] = pd_df[\"tpep_pickup_datetime\"].dt.hour.astype(np.uint8)\n","pd_df[\"hour_dropoff\"] = pd_df[\"tpep_dropoff_datetime\"].dt.hour.astype(np.uint8)\n","\n","pd_df.head()\n"]},{"cell_type":"code","execution_count":null,"id":"ccb9064a-b34c-4bce-b865-924730ca751c","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Join location zones data with trip data\n","pd_df = pd_df.join(zones_df.set_index(\"LocationID\"), 
on=\"PULocationID\")\n","pd_df.head()\n"]},{"cell_type":"code","execution_count":null,"id":"a7772656-f515-4fc9-84f9-2ec2d7269aa7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Aggregate taxi demand by hour_pickup, weekday_pickup, month_pickup and Borough columns\n","transformed_df = pd_df.groupby([\"hour_pickup\", \"weekday_pickup\", \"month_pickup\", \"Borough\"]).agg(\n"," demand=('hour_pickup', 'size'),\n"," pickup_timestamp=('tpep_pickup_datetime', 'first')\n",").reset_index()\n"]},{"cell_type":"code","execution_count":null,"id":"8cc29368-dac9-45c0-b4fe-67a4cb779776","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Encode Borough column as integer IDs\n","borough_array = zones_df[\"Borough\"].unique()\n","\n","transformed_df[\"borough_id\"] = list(map(lambda x: list(borough_array).index(x), transformed_df[\"Borough\"]))\n"]},{"cell_type":"code","execution_count":null,"id":"4afad66a-002b-4dfd-928c-6542b762d57d","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from sklearn.preprocessing import MinMaxScaler\n","\n","# Scale demand column with min-max scaler\n","scaler = MinMaxScaler()\n","transformed_df[\"scaled_demand\"] = scaler.fit_transform(transformed_df[[\"demand\"]])\n","transformed_df\n"]},{"cell_type":"markdown","id":"3c7ce6b4-9722-45a9-9826-3cbd27a6847c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Sink transformed data to standardization zone"]},{"cell_type":"code","execution_count":null,"id":"879de3de-ff44-49c5-957f-d1b6c102794b","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Sink transformed trip records to 03_standard path of Fabric 
OneLake\n","mssparkutils.fs.mkdirs(f\"Files/{standard_path}\")\n","\n","transformed_df.to_csv(f\"/lakehouse/default/Files/{standard_path}/transformed_yellow_taxi_tripdata_{year}.csv\")\n"]},{"cell_type":"markdown","id":"39427db2-95a7-4e2b-ba82-5c4b5607b4e0","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register data assets and lineage of target data pipeline to Purview"]},{"cell_type":"code","execution_count":null,"id":"4aaae441-9c3d-4bfa-82ea-d5ba54da7ec8","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run data_catalog_and_lineage"]},{"cell_type":"code","execution_count":null,"id":"08f9e7a6-5e16-43f4-ac05-a84645d6e6f7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run utils"]},{"cell_type":"code","execution_count":null,"id":"6026aa9c-7a8a-4866-b9d2-f28ed924a765","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["purview_data_catalog = PurviewDataCatalog()\n","\n","fabric_onelake_tenant, fabric_workspace_id, fabric_lakehouse_id = get_onelake_info()\n","onelake_base_path = f\"abfss://{fabric_workspace_id}@{fabric_onelake_tenant}.dfs.fabric.microsoft.com/{fabric_lakehouse_id}/Files\"\n","\n","# Create source data assets list\n","source_data_assets = []\n","trip_data_source_file = f\"cleaned_yellow_taxi_tripdata_{year}.parquet\"\n","source_data_asset_1 = DataAsset(trip_data_source_file,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{standard_path}/{trip_data_source_file}\")\n","\n","zones_data_source_file = f\"nyc_zones.parquet\" \n","source_data_asset_2 = DataAsset(zones_data_source_file,\n"," \"parquet\",\n"," 
f\"{onelake_base_path}/{standard_path}/{zones_data_source_file}\")\n","source_data_assets.append(source_data_asset_1)\n","source_data_assets.append(source_data_asset_2)\n","\n","# Create sink data assets list\n","sink_data_assets = []\n","transformed_trip_data_file = f\"transformed_yellow_taxi_tripdata_{year}.csv\"\n","sink_data_asset = DataAsset(transformed_trip_data_file,\n"," \"parquet\",\n"," f\"{onelake_base_path}/{standard_path}/{transformed_trip_data_file}\")\n","sink_data_assets.append(sink_data_asset)\n","\n","# Create process data asset\n","current_notebook_context = mssparkutils.notebook.nb.context\n","notebook_id = current_notebook_context[\"currentNotebookId\"]\n","# notebook_name = current_notebook_context[\"currentNotebookName\"]\n","process_data_asset = DataAsset(\"data_transformation (Fabric notebook)\",\n"," \"process\",\n"," f\"https://{fabric_tenant}.powerbi.com/groups/{fabric_workspace_id}/synapsenotebooks/{notebook_id}\")\n","\n","# Create lineage for data pipeline\n","data_pipeline_lineage = DataLineage(source_data_assets, sink_data_assets, process_data_asset)\n","\n","# Register lineage of data pipeline to Purview\n","purview_data_catalog.register_lineage(data_pipeline_lineage)\n"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_validation.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_validation.ipynb new 
file mode 100644 index 000000000..1ab6736a1 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/data_validation.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"6d7ac038-7884-4ee7-8817-353a213f20c8","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## Use Great Expectations For Data Validation"]},{"cell_type":"code","execution_count":null,"id":"775d88c0-f88d-49ba-8219-e0951e782ece","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%pip install great-expectations"]},{"cell_type":"code","execution_count":null,"id":"96fd341d-c774-4a92-840e-300803d27246","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%pip install opencensus-ext-azure"]},{"cell_type":"markdown","id":"55f91914-c5f4-469e-acca-0b66787a1e9c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Prepare Data Source"]},{"cell_type":"code","execution_count":null,"id":"957bd000-e467-4d87-bc05-05b6a4c8ae9e","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["import pandas as pd\n","# Load Yellow Taxi Trip Records parquet file from staging zone to pandas dataframe\n","year = \"2022\"\n","staging_path = \"02_staging\"\n","\n","pd_df = pd.read_parquet(f\"/lakehouse/default/Files/{staging_path}/yellow_taxi_tripdata_{year}.parquet\", engine=\"pyarrow\")\n","pd_df.head()"]},{"cell_type":"markdown","id":"35be2e3a-c34d-4c27-a832-01ff3f5cd581","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## 1. 
Configure Data Context\n","https://docs.greatexpectations.io/docs/terms/data_context"]},{"cell_type":"code","execution_count":null,"id":"c55a2bdf-6572-4b30-8780-41e2d4cb60fc","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from ruamel import yaml\n","from great_expectations.core.batch import RuntimeBatchRequest\n","from great_expectations.data_context import BaseDataContext\n","from great_expectations.data_context.types.base import (\n"," DataContextConfig,\n"," DatasourceConfig,\n"," FilesystemStoreBackendDefaults,\n",")\n","\n","work_path = \"/lakehouse/default/Files/validation\""]},{"cell_type":"code","execution_count":null,"id":"3f4d64f5-2fa3-4006-a8b7-c81bc5fab2c4","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["data_context_config = DataContextConfig(\n"," datasources={\n"," \"transformed_data_source\": DatasourceConfig(\n"," class_name=\"Datasource\",\n"," # PandasExecutionEngine, or SparkDFExecutionEngine decides which kind of dataframe to use\n"," execution_engine={\"class_name\": \"PandasExecutionEngine\"},\n"," data_connectors={\n"," \"transformed_data_connector\": {\n"," \"module_name\": \"great_expectations.datasource.data_connector\",\n"," \"class_name\": \"RuntimeDataConnector\",\n"," \"batch_identifiers\": [\n"," \"environment\",\n"," ],\n"," }\n"," }\n"," )\n"," },\n"," store_backend_defaults=FilesystemStoreBackendDefaults(root_directory=work_path)\n",")\n","context = BaseDataContext(project_config=data_context_config)"]},{"cell_type":"markdown","id":"04f3d69f-af37-42e9-ae98-304f4aa7f23f","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## 2. 
Create a Batch Request based on dataframe\n","https://docs.greatexpectations.io/docs/terms/batch"]},{"cell_type":"code","execution_count":null,"id":"8c11fcb2-58f2-4642-9760-db3b8ac8dd92","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["batch_request = RuntimeBatchRequest(\n"," datasource_name=\"transformed_data_source\",\n"," data_connector_name=\"transformed_data_connector\",\n"," data_asset_name=\"nyctaxi_data\",\n"," batch_identifiers={\n"," \"environment\": \"stage\",\n"," },\n"," runtime_parameters={\"batch_data\": pd_df},\n",")"]},{"cell_type":"markdown","id":"acbf4a2f-dbde-46a2-9b41-1282dda522c2","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## 3. Define Expectation Suite and corresponding Data Expectations\n","https://docs.greatexpectations.io/docs/terms/expectation_suite"]},{"cell_type":"code","execution_count":null,"id":"f2a3e427-42a2-4283-b2ff-e4272eb832a3","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["expectation_suite_name = \"Nyctaxi_data_suite_basic\"\n","context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)\n","validator = context.get_validator(\n"," batch_request=batch_request,\n"," expectation_suite_name=expectation_suite_name,\n"," datasource_name=\"transformed_data_source\",\n"," data_connector_name=\"transformed_data_connector\",\n"," data_asset_name=\"nyctaxi_data\",\n",")\n","# Add Validations to suite\n","# Check available expectations: validator.list_available_expectation_types()\n","validator.expect_column_values_to_be_between(column=\"passenger_count\", min_value=0, max_value=10)\n","validator.expect_column_values_to_not_be_null(column=\"passenger_count\")\n","validator.expect_column_values_to_not_be_null(column=\"trip_distance\")\n","validator.expect_column_values_to_be_of_type(column=\"store_and_fwd_flag\", 
type_=\"object\")\n","validator.expect_column_values_to_not_be_null(column=\"fare_amount\")\n","validator.expect_column_values_to_be_of_type(column=\"payment_type\", type_=\"int\")\n","# To run validations without checkpoint\n","# validator.validate()\n","validator.save_expectation_suite(discard_failed_expectations=False)"]},{"cell_type":"markdown","id":"ca148323-8442-4def-a782-3bc426255b02","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## 4. Configure a checkpoint and run Expectation suite using checkpoint\n","https://docs.greatexpectations.io/docs/terms/checkpoint"]},{"cell_type":"code","execution_count":null,"id":"0b918d12-497d-4355-a906-24e2561e3ed3","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["my_checkpoint_name = \"Nyctaxi Data\"\n","checkpoint_config = {\n"," \"name\": my_checkpoint_name,\n"," \"config_version\": 1.0,\n"," \"class_name\": \"SimpleCheckpoint\",\n"," \"run_name_template\": \"%Y%m%d-%H%M%S-my-run-name-template\",\n","}\n","my_checkpoint = context.test_yaml_config(yaml.dump(checkpoint_config,default_flow_style=False))\n","context.add_or_update_checkpoint(**checkpoint_config)\n","# Run Checkpoint passing in expectation suite\n","checkpoint_result = context.run_checkpoint(\n"," checkpoint_name=my_checkpoint_name,\n"," validations=[\n"," {\n"," \"batch_request\": batch_request,\n"," \"expectation_suite_name\": expectation_suite_name,\n"," }\n"," ],\n",")"]},{"cell_type":"markdown","id":"da2a3d4c-f423-4d40-87f4-fe130c134e39","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## Report Data Quality Metrics to Azure Monitor"]},{"cell_type":"markdown","id":"6b5173df-1d67-435f-8e5e-9595e478ad51","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Build Data 
Docs"]},{"cell_type":"code","execution_count":null,"id":"4070479a-1291-4328-a81f-8c73b49ad6af","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["context.build_data_docs()"]},{"cell_type":"code","execution_count":null,"id":"e902bef2-ab21-4cdb-a668-9ad50c19e725","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["with open('/lakehouse/default/Files/validation/validation_results.json', 'w') as f:\n"," f.write(str(checkpoint_result))"]},{"cell_type":"markdown","id":"0eac5b29-b6d1-41cb-b8f9-cd6ae37d3fdc","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Send Data Quality Logs to Azure Monitor"]},{"cell_type":"code","execution_count":null,"id":"300a6a79-08f4-400f-89ee-b5d36f307184","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["import json\n","result_dic = checkpoint_result.to_json_dict()\n","key_name = [key for key in result_dic['run_results'].keys()][0]\n","results = result_dic['run_results'][key_name]['validation_result']['results']\n","\n","checks = {'check_name':checkpoint_result['checkpoint_config']['name'],\n"," 'data_asset_name': result_dic['run_results'][key_name]['validation_result']['meta']['active_batch_definition']['data_asset_name'],\n"," 'datasource_name': result_dic['run_results'][key_name]['validation_result']['meta']['active_batch_definition']['datasource_name'],}\n","\n","for i in range(len(results)):\n"," validation_name = results[i]['expectation_config']['expectation_type'] + \"_on_\" + results[i]['expectation_config']['kwargs']['column']\n"," checks[validation_name] = results[i]['success']\n"," if(checks[validation_name] == False):\n"," detailed_failure_info = 'failure_info_on_' + validation_name\n"," checks[detailed_failure_info] = json.dumps(results[i]['result'])\n"," 
\n","properties = {'custom_dimensions': checks}"]},{"cell_type":"code","execution_count":null,"id":"3bba2980-afd7-4e40-b724-a0c7736e8dc9","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Report Data Quality Metrics to Azure Monitor using python Azure Monitor open-census exporter \n","import logging\n","import time\n","from opencensus.ext.azure.log_exporter import AzureLogHandler\n","\n","AZURE_MONITOR_SECRET = \"InstrumentationKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"\n","# You can find the related logs in featuresinsightsxxxx\n","# Click 'Logs', and check relevant output logs by querying 'traces' table\n","logger = logging.getLogger(__name__)\n","logger.addHandler(AzureLogHandler(connection_string=AZURE_MONITOR_SECRET))\n","\n","if checkpoint_result.success is True:\n"," logger.setLevel(logging.INFO)\n"," logger.info('verifychecks', extra=properties)\n","else:\n"," logger.setLevel(logging.ERROR)\n"," logger.error('verifychecks', extra=properties)\n"," raise RuntimeError(\n"," \"The Great Expectations validation failed. 
Check \"\n"," \"the logs or the Great Expectations data docs for more information.\") \n","\n","time.sleep(16)"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{"synapse_widget":{"state":{},"token":"4c4c5a95-7d0a-46ff-8db6-e7b48831705c"}},"language":"python"},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{"application/vnd.jupyter.widget-state+json":{"state":{"004447f23e944f8ba8e00aa8f6dfb436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"018b3a47a8314abca43dc00175aa7774":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_b2523a56ad5e40778fd1e57ef24ef7b2","IPY_MODEL_655bfa185a574618b9470155f2031462","IPY_MODEL_9fe60c346d1f47e29a80fb23252e5640"],"layout":"IPY_MODEL_36dac228089744239ad3e6d96dc0c074"}},"01c977149bf349f8858f6617ad427449":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"099e2b64e4e74a9aa6eb233ac55d1326":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"0a3051b4cbcd441ca03ef12f04882c46":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_8a60b77f59624f6fac2c3f00b4b1107f","style":"IPY_MODEL_c3e1776a73284ad6889bf57cf1cfd8e4","value":" 1/1 [00:00<00:00, 
125.91it/s]"}},"0b1f5956e42f43d091eafc7480965e39":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0e061d7a2a8a4d7b922e039e94f120b4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0ecd0a0ba1fe40679aa1e906dc3450ae":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0f373dcd84ca48e98f8f3e5a3f1a3e6f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_afa3d23174d041e3a2e521ccfa1de78e","IPY_MODEL_8ee9deef78e34a289f5e60efbb1084d3","IPY_MODEL_9eb3d6df087a4ffba784420ecf868d6e"],"layout":"IPY_MODEL_2524cc60a110413ca00ea286bb6431c1"}},"102b47cebf9b43d5a1b1e61f03aef158":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_44fda475bc5f436580ff6f7f0360a710","style":"IPY_MODEL_b6d8c23728a64ddabee029daa4d26336","value":" 6/6 [00:00<00:00, 51.40it/s]"}},"1081c895fca742239008eae9eaedb4c5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_95e691d9913244d6b03628ea94d74acc","max":8,"style":"IPY_MODEL_70eb1aab4f3f48449abb957cf4e3ea51","value":8}},"1656f62766ac43ceaea29c6f0db02232":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_843882531aa64669bc848600cd9220fc","max":6,"style":"IPY_MODEL_6a98c0d3a66a495680ceadf9852f20cf","value":6}},"1970205870124bd6b142f1f3ee03d1d0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_f699af27d3a3494987c54ad9aebddb89","style":"IPY_MODEL_e226a99abc754773978b88b5a1897897","value":"Calculating Metrics: 
100%"}},"1a27d2481b8d4231836b2633b860302b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1af729b669734af4bc94ec438aa3d94c":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1bebdcaf99d8424b9ea4382ad8d87ca4":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"1ed9bd564e41413985421c9e2b88ae4d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1f552688a73c4ce192e57de136574f0d":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"1fc70689c8cc4f1d90dacc2e5ab44586":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_a1923d78f384400ebaad53109b62b583","style":"IPY_MODEL_b426da7478d04f8a900142ea20cdfe93","value":" 8/8 [00:05<00:00, 
1.50it/s]"}},"214388f3ffd445ba80939823b47239f5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"22d45b3aefe54f66a8def1eef1ea04ad":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"22ef816fad984942bd3f76730305b847":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"242787a512d5458a98134896e7c3d37b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2524cc60a110413ca00ea286bb6431c1":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"26cea904bbf74137988295bb27a5d19c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2cee8f764ebc42b48df71dd600d6727a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"2e479b43beaa42079d39a19b2f810513":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_9e179de97ae54b4c9add492e5b3192a3","max":1,"style":"IPY_MODEL_36d3c45221644912a0958caa47f66f2d","value":1}},"2e6085eb81824c8f98480aa8189e74b5":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2e8ef73cf092493892f6795939aaf3cd":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2f27a16df12e4644b63ed82512e00f48":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_0e061d7a2a8a4d7b922e039e94f120b4","max":6,"style":"IPY_MO
DEL_6d7ad0f9784445c2be08769c3060eed0","value":6}},"32cd3f08fd7d41dfa10861fd68d9af8b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"32d05d9786fd4d7585d67a123bf94760":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"3457713bdcc14c7f88fdd31bfa62f45a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"36d3c45221644912a0958caa47f66f2d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"36dac228089744239ad3e6d96dc0c074":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"37524c973d0741719c3c2d5fe79f9509":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fe5f82b26e9343719abdf33c61d26830","style":"IPY_MODEL_f27abf115a504c44a979b63293173f1a","value":" 8/8 [00:05<00:00, 1.50it/s]"}},"376a64e530954106824c12b46b3c89c3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3b839d0e7e8f4d86ad40e9a7f3eccd3a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_1f552688a73c4ce192e57de136574f0d","style":"IPY_MODEL_a52447e4120740d18bbaa3e32de32743","value":"Calculating Metrics: 
100%"}},"3e48d1e877af44ebab42eaeecd718b71":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3fa551bac6be438f94b528744bdbc50e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"432a77e4a74c446380582a9b4a3036e8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4363f468267f4b78ad1037db994ed5b8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_d4f17d2d5af849368d0ea40a2e4f42e7","max":23,"style":"IPY_MODEL_83386cde021c4ab1a4ccbad0301030bc","value":23}},"44fda475bc5f436580ff6f7f0360a710":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"45e886b4dcf9491eb66ac4e3a964bdd0":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"479fdc7e2d6e494e9fd32aae7c4b88de":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_93ea6576e8ba4ae39bac8d3cb36a6ce0","IPY_MODEL_2e479b43beaa42079d39a19b2f810513","IPY_MODEL_fe7f6ca0d3494e88b98855c27f463446"],"layout":"IPY_MODEL_a998e90371b04fd1908c233399286a37"}},"480175b1f155490f85c292938d6fd9fe":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"480839fe26d1442fb804004ba380f16a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"48b8b3a1febc44d8b527abb45a4c6e93":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"49cccc655ad140eaa0ce6b0682ac07ca":{"model_module"
:"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4c7d8542266d4499ae0a27248ac9dad4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4caf6ac8eb6443dc930f180ec47e4449":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4e965d3296ad41b6a08cbcc22a46aa1a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4f0d70e5711f422b8e2cf06682ff13b5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_22ef816fad984942bd3f76730305b847","style":"IPY_MODEL_e9100a732cd84ec99d6713e96836be1b","value":" 1/1 [00:00<00:00, 129.66it/s]"}},"52cc7c8e818043c4aff7ab677df9cae6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_3b839d0e7e8f4d86ad40e9a7f3eccd3a","IPY_MODEL_802d9d13243c42b3b9f4f68f9321e7c6","IPY_MODEL_ed654532b3d7439080c348131fc9c0c9"],"layout":"IPY_MODEL_ca73ebe5e4cd49e5897f3ddc028470cc"}},"534597d52d8645bc8757ff1425ab3a17":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"54a8a08b797d44a9a8561f68bdd55459":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"562ca2ade9e3465aa68bdab306444a39":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"576aded4582246d2b39f3cab9b56ed26":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"57d8d1f4e5634b8e90f6ecdb50cdc12d":{"model_module":"@jupyter-widgets/base","model_mod
ule_version":"2.0.0","model_name":"LayoutModel","state":{}},"58048cc7745940e18150a63c0497738a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fa82ced395de44e6b8b70ff3ff113bf9","style":"IPY_MODEL_df7cc85b6690421d9aa7a0d2594b4049","value":"Calculating Metrics: 100%"}},"59bc4c86d0f045f9a042c2b5d070055b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_8bea0ad7f56b49bab68c17c320893f61","IPY_MODEL_dbaa0230ebc748488f48b228d4170fe9","IPY_MODEL_a84177c22a674df9a42f45812e3132c0"],"layout":"IPY_MODEL_ace1fc6106054eedbdf087bed1eadf44"}},"59f43dfca0054f15a99268f508124a97":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_b8253946f31d4045aa124d5d90ecbaee","IPY_MODEL_a4715e13af9f409aaec7428886ea5a20","IPY_MODEL_1fc70689c8cc4f1d90dacc2e5ab44586"],"layout":"IPY_MODEL_610c5f44dd37430cb2b42bdb4fce9a8b"}},"600fc63819c14197a4e9d0b91645149e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_945abfb337c848ddaf42d7484953e699","IPY_MODEL_ab6d06faa07841d280535b96517d727a","IPY_MODEL_0a3051b4cbcd441ca03ef12f04882c46"],"layout":"IPY_MODEL_4c7d8542266d4499ae0a27248ac9dad4"}},"60790356c19e42b8be6d813647e3cbfe":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_45e886b4dcf9491eb66ac4e3a964bdd0","style":"IPY_MODEL_bee7144b79bb41718fe8f8e79f779e3e","value":"Calculating Metrics: 
100%"}},"610c5f44dd37430cb2b42bdb4fce9a8b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6308e380d4f24f9fa9608e71e718aec1":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"655bfa185a574618b9470155f2031462":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_ca0edd0c4f29493f9547ccaa1fb983f3","max":1,"style":"IPY_MODEL_432a77e4a74c446380582a9b4a3036e8","value":1}},"6978e9f94e0b48698010fdd7b6ed2346":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6a38df1026524f01a91ed24f280b49e0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"6a98c0d3a66a495680ceadf9852f20cf":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"6d7ad0f9784445c2be08769c3060eed0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"6f289eb908b34f21be8aaec3719e0fed":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6f5de1653caf45309e3fbee63e0f782f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_6978e9f94e0b48698010fdd7b6ed2346","max":6,"style":"IPY_MODEL_49cccc655ad140eaa0ce6b0682ac07ca","value":6}},"70eb1aab4f3f48449abb957cf4e3ea51":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"
728be7b6e2f740b6b5f7cee553b5fa0d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"73da67d9b5044d1585a97df692aa5439":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"7727e1a53f264d2cbf099204c076fafa":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_95689701bc6446e6844a1ab3ca80ddf6","IPY_MODEL_4363f468267f4b78ad1037db994ed5b8","IPY_MODEL_eccc28fbe2024199bf2d72f7f54b9505"],"layout":"IPY_MODEL_4e965d3296ad41b6a08cbcc22a46aa1a"}},"78572d2f8b394e73acacb3dd0e97926f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_d302f2180cc94fda98787ea3d278f6fa","style":"IPY_MODEL_eac72d9582d741f1839c6577a127f8f8","value":"Calculating Metrics: 100%"}},"802d9d13243c42b3b9f4f68f9321e7c6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_54a8a08b797d44a9a8561f68bdd55459","max":23,"style":"IPY_MODEL_1ed9bd564e41413985421c9e2b88ae4d","value":23}},"83386cde021c4ab1a4ccbad0301030bc":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"843882531aa64669bc848600cd9220fc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"867c1e780e994f2bab762ad9e97a8414":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_0ecd0a0ba1fe40679aa1e906dc3450ae","style":"IPY_MODEL_562ca2ade9e3465aa68bdab306444a39","value":" 6/6 [00:00<00:00, 
33.84it/s]"}},"8a60b77f59624f6fac2c3f00b4b1107f":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"8bea0ad7f56b49bab68c17c320893f61":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_32cd3f08fd7d41dfa10861fd68d9af8b","style":"IPY_MODEL_8dc1642e60164570b4426180fdce16e7","value":"Calculating Metrics: 100%"}},"8c2aa368d6124f34800a0f74d92a3b3b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8c2c7e44a3a742ab8a84413d2a95e0af":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8c67e116afc34cdfa1c4e192e59851ff":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8d1c181c353b45d096c471435f03f7c0":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"8dc1642e60164570b4426180fdce16e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8ee9deef78e34a289f5e60efbb1084d3":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_22d45b3aefe54f66a8def1eef1ea04ad","max":6,"style":"IPY_MODEL_4caf6ac8eb6443dc930f180ec47e4449","value":6}},"93ea6576e8ba4ae39bac8d3cb36a6ce0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_99cd10935def4bf1aab2a185a830b761","style":"IPY_MODEL_bc3045b91698442396ec8ba1fe59aace","value":"Calculati
ng Metrics: 100%"}},"945abfb337c848ddaf42d7484953e699":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_9f9d53657e1e4aa8b14fc410fd698c2c","style":"IPY_MODEL_099e2b64e4e74a9aa6eb233ac55d1326","value":"Calculating Metrics: 100%"}},"94e5499a5284494d90a4cbdbe2cf7c6b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"95689701bc6446e6844a1ab3ca80ddf6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_bc667c51b9f846979c760e2893f991d8","style":"IPY_MODEL_d85392f7b90548a58047e5f7f45f6436","value":"Calculating Metrics: 100%"}},"95e691d9913244d6b03628ea94d74acc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"99cd10935def4bf1aab2a185a830b761":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9e179de97ae54b4c9add492e5b3192a3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9eb3d6df087a4ffba784420ecf868d6e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_48b8b3a1febc44d8b527abb45a4c6e93","style":"IPY_MODEL_8c67e116afc34cdfa1c4e192e59851ff","value":" 6/6 [00:00<00:00, 49.85it/s]"}},"9f9d53657e1e4aa8b14fc410fd698c2c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9fe60c346d1f47e29a80fb23252e5640":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_e5e96afa05bc4888ac5b1489d0fb4893","style":"IPY_MODEL_214388f3ffd445ba80939823b47239f5","value":" 1/1 [00:00<00:00, 
124.56it/s]"}},"a0a2c9d95264461e9d1252babf2f6656":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_58048cc7745940e18150a63c0497738a","IPY_MODEL_1081c895fca742239008eae9eaedb4c5","IPY_MODEL_37524c973d0741719c3c2d5fe79f9509"],"layout":"IPY_MODEL_fe7d5df2db1b4f6da36387d9b6d2a614"}},"a1923d78f384400ebaad53109b62b583":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"a4715e13af9f409aaec7428886ea5a20":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_26cea904bbf74137988295bb27a5d19c","max":8,"style":"IPY_MODEL_32d05d9786fd4d7585d67a123bf94760","value":8}},"a52447e4120740d18bbaa3e32de32743":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"a84177c22a674df9a42f45812e3132c0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_376a64e530954106824c12b46b3c89c3","style":"IPY_MODEL_6a38df1026524f01a91ed24f280b49e0","value":" 6/6 [00:00<00:00, 
48.53it/s]"}},"a998e90371b04fd1908c233399286a37":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ab6d06faa07841d280535b96517d727a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_2e6085eb81824c8f98480aa8189e74b5","max":1,"style":"IPY_MODEL_534597d52d8645bc8757ff1425ab3a17","value":1}},"ac0d4a81cabf4dd7955f1bbaf38578a5":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"acd23c095ecc462eb363c4cf4323f276":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ace1fc6106054eedbdf087bed1eadf44":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"afa3d23174d041e3a2e521ccfa1de78e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_d6724465b2e7421a886e2d502ec68095","style":"IPY_MODEL_480175b1f155490f85c292938d6fd9fe","value":"Calculating Metrics: 100%"}},"b2523a56ad5e40778fd1e57ef24ef7b2":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_2e8ef73cf092493892f6795939aaf3cd","style":"IPY_MODEL_6308e380d4f24f9fa9608e71e718aec1","value":"Calculating Metrics: 100%"}},"b426da7478d04f8a900142ea20cdfe93":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"b4bcb692d56c456ca87bcaec59af9d4e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_6f289eb908b34f21be8aaec3719e0fed","style":"IPY_MODEL_8c2aa368d6124f34800a0f74d92a3b3b","value":" 6/6 [00:00<00:00, 
50.83it/s]"}},"b6d8c23728a64ddabee029daa4d26336":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"b7dd1fd5391b48e3a5c20b757efe4a40":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_1970205870124bd6b142f1f3ee03d1d0","IPY_MODEL_fdba21199a194464ba4097a5f5c734f2","IPY_MODEL_d8730b191766440881bce0ab567f0d5b"],"layout":"IPY_MODEL_576aded4582246d2b39f3cab9b56ed26"}},"b8253946f31d4045aa124d5d90ecbaee":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_ac0d4a81cabf4dd7955f1bbaf38578a5","style":"IPY_MODEL_1bebdcaf99d8424b9ea4382ad8d87ca4","value":"Calculating Metrics: 100%"}},"bab97edbebe44417b266ba4be0370c28":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"bc3045b91698442396ec8ba1fe59aace":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"bc667c51b9f846979c760e2893f991d8":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"bee7144b79bb41718fe8f8e79f779e3e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"c3aebf50a10046a6aa535be64dc7dbe9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_480839fe26d1442fb804004ba380f16a","max":1,"style":"IPY_MODEL_1a27d2481b8d4231836b2633b860302b","value":1}},"c3e1776a73284ad6889bf57cf1cfd8e4":{"model_module":"@jupyter-widge
ts/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"ca0edd0c4f29493f9547ccaa1fb983f3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ca73ebe5e4cd49e5897f3ddc028470cc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d302f2180cc94fda98787ea3d278f6fa":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d4f17d2d5af849368d0ea40a2e4f42e7":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d527245bcc444a20a726033b34b7e666":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_78572d2f8b394e73acacb3dd0e97926f","IPY_MODEL_c3aebf50a10046a6aa535be64dc7dbe9","IPY_MODEL_4f0d70e5711f422b8e2cf06682ff13b5"],"layout":"IPY_MODEL_01c977149bf349f8858f6617ad427449"}},"d5ae0e31155c46c0830a393e5c1b8cd5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_dd8f9257a4ae41b1b2e3834349745549","IPY_MODEL_6f5de1653caf45309e3fbee63e0f782f","IPY_MODEL_102b47cebf9b43d5a1b1e61f03aef158"],"layout":"IPY_MODEL_e8633d0e61724ab9ae9a3c181178694c"}},"d6724465b2e7421a886e2d502ec68095":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d85392f7b90548a58047e5f7f45f6436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"d8730b191766440881bce0ab567f0d5b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_0b1f5956e42f43d091eafc7480965e39","style":"IPY_MODEL_2cee8f764e
bc42b48df71dd600d6727a","value":" 6/6 [00:00<00:00, 44.57it/s]"}},"db3d4138c490459c8b1b68c029848c12":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_242787a512d5458a98134896e7c3d37b","style":"IPY_MODEL_94e5499a5284494d90a4cbdbe2cf7c6b","value":"Calculating Metrics: 100%"}},"dbaa0230ebc748488f48b228d4170fe9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_3457713bdcc14c7f88fdd31bfa62f45a","max":6,"style":"IPY_MODEL_3fa551bac6be438f94b528744bdbc50e","value":6}},"dd8f9257a4ae41b1b2e3834349745549":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_ed46ea00c53a4665ae057e8ac119a1da","style":"IPY_MODEL_bab97edbebe44417b266ba4be0370c28","value":"Calculating Metrics: 100%"}},"df7cc85b6690421d9aa7a0d2594b4049":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"e226a99abc754773978b88b5a1897897":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"e5e96afa05bc4888ac5b1489d0fb4893":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"e8633d0e61724ab9ae9a3c181178694c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"e9100a732cd84ec99d6713e96836be1b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"ea2d294662854523b77d217d4950a81b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","s
tate":{"children":["IPY_MODEL_60790356c19e42b8be6d813647e3cbfe","IPY_MODEL_1656f62766ac43ceaea29c6f0db02232","IPY_MODEL_b4bcb692d56c456ca87bcaec59af9d4e"],"layout":"IPY_MODEL_fe8e972de1fc44ceaffaf2a343a5d6c8"}},"eac72d9582d741f1839c6577a127f8f8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"eccc28fbe2024199bf2d72f7f54b9505":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_73da67d9b5044d1585a97df692aa5439","style":"IPY_MODEL_728be7b6e2f740b6b5f7cee553b5fa0d","value":" 23/23 [00:10<00:00, 2.12it/s]"}},"ed46ea00c53a4665ae057e8ac119a1da":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ed654532b3d7439080c348131fc9c0c9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_8d1c181c353b45d096c471435f03f7c0","style":"IPY_MODEL_8c2c7e44a3a742ab8a84413d2a95e0af","value":" 23/23 [00:10<00:00, 
2.18it/s]"}},"f27abf115a504c44a979b63293173f1a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"f2cb18a44a69406ea6a22490b6ee8426":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_db3d4138c490459c8b1b68c029848c12","IPY_MODEL_2f27a16df12e4644b63ed82512e00f48","IPY_MODEL_867c1e780e994f2bab762ad9e97a8414"],"layout":"IPY_MODEL_57d8d1f4e5634b8e90f6ecdb50cdc12d"}},"f699af27d3a3494987c54ad9aebddb89":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fa82ced395de44e6b8b70ff3ff113bf9":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fdba21199a194464ba4097a5f5c734f2":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_3e48d1e877af44ebab42eaeecd718b71","max":6,"style":"IPY_MODEL_1af729b669734af4bc94ec438aa3d94c","value":6}},"fe5f82b26e9343719abdf33c61d26830":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7d5df2db1b4f6da36387d9b6d2a614":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7f6ca0d3494e88b98855c27f463446":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_acd23c095ecc462eb363c4cf4323f276","style":"IPY_MODEL_004447f23e944f8ba8e00aa8f6dfb436","value":" 1/1 [00:00<00:00, 127.18it/s]"}},"fe8e972de1fc44ceaffaf2a343a5d6c8":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}}},"version_major":2,"version_minor":0}}},"nbformat":4,"nbformat_minor":5} \ No newline at end of file diff 
--git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/exploratory_data_analysis.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/exploratory_data_analysis.ipynb new file mode 100644 index 000000000..3002416c8 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/exploratory_data_analysis.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"fb692fa2","metadata":{},"source":["### Load ingested data from staging zone"]},{"cell_type":"code","execution_count":null,"id":"a1f94d23","metadata":{},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import seaborn as sns"]},{"cell_type":"code","execution_count":null,"id":"6a719fb9","metadata":{},"outputs":[],"source":["# Load Yellow Taxi Trip Records parquet file from staging zone to pandas dataframe\n","year = \"2022\"\n","staging_path = \"02_staging\"\n","\n","pd_df = pd.read_parquet(f\"/lakehouse/default/Files/{staging_path}/yellow_taxi_tripdata_{year}.parquet\", engine=\"pyarrow\")\n","pd_df.head()"]},{"cell_type":"code","execution_count":null,"id":"56e6a87f","metadata":{},"outputs":[],"source":["# Load location zones data from landing zone\n","landing_path = \"01_landing\"\n","zones_df = pd.read_csv(f\"/lakehouse/default/Files/{landing_path}/taxi_zone_lookup.csv\")\n","zones_df.head()\n"]},{"cell_type":"markdown","id":"918cf82d","metadata":{},"source":["## EDA"]},{"cell_type":"code","execution_count":null,"id":"f23ef820","metadata":{},"outputs":[],"source":["# Check null values for columns\n","pd_df.isnull().sum()"]},{"cell_type":"code","execution_count":null,"id":"50904bd6","metadata":{},"outputs":[],"source":["# Check unknown (264 and 265) location for PULocationID columns\n","pd_df[(pd_df[\"PULocationID\"] == 264) | (pd_df[\"PULocationID\"] == 
265)]"]},{"cell_type":"code","execution_count":null,"id":"37afb3dc","metadata":{},"outputs":[],"source":["sns.displot(pd_df[\"passenger_count\"], kde=True, stat=\"density\")\n","plt.show()"]},{"cell_type":"code","execution_count":null,"id":"1964d672","metadata":{},"outputs":[],"source":["# Check location zones data\n","zones_df.isnull().sum()"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} \ No newline at end of file diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_registration.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_registration.ipynb new file mode 100644 index 000000000..cb2bfbe5f --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_registration.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"23ecd0f8-8bfe-48f9-8697-8c5ca47d354c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Parameters Setup"]},{"cell_type":"code","execution_count":null,"id":"5c3df160-0c42-49a0-8c71-f8fc0b121525","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":["parameters"]},"outputs":[],"source":["featurestore_subscription_id = spark.conf.get(\"spark.fsd.subscription_id\")\n","featurestore_resource_group_name = spark.conf.get(\"spark.fsd.rg_name\")\n","featurestore_name = spark.conf.get(\"spark.fsd.name\")\n","\n","year = 
\"2022\"\n","standard_path = \"03_standard\"\n","\n","# the service principle should be albe to access feature store\n","client_id = spark.conf.get(\"spark.fsd.client_id\")\n","tenant_id = spark.conf.get(\"spark.fsd.tenant_id\")\n","client_secret = \"\""]},{"cell_type":"code","execution_count":null,"id":"45952099-d76a-42c5-91c4-019384080b0b","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run utils"]},{"cell_type":"code","execution_count":null,"id":"73c5305d-5916-4439-806c-fa96032229fa","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["fabric_onelake_tenant, fabric_workspace_id, fabric_lakehouse_id = get_onelake_info()\n"]},{"cell_type":"markdown","id":"9b1ea2c7-a16d-46e1-bdda-ebd81bd86a1d","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Obtain Credential and Initiate Clients "]},{"cell_type":"code","execution_count":null,"id":"53fb0203-3a1e-4528-a58b-cdcd8794d692","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from azure.identity import ClientSecretCredential\n","\n","credential = ClientSecretCredential(\n"," tenant_id=tenant_id,\n"," client_id=client_id,\n"," client_secret=client_secret\n",")"]},{"cell_type":"code","execution_count":null,"id":"32181201-1be6-46ad-88e4-b979d571e966","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# feature store client\n","from azureml.featurestore import FeatureStoreClient\n","from azure.ai.ml import MLClient\n","from azure.ai.ml.entities import (\n"," FeatureStore,\n"," FeatureStoreEntity,\n"," FeatureSet,\n",")\n","\n","featurestore = FeatureStoreClient(\n"," credential=credential,\n"," subscription_id=featurestore_subscription_id,\n"," 
resource_group_name=featurestore_resource_group_name,\n"," name=featurestore_name,\n",")\n","\n","\n","fs_client = MLClient(\n"," credential,\n"," featurestore_subscription_id,\n"," featurestore_resource_group_name,\n"," featurestore_name,\n",")"]},{"cell_type":"markdown","id":"229854e3-a017-4269-982e-55d7c008df5b","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Create Entities & Feature Sets"]},{"cell_type":"markdown","id":"2a519f66-583a-4c54-b7a2-6f05811ec544","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Create tmp folder"]},{"cell_type":"code","execution_count":null,"id":"b7999fe3-0575-4402-86e1-bae64b7f5db1","metadata":{"editable":true,"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"run_control":{"frozen":false}},"outputs":[],"source":["import time\n","import os\n","\n","version = str(int(time.time()))\n","\n","featureset_folder = f\"featuresets/nyctaxi_fs_1/{version}\"\n","os.makedirs(f\"{featureset_folder}/spec\", exist_ok=True)\n","\n","print(f\"the version is: {version}\")\n","print(f\"feature set folder is {featureset_folder}\")"]},{"cell_type":"markdown","id":"91c3f9d5-d666-4137-ba0f-7e9da9d3ae81","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Create FeatureSetSpec.yaml"]},{"cell_type":"code","execution_count":null,"id":"60d4f0fe-2ed3-4827-b25a-08842a59a896","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":[]},"outputs":[],"source":["nyctaxi_table_path = f\"abfss://{fabric_workspace_id}@{fabric_onelake_tenant}.dfs.fabric.microsoft.com/{fabric_lakehouse_id}/Files/{standard_path}/transformed_yellow_taxi_tripdata_{year}.csv\""]},{"cell_type":"code","execution_count":null,"id":"647d42fb-47b4-42b6-9359-e11f53b70561","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["with 
open(f\"{featureset_folder}/spec/FeatureSetSpec.yaml\", \"w\") as spec_yaml:\n"," spec_yaml.write(f\"\"\"\n","$schema: http://azureml/sdk-2-0/FeatureSetSpec.json\n","\n","source:\n"," type: csv\n"," path: {nyctaxi_table_path}\n"," timestamp_column: \n"," name: pickup_timestamp\n","features: \n"," - name: hour_pickup\n"," type: integer\n"," - name: weekday_pickup\n"," type: integer\n"," - name: scaled_demand\n"," type: double\n","index_columns:\n"," - name: borough_id\n"," type: integer \n"," \"\"\")"]},{"cell_type":"code","execution_count":null,"id":"2c069a94-690b-4597-bb1b-d3dea16f4edc","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["with open(f\"{featureset_folder}/spec/FeatureSetSpec.yaml\", \"r\") as spec_yaml:\n"," print(spec_yaml.read())"]},{"cell_type":"markdown","id":"3a422f80-eaea-4a13-946a-984960830c19","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register Entities"]},{"cell_type":"code","execution_count":null,"id":"b22daee4-8418-4bef-bba7-a0d11b1809ae","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from azure.ai.ml.entities import DataColumn, DataColumnType\n","\n","vendor_entity_config = FeatureStoreEntity(\n"," name=\"borough\",\n"," version=version,\n"," index_columns=[DataColumn(name=\"borough_id\", type=DataColumnType.INTEGER)],\n"," stage=\"Development\",\n"," description=\"This entity represents the index key of boroughs.\",\n"," tags={\"data_type\": \"nonPII\"},\n",")\n","\n","poller = fs_client.feature_store_entities.begin_create_or_update(vendor_entity_config)\n","print(poller.result())"]},{"cell_type":"markdown","id":"de8d438e-9d42-4dcd-96e5-6fa438f1779d","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register Feature 
Sets"]},{"cell_type":"code","execution_count":null,"id":"e3764e77-8ac6-4887-bfe7-a4b692cc5c7c","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from azure.ai.ml.entities import FeatureSetSpecification\n","\n","nyctaxi_featureset_spec_folder = f\"{featureset_folder}/spec\"\n","\n","nyctaxi_fset_config = FeatureSet(\n"," name=\"nyctaxi\",\n"," version=version,\n"," description=\"nyc taxi data\",\n"," entities=[f\"azureml:borough:{version}\"],\n"," stage=\"Development\",\n"," specification=FeatureSetSpecification(path=nyctaxi_featureset_spec_folder),\n"," tags={\"data_type\": \"nonPII\"},\n",")\n","\n","poller = fs_client.feature_sets.begin_create_or_update(nyctaxi_fset_config)\n","print(poller.result())"]},{"cell_type":"markdown","id":"05263dc4-8b32-4cfb-ac3c-63f3ec7917ae","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Try to Retrieve Features"]},{"cell_type":"code","execution_count":null,"id":"2da0b434-8aae-4347-89d5-758d45688f91","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["nyctaxi_fset = featurestore.feature_sets.get(\"nyctaxi\", version)\n","nyctaxi_fset.features"]},{"cell_type":"code","execution_count":null,"id":"97284e79-fa8b-4b9c-bd5b-b26c215abf6f","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["df = nyctaxi_fset.to_spark_dataframe().head(5)\n","display(df)"]},{"cell_type":"markdown","id":"3b1ddb90-5052-4bf4-9a95-fe7dc31f74a8","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Register MFS features lineage to 
Purview"]},{"cell_type":"code","execution_count":null,"id":"3d5e11a7-2c8d-437d-9a1b-5204865ed299","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run data_catalog_and_lineage"]},{"cell_type":"code","execution_count":null,"id":"83945aa2-0044-40d0-a493-07276ee886a9","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["purview_data_catalog = PurviewDataCatalog()\n","\n","featureset_name = \"nyctaxi\"\n","fset = featurestore.feature_sets.get(featureset_name, version)\n","\n","# Create data source assets\n","data_source_assets = []\n","data_source_path = fset.source.path\n","source_qualified_name = data_source_path\n","source_name = data_source_path.split(\"/\")[-1]\n","source_type = source_name.split(\".\")[-1]\n","data_source_asset = DataAsset(source_name, source_type, source_qualified_name)\n","data_source_assets.append(data_source_asset)\n","\n","# Create feature assets\n","target_features = [feat.name for feat in fset.features]\n","feature_assets = purview_data_catalog.prepare_feature_assets(featurestore_name,\n"," fset,\n"," target_features,\n"," **{\"tenant_id\": tenant_id,\n"," \"subscription_id\": featurestore_subscription_id,\n"," \"resource_group\": featurestore_resource_group_name})\n","\n","# Create the process asset if the feature set defines a transformation\n","process_asset = None\n","if fset.feature_transformation_code:\n"," transformer_class = fset.feature_transformation_code.transformer_class\n"," process_asset = DataAsset(transformer_class,\n"," \"process\",\n"," f\"{fset.feature_transformation_code.path}/{transformer_class}\")\n","\n","feature_lineage = DataLineage(input_data_assets=data_source_assets,\n"," output_data_assets=feature_assets,\n"," 
process_asset=process_asset)\n"]},{"cell_type":"code","execution_count":null,"id":"b1f7d5d3-18d1-40de-9e1b-b439217c1d48","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Register features lineage\n","purview_data_catalog.register_lineage(feature_lineage)\n"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{"synapse_widget":{"state":{"35f08cc4-8b03-47f1-8bce-d7af1c80e1f0":{"persist_state":{"view":{"chartOptions":{"aggregationType":"sum","binsNumber":10,"categoryFieldKeys":["1"],"chartType":"line","isStacked":false,"seriesFieldKeys":["2"],"wordFrequency":"-1"},"tableOptions":{},"type":"details"}},"sync_state":{"isSummary":false,"language":"scala","table":{"rows":[{"0":"2","1":"2022-01-10 00:55:59","2":"0","3":"0","4":"1.1817816540216029E-4","index":1},{"0":"5","1":"2022-01-03 00:31:06","2":"0","3":"0","4":"7.327046254933938E-4","index":2},{"0":"3","1":"2022-01-03 00:02:48","2":"0","3":"0","4":"0.0835283273062469","index":3},{"0":"1","1":"2022-01-03 00:02:04","2":"0","3":"0","4":"0.04270958897634073","index":4},{"0":"4","1":"2022-01-03 
00:55:10","2":"0","3":"0","4":"2.3635633080432058E-5","index":5}],"schema":[{"key":"0","name":"borough_id","type":"bigint"},{"key":"1","name":"pickup_timestamp","type":"timestamp"},{"key":"2","name":"hour_pickup","type":"bigint"},{"key":"3","name":"weekday_pickup","type":"bigint"},{"key":"4","name":"scaled_demand","type":"double"}],"truncated":false}},"type":"Synapse.DataFrame"},"f3236d2f-1792-4796-bacd-28242cd76535":{"persist_state":{"view":{"chartOptions":{"aggregationType":"count","binsNumber":10,"categoryFieldKeys":["0"],"chartType":"bar","isStacked":false,"seriesFieldKeys":["0"],"wordFrequency":"-1"},"tableOptions":{},"type":"details"}},"sync_state":{"isSummary":false,"language":"scala","table":{"rows":[{"0":"/lakehouse/default","1":"/default","2":"default_lh","3":"abfss://1ba98bca-0dde-4f68-94a5-e95121aad42e@msit-onelake.dfs.fabric.microsoft.com/bab36b8e-9831-4432-b5c2-30000709f177","4":"Lakehouse","index":1},{"0":"/synfs/nb_resource/builtin","1":"/nb_resource/builtin","2":"nb_resource","3":"Notebook Working Directory","4":"Notebook Working Directory","index":2},{"0":"/synfs/nb_resource/env","1":"/nb_resource/env","2":"nb_resource","3":"Environment Working Directory","4":"Environment Working 
Directory","index":3}],"schema":[{"key":"0","name":"localPath","type":"string"},{"key":"1","name":"mountPoint","type":"string"},{"key":"2","name":"scope","type":"string"},{"key":"3","name":"source","type":"string"},{"key":"4","name":"storageType","type":"string"}],"truncated":false}},"type":"Synapse.DataFrame"}},"token":"b520cd0a-1b3f-4478-9a6d-a9c40897b901"}},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{"35f08cc4-8b03-47f1-8bce-d7af1c80e1f0":{"persist_state":{"view":{"chartOptions":{"aggregationType":"sum","binsNumber":10,"categoryFieldKeys":["1"],"chartType":"line","isStacked":false,"seriesFieldKeys":["2"],"wordFrequency":"-1"},"tableOptions":{},"type":"details"}},"sync_state":{"isSummary":false,"language":"scala","table":{"rows":[{"0":"2","1":"2022-01-10 00:55:59","2":"0","3":"0","4":"1.1817816540216029E-4","index":1},{"0":"5","1":"2022-01-03 00:31:06","2":"0","3":"0","4":"7.327046254933938E-4","index":2},{"0":"3","1":"2022-01-03 00:02:48","2":"0","3":"0","4":"0.0835283273062469","index":3},{"0":"1","1":"2022-01-03 00:02:04","2":"0","3":"0","4":"0.04270958897634073","index":4},{"0":"4","1":"2022-01-03 
00:55:10","2":"0","3":"0","4":"2.3635633080432058E-5","index":5}],"schema":[{"key":"0","name":"borough_id","type":"bigint"},{"key":"1","name":"pickup_timestamp","type":"timestamp"},{"key":"2","name":"hour_pickup","type":"bigint"},{"key":"3","name":"weekday_pickup","type":"bigint"},{"key":"4","name":"scaled_demand","type":"double"}],"truncated":false}},"type":"Synapse.DataFrame"},"f3236d2f-1792-4796-bacd-28242cd76535":{"persist_state":{"view":{"chartOptions":{"aggregationType":"count","binsNumber":10,"categoryFieldKeys":["0"],"chartType":"bar","isStacked":false,"seriesFieldKeys":["0"],"wordFrequency":"-1"},"tableOptions":{},"type":"details"}},"sync_state":{"isSummary":false,"language":"scala","table":{"rows":[{"0":"/lakehouse/default","1":"/default","2":"default_lh","3":"abfss://1ba98bca-0dde-4f68-94a5-e95121aad42e@msit-onelake.dfs.fabric.microsoft.com/bab36b8e-9831-4432-b5c2-30000709f177","4":"Lakehouse","index":1},{"0":"/synfs/nb_resource/builtin","1":"/nb_resource/builtin","2":"nb_resource","3":"Notebook Working Directory","4":"Notebook Working Directory","index":2},{"0":"/synfs/nb_resource/env","1":"/nb_resource/env","2":"nb_resource","3":"Environment Working Directory","4":"Environment Working Directory","index":3}],"schema":[{"key":"0","name":"localPath","type":"string"},{"key":"1","name":"mountPoint","type":"string"},{"key":"2","name":"scope","type":"string"},{"key":"3","name":"source","type":"string"},{"key":"4","name":"storageType","type":"string"}],"truncated":false}},"type":"Synapse.DataFrame"}},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_retrieval.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_retrieval.ipynb new file mode 100644 index 000000000..c1032f979 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/feature_set_retrieval.ipynb @@ -0,0 +1 @@ 
+{"cells":[{"cell_type":"markdown","id":"23ecd0f8-8bfe-48f9-8697-8c5ca47d354c","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Parameters Setup"]},{"cell_type":"code","execution_count":null,"id":"5c3df160-0c42-49a0-8c71-f8fc0b121525","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}},"tags":["parameters"]},"outputs":[],"source":["featureset_name = \"nyctaxi\"\n","\n","featurestore_subscription_id = spark.conf.get(\"spark.fsd.subscription_id\")\n","featurestore_resource_group_name = spark.conf.get(\"spark.fsd.rg_name\")\n","featurestore_name = spark.conf.get(\"spark.fsd.name\")\n","\n","fabric_tenant = spark.conf.get(\"spark.fsd.fabric.tenant\")\n","\n","# the service principal should be able to access the feature store\n","client_id = spark.conf.get(\"spark.fsd.client_id\")\n","tenant_id = spark.conf.get(\"spark.fsd.tenant_id\")\n","client_secret = \"\""]},{"cell_type":"markdown","id":"9b1ea2c7-a16d-46e1-bdda-ebd81bd86a1d","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Obtain Credential and Initiate Clients "]},{"cell_type":"code","execution_count":null,"id":"53fb0203-3a1e-4528-a58b-cdcd8794d692","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from azure.identity import ClientSecretCredential\n","\n","credential = ClientSecretCredential(\n"," tenant_id=tenant_id,\n"," client_id=client_id,\n"," client_secret=client_secret\n",")"]},{"cell_type":"code","execution_count":null,"id":"32181201-1be6-46ad-88e4-b979d571e966","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# feature store client\n","from azureml.featurestore import FeatureStoreClient\n","from azure.ai.ml import MLClient\n","from azure.ai.ml.entities import (\n"," FeatureStore,\n"," FeatureStoreEntity,\n"," 
FeatureSet,\n",")\n","\n","featurestore = FeatureStoreClient(\n"," credential=credential,\n"," subscription_id=featurestore_subscription_id,\n"," resource_group_name=featurestore_resource_group_name,\n"," name=featurestore_name,\n",")\n","\n","\n","fs_client = MLClient(\n"," credential,\n"," featurestore_subscription_id,\n"," featurestore_resource_group_name,\n"," featurestore_name,\n",")"]},{"cell_type":"markdown","id":"05263dc4-8b32-4cfb-ac3c-63f3ec7917ae","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### List All Sets and Versions"]},{"cell_type":"code","execution_count":null,"id":"b678d69c-35f2-4faf-aa38-3a529cfc045b","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["all_featuresets = {}\n","for fset in fs_client.feature_sets.list():\n"," all_featuresets[fset.name] = fset.latest_version\n","all_featuresets"]},{"cell_type":"markdown","id":"b5297280-ce7b-4f75-a561-e58b3ac17db4","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["### Try to Retrieve Features"]},{"cell_type":"code","execution_count":null,"id":"db989f96-1aff-4ca5-be7b-17988ef96699","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["df = featurestore.feature_sets.get(featureset_name, all_featuresets[featureset_name]).to_spark_dataframe()"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse 
PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{"synapse_widget":{"state":{"bf658fb9-909a-4c21-9917-a5e9c12785a0":{"persist_state":{"view":{"chartOptions":{"aggregationType":"sum","binsNumber":10,"categoryFieldKeys":["1"],"chartType":"line","isStacked":false,"seriesFieldKeys":["2"],"wordFrequency":"-1"},"tableOptions":{},"type":"details"}},"sync_state":{"isSummary":false,"language":"scala","table":{"rows":[{"0":"2","1":"2022-01-10 00:55:59","2":"0","3":"0","4":"1.1817816540216029E-4","index":1},{"0":"5","1":"2022-01-03 00:31:06","2":"0","3":"0","4":"7.327046254933938E-4","index":2},{"0":"3","1":"2022-01-03 00:02:48","2":"0","3":"0","4":"0.0835283273062469","index":3},{"0":"1","1":"2022-01-03 00:02:04","2":"0","3":"0","4":"0.04270958897634073","index":4},{"0":"4","1":"2022-01-03 00:55:10","2":"0","3":"0","4":"2.3635633080432058E-5","index":5}],"schema":[{"key":"0","name":"borough_id","type":"bigint"},{"key":"1","name":"pickup_timestamp","type":"timestamp"},{"key":"2","name":"hour_pickup","type":"bigint"},{"key":"3","name":"weekday_pickup","type":"bigint"},{"key":"4","name":"scaled_demand","type":"double"}],"truncated":false}},"type":"Synapse.DataFrame"}},"token":"5afacf05-9f04-4e72-bd3e-352ba80c8308"}},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_inferencing.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_inferencing.ipynb new file mode 100644 index 000000000..729f23bbd --- /dev/null +++ 
b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_inferencing.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"adfa3dd6-6d92-4eb3-810d-503b8544f221","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Retrieve the feature set for batch inference"]},{"cell_type":"code","execution_count":null,"id":"a733b1f0-172c-4642-843a-609b3ac9785c","metadata":{},"outputs":[],"source":["%run feature_set_retrieval"]},{"cell_type":"markdown","id":"2e796944-c15c-4656-a983-c3cc8c3c7a8d","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Batch inference"]},{"cell_type":"code","execution_count":null,"id":"9a9a8908-ccd3-470c-be65-509d20cc1d87","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["import mlflow \n","from synapse.ml.predict import MLFlowTransformer \n","\n","# Pick a model version with satisfactory evaluation metrics\n","model_version = 1\n","\n","new_df = df.drop(\"pickup_timestamp\", \"scaled_demand\")\n","model = MLFlowTransformer( \n"," inputCols=new_df.columns,\n"," outputCol='scaled_demand',\n"," modelName='demand_prediction_model',\n"," modelVersion=model_version\n",") \n","batch_predictions = model.transform(new_df)"]},{"cell_type":"code","execution_count":null,"id":"89102bd3-1b4d-49e3-8247-23b74893f533","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["batch_predictions.show(5)"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse 
PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python"},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{"application/vnd.jupyter.widget-state+json":{"state":{"004447f23e944f8ba8e00aa8f6dfb436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"018b3a47a8314abca43dc00175aa7774":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_b2523a56ad5e40778fd1e57ef24ef7b2","IPY_MODEL_655bfa185a574618b9470155f2031462","IPY_MODEL_9fe60c346d1f47e29a80fb23252e5640"],"layout":"IPY_MODEL_36dac228089744239ad3e6d96dc0c074"}},"01c977149bf349f8858f6617ad427449":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"099e2b64e4e74a9aa6eb233ac55d1326":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"0a3051b4cbcd441ca03ef12f04882c46":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_8a60b77f59624f6fac2c3f00b4b1107f","style":"IPY_MODEL_c3e1776a73284ad6889bf57cf1cfd8e4","value":" 1/1 [00:00<00:00, 
125.91it/s]"}},"0b1f5956e42f43d091eafc7480965e39":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0e061d7a2a8a4d7b922e039e94f120b4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0ecd0a0ba1fe40679aa1e906dc3450ae":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0f373dcd84ca48e98f8f3e5a3f1a3e6f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_afa3d23174d041e3a2e521ccfa1de78e","IPY_MODEL_8ee9deef78e34a289f5e60efbb1084d3","IPY_MODEL_9eb3d6df087a4ffba784420ecf868d6e"],"layout":"IPY_MODEL_2524cc60a110413ca00ea286bb6431c1"}},"102b47cebf9b43d5a1b1e61f03aef158":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_44fda475bc5f436580ff6f7f0360a710","style":"IPY_MODEL_b6d8c23728a64ddabee029daa4d26336","value":" 6/6 [00:00<00:00, 51.40it/s]"}},"1081c895fca742239008eae9eaedb4c5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_95e691d9913244d6b03628ea94d74acc","max":8,"style":"IPY_MODEL_70eb1aab4f3f48449abb957cf4e3ea51","value":8}},"1656f62766ac43ceaea29c6f0db02232":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_843882531aa64669bc848600cd9220fc","max":6,"style":"IPY_MODEL_6a98c0d3a66a495680ceadf9852f20cf","value":6}},"1970205870124bd6b142f1f3ee03d1d0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_f699af27d3a3494987c54ad9aebddb89","style":"IPY_MODEL_e226a99abc754773978b88b5a1897897","value":"Calculating Metrics: 
100%"}},"1a27d2481b8d4231836b2633b860302b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1af729b669734af4bc94ec438aa3d94c":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1bebdcaf99d8424b9ea4382ad8d87ca4":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"1ed9bd564e41413985421c9e2b88ae4d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"1f552688a73c4ce192e57de136574f0d":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"1fc70689c8cc4f1d90dacc2e5ab44586":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_a1923d78f384400ebaad53109b62b583","style":"IPY_MODEL_b426da7478d04f8a900142ea20cdfe93","value":" 8/8 [00:05<00:00, 
1.50it/s]"}},"214388f3ffd445ba80939823b47239f5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"22d45b3aefe54f66a8def1eef1ea04ad":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"22ef816fad984942bd3f76730305b847":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"242787a512d5458a98134896e7c3d37b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2524cc60a110413ca00ea286bb6431c1":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"26cea904bbf74137988295bb27a5d19c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2cee8f764ebc42b48df71dd600d6727a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"2e479b43beaa42079d39a19b2f810513":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_9e179de97ae54b4c9add492e5b3192a3","max":1,"style":"IPY_MODEL_36d3c45221644912a0958caa47f66f2d","value":1}},"2e6085eb81824c8f98480aa8189e74b5":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2e8ef73cf092493892f6795939aaf3cd":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2f27a16df12e4644b63ed82512e00f48":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_0e061d7a2a8a4d7b922e039e94f120b4","max":6,"style":"IPY_MO
DEL_6d7ad0f9784445c2be08769c3060eed0","value":6}},"32cd3f08fd7d41dfa10861fd68d9af8b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"32d05d9786fd4d7585d67a123bf94760":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"3457713bdcc14c7f88fdd31bfa62f45a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"36d3c45221644912a0958caa47f66f2d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"36dac228089744239ad3e6d96dc0c074":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"37524c973d0741719c3c2d5fe79f9509":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fe5f82b26e9343719abdf33c61d26830","style":"IPY_MODEL_f27abf115a504c44a979b63293173f1a","value":" 8/8 [00:05<00:00, 1.50it/s]"}},"376a64e530954106824c12b46b3c89c3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3b839d0e7e8f4d86ad40e9a7f3eccd3a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_1f552688a73c4ce192e57de136574f0d","style":"IPY_MODEL_a52447e4120740d18bbaa3e32de32743","value":"Calculating Metrics: 
100%"}},"3e48d1e877af44ebab42eaeecd718b71":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3fa551bac6be438f94b528744bdbc50e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"432a77e4a74c446380582a9b4a3036e8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4363f468267f4b78ad1037db994ed5b8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_d4f17d2d5af849368d0ea40a2e4f42e7","max":23,"style":"IPY_MODEL_83386cde021c4ab1a4ccbad0301030bc","value":23}},"44fda475bc5f436580ff6f7f0360a710":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"45e886b4dcf9491eb66ac4e3a964bdd0":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"479fdc7e2d6e494e9fd32aae7c4b88de":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_93ea6576e8ba4ae39bac8d3cb36a6ce0","IPY_MODEL_2e479b43beaa42079d39a19b2f810513","IPY_MODEL_fe7f6ca0d3494e88b98855c27f463446"],"layout":"IPY_MODEL_a998e90371b04fd1908c233399286a37"}},"480175b1f155490f85c292938d6fd9fe":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"480839fe26d1442fb804004ba380f16a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"48b8b3a1febc44d8b527abb45a4c6e93":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"49cccc655ad140eaa0ce6b0682ac07ca":{"model_module"
:"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4c7d8542266d4499ae0a27248ac9dad4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4caf6ac8eb6443dc930f180ec47e4449":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4e965d3296ad41b6a08cbcc22a46aa1a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4f0d70e5711f422b8e2cf06682ff13b5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_22ef816fad984942bd3f76730305b847","style":"IPY_MODEL_e9100a732cd84ec99d6713e96836be1b","value":" 1/1 [00:00<00:00, 129.66it/s]"}},"52cc7c8e818043c4aff7ab677df9cae6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_3b839d0e7e8f4d86ad40e9a7f3eccd3a","IPY_MODEL_802d9d13243c42b3b9f4f68f9321e7c6","IPY_MODEL_ed654532b3d7439080c348131fc9c0c9"],"layout":"IPY_MODEL_ca73ebe5e4cd49e5897f3ddc028470cc"}},"534597d52d8645bc8757ff1425ab3a17":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"54a8a08b797d44a9a8561f68bdd55459":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"562ca2ade9e3465aa68bdab306444a39":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"576aded4582246d2b39f3cab9b56ed26":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"57d8d1f4e5634b8e90f6ecdb50cdc12d":{"model_module":"@jupyter-widgets/base","model_mod
ule_version":"2.0.0","model_name":"LayoutModel","state":{}},"58048cc7745940e18150a63c0497738a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fa82ced395de44e6b8b70ff3ff113bf9","style":"IPY_MODEL_df7cc85b6690421d9aa7a0d2594b4049","value":"Calculating Metrics: 100%"}},"59bc4c86d0f045f9a042c2b5d070055b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_8bea0ad7f56b49bab68c17c320893f61","IPY_MODEL_dbaa0230ebc748488f48b228d4170fe9","IPY_MODEL_a84177c22a674df9a42f45812e3132c0"],"layout":"IPY_MODEL_ace1fc6106054eedbdf087bed1eadf44"}},"59f43dfca0054f15a99268f508124a97":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_b8253946f31d4045aa124d5d90ecbaee","IPY_MODEL_a4715e13af9f409aaec7428886ea5a20","IPY_MODEL_1fc70689c8cc4f1d90dacc2e5ab44586"],"layout":"IPY_MODEL_610c5f44dd37430cb2b42bdb4fce9a8b"}},"600fc63819c14197a4e9d0b91645149e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_945abfb337c848ddaf42d7484953e699","IPY_MODEL_ab6d06faa07841d280535b96517d727a","IPY_MODEL_0a3051b4cbcd441ca03ef12f04882c46"],"layout":"IPY_MODEL_4c7d8542266d4499ae0a27248ac9dad4"}},"60790356c19e42b8be6d813647e3cbfe":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_45e886b4dcf9491eb66ac4e3a964bdd0","style":"IPY_MODEL_bee7144b79bb41718fe8f8e79f779e3e","value":"Calculating Metrics: 
100%"}},"610c5f44dd37430cb2b42bdb4fce9a8b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6308e380d4f24f9fa9608e71e718aec1":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"655bfa185a574618b9470155f2031462":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_ca0edd0c4f29493f9547ccaa1fb983f3","max":1,"style":"IPY_MODEL_432a77e4a74c446380582a9b4a3036e8","value":1}},"6978e9f94e0b48698010fdd7b6ed2346":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6a38df1026524f01a91ed24f280b49e0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"6a98c0d3a66a495680ceadf9852f20cf":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"6d7ad0f9784445c2be08769c3060eed0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"6f289eb908b34f21be8aaec3719e0fed":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"6f5de1653caf45309e3fbee63e0f782f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_6978e9f94e0b48698010fdd7b6ed2346","max":6,"style":"IPY_MODEL_49cccc655ad140eaa0ce6b0682ac07ca","value":6}},"70eb1aab4f3f48449abb957cf4e3ea51":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"
728be7b6e2f740b6b5f7cee553b5fa0d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"73da67d9b5044d1585a97df692aa5439":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"7727e1a53f264d2cbf099204c076fafa":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_95689701bc6446e6844a1ab3ca80ddf6","IPY_MODEL_4363f468267f4b78ad1037db994ed5b8","IPY_MODEL_eccc28fbe2024199bf2d72f7f54b9505"],"layout":"IPY_MODEL_4e965d3296ad41b6a08cbcc22a46aa1a"}},"78572d2f8b394e73acacb3dd0e97926f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_d302f2180cc94fda98787ea3d278f6fa","style":"IPY_MODEL_eac72d9582d741f1839c6577a127f8f8","value":"Calculating Metrics: 100%"}},"802d9d13243c42b3b9f4f68f9321e7c6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_54a8a08b797d44a9a8561f68bdd55459","max":23,"style":"IPY_MODEL_1ed9bd564e41413985421c9e2b88ae4d","value":23}},"83386cde021c4ab1a4ccbad0301030bc":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"843882531aa64669bc848600cd9220fc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"867c1e780e994f2bab762ad9e97a8414":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_0ecd0a0ba1fe40679aa1e906dc3450ae","style":"IPY_MODEL_562ca2ade9e3465aa68bdab306444a39","value":" 6/6 [00:00<00:00, 
33.84it/s]"}},"8a60b77f59624f6fac2c3f00b4b1107f":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"8bea0ad7f56b49bab68c17c320893f61":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_32cd3f08fd7d41dfa10861fd68d9af8b","style":"IPY_MODEL_8dc1642e60164570b4426180fdce16e7","value":"Calculating Metrics: 100%"}},"8c2aa368d6124f34800a0f74d92a3b3b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8c2c7e44a3a742ab8a84413d2a95e0af":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8c67e116afc34cdfa1c4e192e59851ff":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8d1c181c353b45d096c471435f03f7c0":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"8dc1642e60164570b4426180fdce16e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8ee9deef78e34a289f5e60efbb1084d3":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_22d45b3aefe54f66a8def1eef1ea04ad","max":6,"style":"IPY_MODEL_4caf6ac8eb6443dc930f180ec47e4449","value":6}},"93ea6576e8ba4ae39bac8d3cb36a6ce0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_99cd10935def4bf1aab2a185a830b761","style":"IPY_MODEL_bc3045b91698442396ec8ba1fe59aace","value":"Calculati
ng Metrics: 100%"}},"945abfb337c848ddaf42d7484953e699":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_9f9d53657e1e4aa8b14fc410fd698c2c","style":"IPY_MODEL_099e2b64e4e74a9aa6eb233ac55d1326","value":"Calculating Metrics: 100%"}},"94e5499a5284494d90a4cbdbe2cf7c6b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"95689701bc6446e6844a1ab3ca80ddf6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_bc667c51b9f846979c760e2893f991d8","style":"IPY_MODEL_d85392f7b90548a58047e5f7f45f6436","value":"Calculating Metrics: 100%"}},"95e691d9913244d6b03628ea94d74acc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"99cd10935def4bf1aab2a185a830b761":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9e179de97ae54b4c9add492e5b3192a3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9eb3d6df087a4ffba784420ecf868d6e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_48b8b3a1febc44d8b527abb45a4c6e93","style":"IPY_MODEL_8c67e116afc34cdfa1c4e192e59851ff","value":" 6/6 [00:00<00:00, 49.85it/s]"}},"9f9d53657e1e4aa8b14fc410fd698c2c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9fe60c346d1f47e29a80fb23252e5640":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_e5e96afa05bc4888ac5b1489d0fb4893","style":"IPY_MODEL_214388f3ffd445ba80939823b47239f5","value":" 1/1 [00:00<00:00, 
124.56it/s]"}},"a0a2c9d95264461e9d1252babf2f6656":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_58048cc7745940e18150a63c0497738a","IPY_MODEL_1081c895fca742239008eae9eaedb4c5","IPY_MODEL_37524c973d0741719c3c2d5fe79f9509"],"layout":"IPY_MODEL_fe7d5df2db1b4f6da36387d9b6d2a614"}},"a1923d78f384400ebaad53109b62b583":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"a4715e13af9f409aaec7428886ea5a20":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_26cea904bbf74137988295bb27a5d19c","max":8,"style":"IPY_MODEL_32d05d9786fd4d7585d67a123bf94760","value":8}},"a52447e4120740d18bbaa3e32de32743":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"a84177c22a674df9a42f45812e3132c0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_376a64e530954106824c12b46b3c89c3","style":"IPY_MODEL_6a38df1026524f01a91ed24f280b49e0","value":" 6/6 [00:00<00:00, 
48.53it/s]"}},"a998e90371b04fd1908c233399286a37":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ab6d06faa07841d280535b96517d727a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_2e6085eb81824c8f98480aa8189e74b5","max":1,"style":"IPY_MODEL_534597d52d8645bc8757ff1425ab3a17","value":1}},"ac0d4a81cabf4dd7955f1bbaf38578a5":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"acd23c095ecc462eb363c4cf4323f276":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ace1fc6106054eedbdf087bed1eadf44":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"afa3d23174d041e3a2e521ccfa1de78e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_d6724465b2e7421a886e2d502ec68095","style":"IPY_MODEL_480175b1f155490f85c292938d6fd9fe","value":"Calculating Metrics: 100%"}},"b2523a56ad5e40778fd1e57ef24ef7b2":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_2e8ef73cf092493892f6795939aaf3cd","style":"IPY_MODEL_6308e380d4f24f9fa9608e71e718aec1","value":"Calculating Metrics: 100%"}},"b426da7478d04f8a900142ea20cdfe93":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"b4bcb692d56c456ca87bcaec59af9d4e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_6f289eb908b34f21be8aaec3719e0fed","style":"IPY_MODEL_8c2aa368d6124f34800a0f74d92a3b3b","value":" 6/6 [00:00<00:00, 
50.83it/s]"}},"b6d8c23728a64ddabee029daa4d26336":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"b7dd1fd5391b48e3a5c20b757efe4a40":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_1970205870124bd6b142f1f3ee03d1d0","IPY_MODEL_fdba21199a194464ba4097a5f5c734f2","IPY_MODEL_d8730b191766440881bce0ab567f0d5b"],"layout":"IPY_MODEL_576aded4582246d2b39f3cab9b56ed26"}},"b8253946f31d4045aa124d5d90ecbaee":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_ac0d4a81cabf4dd7955f1bbaf38578a5","style":"IPY_MODEL_1bebdcaf99d8424b9ea4382ad8d87ca4","value":"Calculating Metrics: 100%"}},"bab97edbebe44417b266ba4be0370c28":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"bc3045b91698442396ec8ba1fe59aace":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"bc667c51b9f846979c760e2893f991d8":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"bee7144b79bb41718fe8f8e79f779e3e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"c3aebf50a10046a6aa535be64dc7dbe9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_480839fe26d1442fb804004ba380f16a","max":1,"style":"IPY_MODEL_1a27d2481b8d4231836b2633b860302b","value":1}},"c3e1776a73284ad6889bf57cf1cfd8e4":{"model_module":"@jupyter-widge
ts/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"ca0edd0c4f29493f9547ccaa1fb983f3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ca73ebe5e4cd49e5897f3ddc028470cc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d302f2180cc94fda98787ea3d278f6fa":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d4f17d2d5af849368d0ea40a2e4f42e7":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d527245bcc444a20a726033b34b7e666":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_78572d2f8b394e73acacb3dd0e97926f","IPY_MODEL_c3aebf50a10046a6aa535be64dc7dbe9","IPY_MODEL_4f0d70e5711f422b8e2cf06682ff13b5"],"layout":"IPY_MODEL_01c977149bf349f8858f6617ad427449"}},"d5ae0e31155c46c0830a393e5c1b8cd5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_dd8f9257a4ae41b1b2e3834349745549","IPY_MODEL_6f5de1653caf45309e3fbee63e0f782f","IPY_MODEL_102b47cebf9b43d5a1b1e61f03aef158"],"layout":"IPY_MODEL_e8633d0e61724ab9ae9a3c181178694c"}},"d6724465b2e7421a886e2d502ec68095":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d85392f7b90548a58047e5f7f45f6436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"d8730b191766440881bce0ab567f0d5b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_0b1f5956e42f43d091eafc7480965e39","style":"IPY_MODEL_2cee8f764e
bc42b48df71dd600d6727a","value":" 6/6 [00:00<00:00, 44.57it/s]"}},"db3d4138c490459c8b1b68c029848c12":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_242787a512d5458a98134896e7c3d37b","style":"IPY_MODEL_94e5499a5284494d90a4cbdbe2cf7c6b","value":"Calculating Metrics: 100%"}},"dbaa0230ebc748488f48b228d4170fe9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_3457713bdcc14c7f88fdd31bfa62f45a","max":6,"style":"IPY_MODEL_3fa551bac6be438f94b528744bdbc50e","value":6}},"dd8f9257a4ae41b1b2e3834349745549":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_ed46ea00c53a4665ae057e8ac119a1da","style":"IPY_MODEL_bab97edbebe44417b266ba4be0370c28","value":"Calculating Metrics: 100%"}},"df7cc85b6690421d9aa7a0d2594b4049":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"e226a99abc754773978b88b5a1897897":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"e5e96afa05bc4888ac5b1489d0fb4893":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"e8633d0e61724ab9ae9a3c181178694c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"e9100a732cd84ec99d6713e96836be1b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"ea2d294662854523b77d217d4950a81b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","s
tate":{"children":["IPY_MODEL_60790356c19e42b8be6d813647e3cbfe","IPY_MODEL_1656f62766ac43ceaea29c6f0db02232","IPY_MODEL_b4bcb692d56c456ca87bcaec59af9d4e"],"layout":"IPY_MODEL_fe8e972de1fc44ceaffaf2a343a5d6c8"}},"eac72d9582d741f1839c6577a127f8f8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"eccc28fbe2024199bf2d72f7f54b9505":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_73da67d9b5044d1585a97df692aa5439","style":"IPY_MODEL_728be7b6e2f740b6b5f7cee553b5fa0d","value":" 23/23 [00:10<00:00, 2.12it/s]"}},"ed46ea00c53a4665ae057e8ac119a1da":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ed654532b3d7439080c348131fc9c0c9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_8d1c181c353b45d096c471435f03f7c0","style":"IPY_MODEL_8c2c7e44a3a742ab8a84413d2a95e0af","value":" 23/23 [00:10<00:00, 
2.18it/s]"}},"f27abf115a504c44a979b63293173f1a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"f2cb18a44a69406ea6a22490b6ee8426":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_db3d4138c490459c8b1b68c029848c12","IPY_MODEL_2f27a16df12e4644b63ed82512e00f48","IPY_MODEL_867c1e780e994f2bab762ad9e97a8414"],"layout":"IPY_MODEL_57d8d1f4e5634b8e90f6ecdb50cdc12d"}},"f699af27d3a3494987c54ad9aebddb89":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fa82ced395de44e6b8b70ff3ff113bf9":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fdba21199a194464ba4097a5f5c734f2":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_3e48d1e877af44ebab42eaeecd718b71","max":6,"style":"IPY_MODEL_1af729b669734af4bc94ec438aa3d94c","value":6}},"fe5f82b26e9343719abdf33c61d26830":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7d5df2db1b4f6da36387d9b6d2a614":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7f6ca0d3494e88b98855c27f463446":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_acd23c095ecc462eb363c4cf4323f276","style":"IPY_MODEL_004447f23e944f8ba8e00aa8f6dfb436","value":" 1/1 [00:00<00:00, 127.18it/s]"}},"fe8e972de1fc44ceaffaf2a343a5d6c8":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}}},"version_major":2,"version_minor":0}}},"nbformat":4,"nbformat_minor":5} diff --git 
a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_training.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_training.ipynb new file mode 100644 index 000000000..e46db7559 --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/model_training.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"62fb334c-7832-488e-8d08-bd5f66b0e1e7","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["# Retrieve the feature set from Azure ML managed feature store"]},{"cell_type":"code","execution_count":null,"id":"0f4f6d26-911c-4a6e-b78c-35e553e33dd7","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run feature_set_retrieval"]},{"cell_type":"code","execution_count":null,"id":"67543523-4385-4026-a000-cbca6cb3ce30","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["transformed_df = df.toPandas()\n","transformed_df.head(5)"]},{"cell_type":"markdown","id":"2e453a94-ed81-4cc9-a9a2-a6ddf851cc39","metadata":{},"source":["# Track Machine Learning experiments and models\n","\n","A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from those data. Once you have trained the model, you can use it to reason over data that it hasn't seen before, and make predictions about that data.\n","\n","In this notebook, you will learn the basic steps to run an experiment, add a model version to track run metrics and parameters and register a model.\n"]},{"cell_type":"code","execution_count":null,"id":"e6bedd7f-6ab2-4f8c-b923-62735fe6a826","metadata":{},"outputs":[],"source":["import mlflow\n","\n","# Set given experiment as the active experiment. 
If an experiment with this name does not exist, a new experiment with this name is created.\n","ml_experiment_name = \"training-experiment\"\n","mlflow.set_experiment(ml_experiment_name)\n"]},{"cell_type":"code","execution_count":null,"id":"5ce8af4e-c9a2-45c2-ba0b-b3bcef2c8250","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["from pprint import pprint\n","from mlflow import MlflowClient\n","\n","def yield_artifacts(run_id, path=None):\n"," \"\"\"Yield all artifacts in the specified run\"\"\"\n"," client = MlflowClient()\n"," for item in client.list_artifacts(run_id, path):\n"," if item.is_dir:\n"," yield from yield_artifacts(run_id, item.path)\n"," else:\n"," yield item.path\n","\n","\n","def fetch_logged_data(run_id):\n"," \"\"\"Fetch params, metrics, tags, and artifacts in the specified run\"\"\"\n"," client = MlflowClient()\n"," data = client.get_run(run_id).data\n"," tags = {k: v for k, v in data.tags.items() if not k.startswith(\"mlflow.\")}\n"," artifacts = list(yield_artifacts(run_id))\n"," return {\n"," \"params\": data.params,\n"," \"metrics\": data.metrics,\n"," \"tags\": tags,\n"," \"artifacts\": artifacts,\n"," }"]},{"cell_type":"code","execution_count":null,"id":"9ef3b18d-e2ad-4a76-875c-a58d52120592","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["import numpy as np \n","from mlflow.models import infer_signature\n","from sklearn import svm\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n","\n","# Fill None values with 0\n","final_df = (\n"," transformed_df.drop([\"pickup_timestamp\"], axis=1, errors=\"ignore\").fillna(0)\n",")\n","\n","seed = 13\n","X_train, X_test, y_train, y_test = train_test_split(\n"," final_df.drop([\"scaled_demand\"], axis=1),\n"," final_df[\"scaled_demand\"],\n"," 
test_size=0.2, \n"," random_state=seed\n",")\n","\n","params = {\n"," \"C\": 10,\n"," \"kernel\": 'rbf',\n"," \"degree\": 8,\n"," \"gamma\": 'auto',\n"," \"tol\": 0.0001\n","}"]},{"cell_type":"code","execution_count":null,"id":"0fbb6019-7f8f-4c5d-8f58-fb9482d7519e","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["with mlflow.start_run() as run:\n"," clf = svm.SVR(**params).fit(X_train, y_train)\n","\n"," # Predict on the validation set\n"," y_pred = clf.predict(X_test)\n","\n"," # Calculate error metrics\n"," mae = mean_absolute_error(y_test, y_pred)\n"," mse = mean_squared_error(y_test, y_pred)\n"," rmse = np.sqrt(mse)\n"," r2 = r2_score(y_test, y_pred)\n","\n"," # Assemble the metrics we're going to write into a collection\n"," metrics = {\"mae\": mae, \"mse\": mse, \"rmse\": rmse, \"r2\": r2}\n"," \n"," signature = infer_signature(X_test, y_test)\n","\n"," # Log the parameters used for the model fit\n"," mlflow.log_params(params)\n","\n"," # Log the error metrics that were calculated during validation\n"," mlflow.log_metrics(metrics)\n","\n"," # Activate the MLFlow logging API to log your model artifacts\n"," mlflow.sklearn.log_model(clf, \"demand_prediction_model\", signature=signature)\n"," print(\"Model saved in run_id=%s\" % run.info.run_id)\n","\n"," # Register the model produced from your training job.\n"," mv = mlflow.register_model(\n"," \"runs:/{}/demand_prediction_model\".format(run.info.run_id), \"demand_prediction_model\"\n"," )"]},{"cell_type":"code","execution_count":null,"id":"55738e71-4311-4a58-846c-290f04dbb018","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["run_id = mlflow.last_active_run().info.run_id\n","print(f\"Logged data and model in run: {run_id}\")\n","\n","# show logged data\n","for key, data in fetch_logged_data(run_id).items():\n"," print(f\"\\n---------- 
logged {key} ----------\")\n"," pprint(data)"]},{"cell_type":"markdown","id":"ff01f0c3-5caf-4223-8a0f-2c4a4b6b977d","metadata":{"nteract":{"transient":{"deleting":false}}},"source":["## Register model training lineage to Purview"]},{"cell_type":"code","execution_count":null,"id":"0ed85e70-d44d-495e-aee1-2b70ba173c2c","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["%run data_catalog_and_lineage"]},{"cell_type":"code","execution_count":null,"id":"8bfab779-72a6-4536-8492-cc17be1f6c27","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["purview_data_catalog = PurviewDataCatalog()\n","\n","# Create features assets\n","fset = featurestore.feature_sets.get(featureset_name, all_featuresets[featureset_name])\n","target_features = final_df.columns\n","feature_assets = purview_data_catalog.prepare_feature_assets(featurestore_name,\n"," fset,\n"," target_features,\n"," **{\"tenant_id\": tenant_id,\n"," \"subscription_id\": featurestore_subscription_id,\n"," \"resource_group\": featurestore_resource_group_name})\n","# Prepare AML custom types\n","purview_data_catalog.prepare_aml_custom_types()\n","\n","# Create model training notebook process asset\n","current_notebook_context = mssparkutils.notebook.nb.context\n","workspace_id = current_notebook_context[\"currentWorkspaceId\"]\n","notebook_id = current_notebook_context[\"currentNotebookId\"]\n","notebook_name = \"model_training\"\n","process_data_asset = DataAsset(f\"{notebook_name} (Fabric notebook)\",\n"," \"process\",\n"," f\"https://{fabric_tenant}.powerbi.com/groups/{workspace_id}/synapsenotebooks/{notebook_id}\")\n","\n","# Create Azure ML experiment asset\n","ml_experiment_run = mlflow.get_run(mv.run_id).to_dictionary()\n","ml_artifact_uri = ml_experiment_run[\"info\"][\"artifact_uri\"]\n","ml_experiment_id = 
ml_artifact_uri.split(f\"{workspace_id}/\")[1].split(\"/\")[0]\n","ml_experiment_fqn = f\"https://msit.powerbi.com/groups/{workspace_id}/mlexperiments/{ml_experiment_id}?experience=data-science\"\n","\n","ml_experiment_asset = DataAsset(ml_experiment_name,\n"," \"ml_experiment\",\n"," ml_experiment_fqn)\n","\n","# Register lineage like features -> model training notebook -> ML experiment\n","training_model_lineage = DataLineage(input_data_assets=feature_assets,\n"," output_data_assets=[ml_experiment_asset],\n"," process_asset=process_data_asset)\n","purview_data_catalog.register_lineage(training_model_lineage)\n"]},{"cell_type":"code","execution_count":null,"id":"5543fff7-8ec3-4f50-9356-3a1f6e77fd3c","metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":["# Create Azure ML model asset\n","ml_model_id = mv.source.split(\".dfs.core.windows.net/\")[1].split(\"/\")[0]\n","ml_model_name = mv.name\n","ml_experiment_run = mlflow.get_run(mv.run_id).to_dictionary()\n","ml_experiment_run_name = ml_experiment_run[\"data\"][\"tags\"][\"mlflow.runName\"]\n","\n","ml_model_fqn = f\"https://msit.powerbi.com/groups/{workspace_id}/mlmodels/{ml_model_id}?experience=data-science\"\n","ml_model_asset = DataAsset(ml_model_name,\n"," \"ml_model\",\n"," ml_model_fqn,\n"," custom_properties={\"version\": mv.version,\n"," \"experimentRunName\": ml_experiment_run_name},\n"," relationship_attributes=[{\"type\": \"sources\",\n"," \"qualified_name\": ml_experiment_fqn}])\n","\n","# Register ML model entity connecting to ML experiment, without process node\n","purview_data_catalog.register_entity(ml_model_asset)\n"]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse 
PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python","ms_spell_check":{"ms_spell_check_language":"en"}},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"orig_nbformat":4,"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"},"widgets":{"application/vnd.jupyter.widget-state+json":{"state":{"004447f23e944f8ba8e00aa8f6dfb436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"099e2b64e4e74a9aa6eb233ac55d1326":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"0a3051b4cbcd441ca03ef12f04882c46":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_8a60b77f59624f6fac2c3f00b4b1107f","style":"IPY_MODEL_c3e1776a73284ad6889bf57cf1cfd8e4","value":" 1/1 [00:00<00:00, 
125.91it/s]"}},"0e061d7a2a8a4d7b922e039e94f120b4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0ecd0a0ba1fe40679aa1e906dc3450ae":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"0f373dcd84ca48e98f8f3e5a3f1a3e6f":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_afa3d23174d041e3a2e521ccfa1de78e","IPY_MODEL_8ee9deef78e34a289f5e60efbb1084d3","IPY_MODEL_9eb3d6df087a4ffba784420ecf868d6e"],"layout":"IPY_MODEL_2524cc60a110413ca00ea286bb6431c1"}},"1081c895fca742239008eae9eaedb4c5":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_95e691d9913244d6b03628ea94d74acc","max":8,"style":"IPY_MODEL_70eb1aab4f3f48449abb957cf4e3ea51","value":8}},"22d45b3aefe54f66a8def1eef1ea04ad":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"242787a512d5458a98134896e7c3d37b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2524cc60a110413ca00ea286bb6431c1":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2e479b43beaa42079d39a19b2f810513":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_9e179de97ae54b4c9add492e5b3192a3","max":1,"style":"IPY_MODEL_36d3c45221644912a0958caa47f66f2d","value":1}},"2e6085eb81824c8f98480aa8189e74b5":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"2f27a16df12e4644b63ed82512e00f48":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style
":"success","layout":"IPY_MODEL_0e061d7a2a8a4d7b922e039e94f120b4","max":6,"style":"IPY_MODEL_6d7ad0f9784445c2be08769c3060eed0","value":6}},"32cd3f08fd7d41dfa10861fd68d9af8b":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3457713bdcc14c7f88fdd31bfa62f45a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"36d3c45221644912a0958caa47f66f2d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"37524c973d0741719c3c2d5fe79f9509":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fe5f82b26e9343719abdf33c61d26830","style":"IPY_MODEL_f27abf115a504c44a979b63293173f1a","value":" 8/8 [00:05<00:00, 1.50it/s]"}},"376a64e530954106824c12b46b3c89c3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"3fa551bac6be438f94b528744bdbc50e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4363f468267f4b78ad1037db994ed5b8":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_d4f17d2d5af849368d0ea40a2e4f42e7","max":23,"style":"IPY_MODEL_83386cde021c4ab1a4ccbad0301030bc","value":23}},"479fdc7e2d6e494e9fd32aae7c4b88de":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_93ea6576e8ba4ae39bac8d3cb36a6ce0","IPY_MODEL_2e479b43beaa42079d39a19b2f810513","IPY_MODEL_fe7f6ca0d3494e88b98855c27f463446"],"layout":"IPY_MODEL_a998e90371b04fd1908c233399286a37"}},"480175b1f155490f85c292938d6fd9fe":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HT
MLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"48b8b3a1febc44d8b527abb45a4c6e93":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4c7d8542266d4499ae0a27248ac9dad4":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"4caf6ac8eb6443dc930f180ec47e4449":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"4e965d3296ad41b6a08cbcc22a46aa1a":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"534597d52d8645bc8757ff1425ab3a17":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"562ca2ade9e3465aa68bdab306444a39":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"57d8d1f4e5634b8e90f6ecdb50cdc12d":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"58048cc7745940e18150a63c0497738a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_fa82ced395de44e6b8b70ff3ff113bf9","style":"IPY_MODEL_df7cc85b6690421d9aa7a0d2594b4049","value":"Calculating Metrics: 
100%"}},"59bc4c86d0f045f9a042c2b5d070055b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_8bea0ad7f56b49bab68c17c320893f61","IPY_MODEL_dbaa0230ebc748488f48b228d4170fe9","IPY_MODEL_a84177c22a674df9a42f45812e3132c0"],"layout":"IPY_MODEL_ace1fc6106054eedbdf087bed1eadf44"}},"600fc63819c14197a4e9d0b91645149e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_945abfb337c848ddaf42d7484953e699","IPY_MODEL_ab6d06faa07841d280535b96517d727a","IPY_MODEL_0a3051b4cbcd441ca03ef12f04882c46"],"layout":"IPY_MODEL_4c7d8542266d4499ae0a27248ac9dad4"}},"6a38df1026524f01a91ed24f280b49e0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"6d7ad0f9784445c2be08769c3060eed0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"70eb1aab4f3f48449abb957cf4e3ea51":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"728be7b6e2f740b6b5f7cee553b5fa0d":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"73da67d9b5044d1585a97df692aa5439":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"7727e1a53f264d2cbf099204c076fafa":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_95689701bc6446e6844a1ab3ca80ddf6","IPY_MODEL_4363f468267f4b78ad1037db994ed5b8","IPY_MODEL_eccc28fbe2024199bf2d72f7f54b9505"],"layout":"IPY_MODEL_4e965d3296ad41b6a08cbcc22a46aa1a"}},"83386cde021c4ab1a4ccbad03010
30bc":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"ProgressStyleModel","state":{"description_width":""}},"867c1e780e994f2bab762ad9e97a8414":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_0ecd0a0ba1fe40679aa1e906dc3450ae","style":"IPY_MODEL_562ca2ade9e3465aa68bdab306444a39","value":" 6/6 [00:00<00:00, 33.84it/s]"}},"8a60b77f59624f6fac2c3f00b4b1107f":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"8bea0ad7f56b49bab68c17c320893f61":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_32cd3f08fd7d41dfa10861fd68d9af8b","style":"IPY_MODEL_8dc1642e60164570b4426180fdce16e7","value":"Calculating Metrics: 100%"}},"8c67e116afc34cdfa1c4e192e59851ff":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8dc1642e60164570b4426180fdce16e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"8ee9deef78e34a289f5e60efbb1084d3":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_22d45b3aefe54f66a8def1eef1ea04ad","max":6,"style":"IPY_MODEL_4caf6ac8eb6443dc930f180ec47e4449","value":6}},"93ea6576e8ba4ae39bac8d3cb36a6ce0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_99cd10935def4bf1aab2a185a830b761","style":"IPY_MODEL_bc3045b91698442396ec8ba1fe59aace","value":"Calculating Metrics: 
100%"}},"945abfb337c848ddaf42d7484953e699":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_9f9d53657e1e4aa8b14fc410fd698c2c","style":"IPY_MODEL_099e2b64e4e74a9aa6eb233ac55d1326","value":"Calculating Metrics: 100%"}},"94e5499a5284494d90a4cbdbe2cf7c6b":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"95689701bc6446e6844a1ab3ca80ddf6":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_bc667c51b9f846979c760e2893f991d8","style":"IPY_MODEL_d85392f7b90548a58047e5f7f45f6436","value":"Calculating Metrics: 100%"}},"95e691d9913244d6b03628ea94d74acc":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"99cd10935def4bf1aab2a185a830b761":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9e179de97ae54b4c9add492e5b3192a3":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"9eb3d6df087a4ffba784420ecf868d6e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_48b8b3a1febc44d8b527abb45a4c6e93","style":"IPY_MODEL_8c67e116afc34cdfa1c4e192e59851ff","value":" 6/6 [00:00<00:00, 
49.85it/s]"}},"9f9d53657e1e4aa8b14fc410fd698c2c":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"a0a2c9d95264461e9d1252babf2f6656":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_58048cc7745940e18150a63c0497738a","IPY_MODEL_1081c895fca742239008eae9eaedb4c5","IPY_MODEL_37524c973d0741719c3c2d5fe79f9509"],"layout":"IPY_MODEL_fe7d5df2db1b4f6da36387d9b6d2a614"}},"a84177c22a674df9a42f45812e3132c0":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_376a64e530954106824c12b46b3c89c3","style":"IPY_MODEL_6a38df1026524f01a91ed24f280b49e0","value":" 6/6 [00:00<00:00, 48.53it/s]"}},"a998e90371b04fd1908c233399286a37":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ab6d06faa07841d280535b96517d727a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_2e6085eb81824c8f98480aa8189e74b5","max":1,"style":"IPY_MODEL_534597d52d8645bc8757ff1425ab3a17","value":1}},"acd23c095ecc462eb363c4cf4323f276":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"ace1fc6106054eedbdf087bed1eadf44":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"afa3d23174d041e3a2e521ccfa1de78e":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_d6724465b2e7421a886e2d502ec68095","style":"IPY_MODEL_480175b1f155490f85c292938d6fd9fe","value":"Calculating Metrics: 
100%"}},"bc3045b91698442396ec8ba1fe59aace":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"bc667c51b9f846979c760e2893f991d8":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"c3e1776a73284ad6889bf57cf1cfd8e4":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"d4f17d2d5af849368d0ea40a2e4f42e7":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d6724465b2e7421a886e2d502ec68095":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"d85392f7b90548a58047e5f7f45f6436":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"db3d4138c490459c8b1b68c029848c12":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_242787a512d5458a98134896e7c3d37b","style":"IPY_MODEL_94e5499a5284494d90a4cbdbe2cf7c6b","value":"Calculating Metrics: 
100%"}},"dbaa0230ebc748488f48b228d4170fe9":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"FloatProgressModel","state":{"bar_style":"success","layout":"IPY_MODEL_3457713bdcc14c7f88fdd31bfa62f45a","max":6,"style":"IPY_MODEL_3fa551bac6be438f94b528744bdbc50e","value":6}},"df7cc85b6690421d9aa7a0d2594b4049":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"eccc28fbe2024199bf2d72f7f54b9505":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_73da67d9b5044d1585a97df692aa5439","style":"IPY_MODEL_728be7b6e2f740b6b5f7cee553b5fa0d","value":" 23/23 [00:10<00:00, 2.12it/s]"}},"f27abf115a504c44a979b63293173f1a":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLStyleModel","state":{"description_width":"","font_size":null,"text_color":null}},"f2cb18a44a69406ea6a22490b6ee8426":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HBoxModel","state":{"children":["IPY_MODEL_db3d4138c490459c8b1b68c029848c12","IPY_MODEL_2f27a16df12e4644b63ed82512e00f48","IPY_MODEL_867c1e780e994f2bab762ad9e97a8414"],"layout":"IPY_MODEL_57d8d1f4e5634b8e90f6ecdb50cdc12d"}},"fa82ced395de44e6b8b70ff3ff113bf9":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe5f82b26e9343719abdf33c61d26830":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7d5df2db1b4f6da36387d9b6d2a614":{"model_module":"@jupyter-widgets/base","model_module_version":"2.0.0","model_name":"LayoutModel","state":{}},"fe7f6ca0d3494e88b98855c27f463446":{"model_module":"@jupyter-widgets/controls","model_module_version":"2.0.0","model_name":"HTMLModel","state":{"layout":"IPY_MODEL_acd23c095ecc462eb363c4cf4323f27
6","style":"IPY_MODEL_004447f23e944f8ba8e00aa8f6dfb436","value":" 1/1 [00:00<00:00, 127.18it/s]"}}},"version_major":2,"version_minor":0}}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/utils.ipynb b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/utils.ipynb new file mode 100644 index 000000000..e49a85d7e --- /dev/null +++ b/single_tech_samples/fabric/feature_engineering_on_fabric/src/notebooks/utils.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"code","execution_count":null,"id":"d542764b-2f2b-4c6f-8ed0-dae02c2f5dbb","metadata":{},"outputs":[],"source":["import re\n","\n","def get_onelake_info():\n"," \"\"\" Get Fabric OneLake tenant, workspace ID and lakehouse ID from current attached lakehouse\n"," \"\"\"\n"," mount_source = \"\"\n"," fabric_onelake_tenant = \"\"\n"," fabric_workspace_id = \"\"\n"," fabric_lakehouse_id = \"\"\n"," mount_point_infos = mssparkutils.fs.mounts()\n","\n"," for mount_point_info in mount_point_infos:\n"," if mount_point_info.localPath == \"/lakehouse/default\":\n"," mount_source = mount_point_info.source\n","\n"," if mount_source:\n"," fabric_onelake_tenant = mount_source.split(\".dfs.fabric.microsoft.com\")[0].split(\"@\")[1]\n"," fabric_workspace_id = re.search(\"abfss://(.*)@\", mount_source, flags=0).group(1)\n"," fabric_lakehouse_id = re.search(\"dfs.fabric.microsoft.com/(.*)$\", mount_source, flags=0).group(1)\n"," \n"," return fabric_onelake_tenant, fabric_workspace_id, fabric_lakehouse_id\n"]}],"metadata":{"language_info":{"name":"python"},"microsoft":{"host":{},"language":"python"},"notebook_environment":{},"nteract":{"version":"nteract-front-end@1.0.0"},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{},"enableDebugMode":false}},"synapse_widget":{"state":{},"version":"0.1"}},"nbformat":4,"nbformat_minor":5} diff --git a/single_tech_samples/purview/README.md 
b/single_tech_samples/purview/README.md index 874c3c58e..de5bb98b1 100644 --- a/single_tech_samples/purview/README.md +++ b/single_tech_samples/purview/README.md @@ -22,12 +22,12 @@ Azure Purview is a managed data governance cloud service for centrally managing - [Setup and Deployment](#setup-and-deployment) - [Deployed Resources](#deployed-resources) - [Optional scripts](#optional-scripts) - - [search_advanced.sh](#search_advancedsh) - - [delete_purview_data.sh](#delete_purview_datash) + - [search\_advanced.sh](#search_advancedsh) + - [delete\_purview\_data.sh](#delete_purview_datash) ## Solution Overview -This solution sets up an [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) storage account, with a folder structure that enables data tiering (bronze, silver, gold), an [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/)(ADF) instance with linked services connecting to the data lake, a [key vault](https://azure.microsoft.com/en-us/services/key-vault/) for any secrets, and an [Azure Purview](https://azure.microsoft.com/en-us/services/purview/) account for data governance. +This solution sets up an [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) storage account, with a folder structure that enables data tiering (bronze, silver, gold), an [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/)(ADF) instance with linked services connecting to the data lake, a [key vault](https://azure.microsoft.com/en-us/services/key-vault/) for any secrets, and an [Azure Purview](https://azure.microsoft.com/en-us/products/purview/) account for data governance. The Azure Data Factory contains a simple ADF Pipeline that is stored in a git repository, that is taking data from the Dropzone and ingesting it into the bronze folder. 
@@ -41,10 +41,10 @@ The following shows the simple architecture of the Azure Purview connected to fe ### Technologies used -- [Azure Purview](https://azure.microsoft.com/en-au/services/devops/) -- [Azure Data Factory](https://azure.microsoft.com/en-au/services/data-factory/) -- [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) -- [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) +- [Azure Purview](https://azure.microsoft.com/en-us/products/devops/) +- [Azure Data Factory](https://azure.microsoft.com/en-us/products/data-factory/) +- [Azure Data Lake Gen2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) +- [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) ## Key Learnings @@ -100,7 +100,7 @@ To view the API Swagger documentation, download [PurviewCatalogAPISwagger.zip](h ### Prerequisites 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. 
#### Software pre-requisites diff --git a/single_tech_samples/storage/sample1_data_lifecycle/README.md b/single_tech_samples/storage/sample1_data_lifecycle/README.md index 7c974c473..12d6866fc 100644 --- a/single_tech_samples/storage/sample1_data_lifecycle/README.md +++ b/single_tech_samples/storage/sample1_data_lifecycle/README.md @@ -5,7 +5,7 @@ - [1. Solution Overview](#1-solution-overview) - [1.1. Scope](#11-scope) - [1.2. Use Case](#12-use-case) - - [1.3. Technologies used](#14-technologies-used) + - [1.3. Technologies used](#13-technologies-used) - [2. How to use this sample](#2-how-to-use-this-sample) - [2.1. Prerequisites](#21-prerequisites) - [2.1.1 Software Prerequisites](#211-software-prerequisites) @@ -14,6 +14,7 @@ - [2.4. Deployment validation and Execution](#24-deployment-validation-and-execution) - [2.5. Clean-up](#25-clean-up) - [3. Troubleshooting](#3-troubleshooting) + - [Error Building AzureRM Client](#error-building-azurerm-client) - [4. References](#4-references) ## 1. Solution Overview @@ -42,7 +43,7 @@ Data lifecycle is one such feature on Azure using which you can optimize costs b The following technologies are used to build this sample: -- [Azure Storage](https://azure.microsoft.com/services/storage/data-lake-storage) +- [Azure Storage](https://azure.microsoft.com/products/storage/data-lake-storage) - [Terraform](https://www.terraform.io) ## 2. How to use this sample @@ -54,7 +55,7 @@ This section holds the information about usage instructions of this sample. 
The following are the prerequisites for deploying this sample: - [Github account](https://github.com) -- [Azure Account](https://azure.microsoft.com/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +- [Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: The ability to create and deploy to an Azure [resource group](https://docs.microsoft.com/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/azure/role-based-access-control/overview) to the service principal over the resource group. - Active subscription with `Microsoft.Storage` [resource providers](https://docs.microsoft.com/azure/azure-resource-manager/management/azure-services-resource-providers) enabled. diff --git a/single_tech_samples/streamanalytics/README.md b/single_tech_samples/streamanalytics/README.md index 84a37271a..44aad16da 100644 --- a/single_tech_samples/streamanalytics/README.md +++ b/single_tech_samples/streamanalytics/README.md @@ -2,7 +2,7 @@ ![introductory diagram](./docs/images/ASA-job.PNG) -[Azure Stream Analytics](https://azure.microsoft.com/en-us/services/stream-analytics/) is a serverless real-time analytics service. The goal of this sample is to demonstrate how to develop a streaming pipeline, with IaC and testability in mind. +[Azure Stream Analytics](https://azure.microsoft.com/products/stream-analytics/) is a serverless real-time analytics service. The goal of this sample is to demonstrate how to develop a streaming pipeline, with IaC and testability in mind. 
## Prerequisites diff --git a/single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/README.md b/single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/README.md index f06a72f97..cbd5588c3 100644 --- a/single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/README.md +++ b/single_tech_samples/synapseanalytics/sample1_loading_dynamic_modules/README.md @@ -15,6 +15,7 @@ - [2.4. Deployment validation and Execution](#24-deployment-validation-and-execution) - [2.5. Clean-up](#25-clean-up) - [3. Troubleshooting](#3-troubleshooting) + - [SqlServerPasswordTooShort](#sqlserverpasswordtooshort) ## 1. Solution Overview @@ -59,8 +60,8 @@ The below diagram illustrates the design and the flow of the system: The following technologies are used to build this sample: -- [Azure Synapse Analytics](https://azure.microsoft.com/en-in/services/synapse-analytics/) -- [Azure Storage](https://azure.microsoft.com/en-au/services/storage/data-lake-storage/) +- [Azure Synapse Analytics](https://azure.microsoft.com/en-us/products/synapse-analytics/) +- [Azure Storage](https://azure.microsoft.com/en-us/products/storage/data-lake-storage/) ## 2. How to use this sample @@ -71,7 +72,7 @@ This section holds the information about usage instructions of this sample. The following are the prerequisites for deploying this sample: 1. [Github account](https://github.com/) -2. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +2. 
[Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: The ability to create and deploy to an Azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [collaborator role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. - Active subscription with the following [resource providers](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-services-resource-providers) enabled: diff --git a/single_tech_samples/synapseanalytics/sample3_integration_testing/README.md b/single_tech_samples/synapseanalytics/sample3_integration_testing/README.md index 6463f2274..89bedd3a3 100644 --- a/single_tech_samples/synapseanalytics/sample3_integration_testing/README.md +++ b/single_tech_samples/synapseanalytics/sample3_integration_testing/README.md @@ -28,7 +28,7 @@ The solution runs a flow triggered on a storage file upload. It then runs a samp It makes use of the following azure services: -- [Azure Synapse Analytics](https://azure.microsoft.com/en-au/services/synapse-analytics/) +- [Azure Synapse Analytics](https://azure.microsoft.com/products/synapse-analytics/) - [Azure Data Lake Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) ## Key Concepts @@ -49,7 +49,7 @@ It makes use of the following azure services: ### Prerequisites -1. [Azure Account](https://azure.microsoft.com/en-au/free/search/?&ef_id=Cj0KCQiAr8bwBRD4ARIsAHa4YyLdFKh7JC0jhbxhwPeNa8tmnhXciOHcYsgPfNB7DEFFGpNLTjdTPbwaAh8bEALw_wcB:G:s&OCID=AID2000051_SEM_O2ShDlJP&MarinID=O2ShDlJP_332092752199_azure%20account_e_c__63148277493_aud-390212648371:kwd-295861291340&lnkd=Google_Azure_Brand&dclid=CKjVuKOP7uYCFVapaAoddSkKcA) +1. 
[Azure Account](https://azure.microsoft.com/en-us/free/) - *Permissions needed*: ability to create and deploy to an Azure [resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/overview), a [service principal](https://docs.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals), and grant the [Contributor role](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to the service principal over the resource group. #### Software pre-requisites if you don't use dev container