In this lab you will use Azure Data Factory to download New York City images to your data lake. Then, as part of the same pipeline, you will use an Azure Databricks notebook to invoke the Computer Vision Cognitive Service, generate metadata documents and save them back in your data lake. The Azure Data Factory pipeline then finishes by saving all metadata information in a Cosmos DB collection. You will use Power BI to visualise the NYC images and their AI-generated metadata.
The estimated time to complete this lab is: 60 minutes.
In this section you will create a container in your MDWDataLake storage account that will be used as a repository for the NYC image files. You will copy 30 files from the MDWResources storage account into your NYCImages container.
- Use Storage Explorer to create 2 containers:
- Add a container for the images: click + Container.
- On the New container blade, enter the following details:
    - Name: nycimages
    - Public access level: Blob (anonymous read access for blobs only)
- Repeat the process to create the NYCImageMetadata container. This container will be used to host the metadata files generated by Cognitive Services before they can be saved in Cosmos DB.
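If you prefer to script the container creation instead of using Storage Explorer, the following is a minimal sketch with the azure-storage-blob Python SDK; the connection string is a placeholder you must supply, and note that container names must be lowercase:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder - substitute the connection string of your MDWDataLake-suffix account.
service = BlobServiceClient.from_connection_string("<your-datalake-connection-string>")

# Container names must be lowercase in Azure Storage.
service.create_container("nycimages", public_access="blob")  # anonymous read access for blobs
service.create_container("nycimagemetadata")                 # private; will hold the metadata JSON files
```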
In this section you will create a CosmosDB database called NYC and a collection called ImageMetadata that will host New York image metadata information.
- In the Azure Portal, go to the lab resource group and locate the CosmosDB account MDWCosmosDB-suffix.
- On the Overview panel, click + Add Container.
- On the Add Container blade, enter the following details:
    - Database id > Create new: NYC
    - Container id: ImageMetadata
    - Partition key: /requestId
    - Throughput: 400
    - Unique keys: /requestId
- Click OK to create the container.
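If you prefer to create the database and container programmatically, here is a minimal sketch using the azure-cosmos Python SDK; the endpoint and account key are placeholders, and the call mirrors the settings listed above:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholders - substitute your MDWCosmosDB-suffix endpoint and account key.
client = CosmosClient(
    "https://mdwcosmosdb-suffix.documents.azure.com:443/", credential="<account-key>"
)

database = client.create_database_if_not_exists("NYC")
container = database.create_container_if_not_exists(
    id="ImageMetadata",
    partition_key=PartitionKey(path="/requestId"),
    offer_throughput=400,
    unique_key_policy={"uniqueKeys": [{"paths": ["/requestId"]}]},
)
print("Created container:", container.id)
```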
Locate the Computer Vision Cognitive Services resource in the Azure Portal. It should be called MDWComputerVision under your Resource Group. Take note of its API Key and API Endpoint; you will need them in the next section.
- API Key:
- API Endpoint:
In this section you will import a Databricks notebook to your workspace and fill out the missing details about your Computer Vision API and your Data Lake account. This notebook will be executed from an Azure Data Factory pipeline and it will invoke the Computer Vision API to generate metadata about the images and save the result back to your data lake.
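For reference, the core of the notebook boils down to one Computer Vision Analyze call per image, roughly like the hedged sketch below. This is not the lab's exact notebook code: the endpoint, key, image URL and API version are placeholders/assumptions, and the real notebook also writes the resulting JSON document to the NYCImageMetadata container.

```python
import json
import requests

# Placeholders - substitute your own Computer Vision details and an image URL.
vision_endpoint = "https://<your-region>.api.cognitive.microsoft.com"
vision_key = "<your-computer-vision-api-key>"
image_url = "https://<your-datalake>.blob.core.windows.net/nycimages/<image>.jpg"

# Ask Computer Vision for categories, a description and tags for the image.
response = requests.post(
    f"{vision_endpoint}/vision/v2.0/analyze",
    params={"visualFeatures": "Categories,Description,Tags"},
    headers={"Ocp-Apim-Subscription-Key": vision_key},
    json={"url": image_url},
)
response.raise_for_status()

metadata = response.json()        # includes tags, captions and a requestId field
metadata["imageUrl"] = image_url  # record which image this document describes
print(json.dumps(metadata, indent=2))
```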
- Open your Azure Databricks workspace.
- On the Workspace blade, click your username under the Users menu.
- On the Users blade, click the arrow next to your user name and then Import.
- On the Import Notebooks pop-up window, choose File and select NYCImageMetadata-Lab.dbc.
- Click Import.
- On the NYCImageMetadata-Lab notebook, fill in your Computer Vision API details and your data lake account details.
- Attach the notebook to your previously created MDWDatabricksCluster cluster.
- Review the notebook code.
- If you want to test it, you can copy any publicly available image URL and paste it into the Image URL notebook parameter. You can use any of the following image URLs as examples:
    - https://media.moddb.com/images/downloads/1/105/104871/Test.png
    - https://static.pexels.com/photos/4204/nature-lawn-blur-flower.jpg
    - https://image.redbull.com/rbcom/052/2017-05-22/89eef344-d24f-4520-8680-8b8f7508b264/0012/0/0/0/2428/3642/800/1/best-beginner-motocross-bikes-ktm-250-sx-f.jpg
- Click Run All to execute the notebook.
- After a successful execution you will notice that a new JSON file has been saved in the NYCImageMetadata container in your data lake.
- Navigate to the Azure Portal and check the contents of the nycimagemetadata container in your MDWDataLake-suffix storage account.
- Download the generated file to inspect its contents.

IMPORTANT: Delete this test file before moving on to the next steps of this exercise.
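If you prefer to check and clean up the test output from a script rather than the portal, a minimal sketch with the azure-storage-blob Python SDK (the connection string is a placeholder):

```python
from azure.storage.blob import BlobServiceClient

# Placeholder - substitute the connection string of your MDWDataLake-suffix account.
service = BlobServiceClient.from_connection_string("<your-datalake-connection-string>")
container = service.get_container_client("nycimagemetadata")

# List the metadata files produced by the test run, then delete them so the
# container is empty before the Data Factory pipeline runs.
for blob in container.list_blobs():
    print("Deleting", blob.name)
    container.delete_blob(blob.name)
```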
Now that we can obtain metadata about arbitrary images, let's build an Azure Data Factory pipeline that copies the NYC images from our data vendor into our data lake and then uses the Databricks notebook to obtain their metadata.
First, we have to make Databricks callable from Azure Data Factory.
- On the Azure Databricks portal, click the User icon on the top right-hand corner of the screen.
- Click on the User Settings menu item.
- On the User Settings blade, under the Access Tokens tab, click Generate New Token.
- On the Generate New Token pop-up window, enter “Azure Data Factory Access” in the Comment field. Leave Lifetime (days) with the default value of 90 days.
- Click Generate.
- IMPORTANT: Copy the generated access token to Notepad and save it. You won’t be able to retrieve it once you close this window.

Token:

![](./Media/Lab4-Image27.png)
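If you want to sanity-check the token before using it in Data Factory, here is a minimal sketch against the Databricks REST API; the workspace URL is a placeholder, and listing clusters is simply a cheap call that requires a valid token:

```python
import requests

# Placeholders - substitute your workspace URL and the token you just generated.
workspace_url = "https://<your-databricks-workspace>.azuredatabricks.net"
token = "<your-access-token>"

# A successful call confirms the token is accepted by the workspace.
response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print([c["cluster_name"] for c in response.json().get("clusters", [])])
```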
- Open the Azure Data Factory portal and click the Author (pencil icon) option on the left-hand side panel. Under the Connections tab, click Linked Services and then click + New to create a new linked service connection.
- On the New Linked Service blade, click the Compute tab.
- Type “Azure Databricks” in the search box to find the Azure Databricks linked service.
- Click Continue.
- On the New Linked Service (Azure Databricks) blade, enter the following details:
    - Name: MDWDatabricks
    - Connect via integration runtime: AutoResolveIntegrationRuntime
    - Account selection method: From Azure subscription
    - Azure subscription: [select your subscription]
    - Databricks workspace: MDWDatabricks-suffix
    - Select cluster: Existing interactive cluster
    - Access token: [paste the access token you saved earlier]
    - Choose from existing clusters: MDWDatabricksCluster
- Click Test connection to make sure you entered the correct connection details. You should see a “Connection successful” message above the button.
- If the connection was successful, click Finish. If you got an error message, review the connection details above and try again.
In this section you will create a CosmosDB linked service in Azure Data Factory. CosmosDB will be used as the final repository of image metadata information generated by the Computer Vision API. Power BI will then be used to visualise the CosmosDB data.
- On the New Linked Service blade, click the Data Store tab.
- Type “Cosmos DB” in the search box to find the Azure Cosmos DB (SQL API) linked service.
- Click Continue.
- On the New Linked Service (Azure Cosmos DB (SQL API)) blade, enter the following details:
    - Name: MDWCosmosDB
    - Connect via integration runtime: AutoResolveIntegrationRuntime
    - Account selection method: From Azure subscription
    - Azure subscription: [select your subscription]
    - Cosmos DB account name: mdwcosmosdb-suffix
    - Database name: NYC
In this section you will create 4 Azure Data Factory data sets that will be used in the data pipeline.
| Dataset | Description |
|---|---|
| MDWResources_NYCImages | References the MDWResources shared storage account container that contains the source image files. |
| MDWDataLake_NYCImages | References your MDWDataLake-suffix storage account and acts as the destination for the image files copied from MDWResources_NYCImages. |
| MDWDataLake_NYCImageMetadata | References your MDWDataLake-suffix storage account and acts as the destination for the image metadata files generated by Databricks. |
| MDWCosmosDB_NYCImageMetadata | References your MDWCosmosDB-suffix database and will store the metadata for all images. |
![](./Media/Lab4-Image33.png)
- Type “Azure Blob Storage” in the search box and select Azure Blob Storage.
- On the Select Format blade, select Binary and click Continue.
- On the New Data Set tab, enter the following details:
    - General > Name: MDWResources_NYCImages
    - Connection > Linked Service: MDWResources
    - Connection > File Path: nycimages

Alternatively you can copy and paste the dataset JSON definition below:
{ "name": "MDWResources_NYCImages", "properties": { "linkedServiceName": { "referenceName": "MDWResources", "type": "LinkedServiceReference" }, "type": "AzureBlob", "typeProperties": { "fileName": "", "folderPath": "nycimages" } }, "type": "Microsoft.DataFactory/factories/datasets" }
- Leave remaining fields with default values.
- Repeat the process to create another dataset, this time referencing the NYCImages container in your MDWDataLake-suffix storage account.

Alternatively you can copy and paste the dataset JSON definition below:
{ "name": "MDWDataLake_NYCImages", "properties": { "linkedServiceName": { "referenceName": "MDWDataLake", "type": "LinkedServiceReference" }, "type": "AzureBlob", "typeProperties": { "folderPath": "nycimages" } }, "type": "Microsoft.DataFactory/factories/datasets" }
- Repeat the process to create another dataset, this time referencing the NYCImageMetadata container in your MDWDataLake-suffix storage account.
- On the New Data Set tab, enter the following details:
    - General > Name: MDWDataLake_NYCImageMetadata
    - Connection > Linked Service: MDWDataLake
    - Connection > File Path: nycimagemetadata
    - File format: JSON format

Alternatively you can copy and paste the dataset JSON definition below:
{ "name": "MDWDataLake_NYCImageMetadata", "properties": { "linkedServiceName": { "referenceName": "MDWDataLake", "type": "LinkedServiceReference" }, "type": "AzureBlob", "typeProperties": { "format": { "type": "JsonFormat", "filePattern": "setOfObjects" }, "fileName": "", "folderPath": "nycimagemetadata" } }, "type": "Microsoft.DataFactory/factories/datasets" }
- Repeat the process to create another dataset, this time referencing the ImageMetadata collection in your MDWCosmosDB database.
- Type “Cosmos DB” in the search box and select Azure Cosmos DB (SQL API). Click Continue.
- On the New Data Set tab, enter the following details:
    - General > Name: MDWCosmosDB_NYCImageMetadata
    - Connection > Linked Service: MDWCosmosDB
    - Collection name: ImageMetadata

Alternatively you can copy and paste the dataset JSON definition below:
{ "name": "MDWCosmosDB_NYCImageMetadata", "properties": { "linkedServiceName": { "referenceName": "MDWCosmosDB", "type": "LinkedServiceReference" }, "type": "DocumentDbCollection", "typeProperties": { "collectionName": "ImageMetadata" } }, "type": "Microsoft.DataFactory/factories/datasets" }
- Publish your dataset changes by clicking the Publish all button.
In this section you will create an Azure Data Factory pipeline to copy the New York images from MDWResources into your MDWDataLake-suffix storage account. The pipeline will then execute a Databricks notebook for each image and generate a metadata file in the NYCImageMetadata container. The pipeline finishes by saving the image metadata content in a CosmosDB database.
- Click Add Pipeline to create a new pipeline.
- On the New Pipeline tab, enter the following details:
    - General > Name: Copy NYC Images
    - Variables > [click + New] >
        - Name: ImageMetadataContainerUrl
        - Default Value: https://mdwdatalake*suffix*.blob.core.windows.net/nycimages/
- Leave remaining fields with default values.
- From the Activities panel, type “Copy Data” in the search box. Drag the Copy Data activity onto the design surface. This copy activity will copy image files from the shared MDWResources storage account to your MDWDataLake storage account.
- Select the Copy Data activity and enter the following details:
    - General > Name: CopyImageFiles
    - Source > Source dataset: MDWResources_NYCImages
    - Sink > Sink dataset: MDWDataLake_NYCImages
    - Sink > Copy Behavior: Preserve Hierarchy
- Leave remaining fields with default values.
- From the Activities panel, type “Get Metadata” in the search box. Drag the Get Metadata activity onto the design surface. This activity will retrieve the list of image files saved in the NYCImages container by the previous CopyImageFiles activity.
- Select the Get Metadata activity and enter the following details:
    - General > Name: GetImageFileList
    - Dataset: MDWDataLake_NYCImages
    - Source > Field list: Child Items
- Leave remaining fields with default values.
- Create a Success (green) precedence constraint between the CopyImageFiles and GetImageFileList activities. You can do it by dragging the green connector from CopyImageFiles and landing the arrow onto GetImageFileList.
- From the Activities panel, type “ForEach” in the search box. Drag the ForEach activity onto the design surface. This ForEach activity will act as a container for other activities that will be executed in the context of each image file returned by the GetImageFileList activity.
- Select the ForEach activity and enter the following details:
    - General > Name: ForEachImage
    - Settings > Items: @activity('GetImageFileList').output.childItems (see the illustration after this list for the shape of this output)
- Leave remaining fields with default values.
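To make the Items expression concrete: the childItems output of the Get Metadata activity is an array of file entries. The Python literal below is only an illustration (the file names are hypothetical, not taken from a real run), but it shows why item().name resolves to an image file name inside the loop.

```python
# Illustrative shape of @activity('GetImageFileList').output.childItems
# (file names are hypothetical examples, not from an actual run).
child_items = [
    {"name": "nyc_image_001.jpg", "type": "File"},
    {"name": "nyc_image_002.jpg", "type": "File"},
]
# Inside ForEachImage, item().name yields "nyc_image_001.jpg", then
# "nyc_image_002.jpg", and so on for each copied image.
```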
- Create a Success (green) precedence constraint between the GetImageFileList and ForEachImage activities. You can do it by dragging the green connector from GetImageFileList and landing the arrow onto ForEachImage.
- Double-click the ForEachImage activity to edit its contents. A single click is not enough; you must double-click.

IMPORTANT: Note that the design context is displayed on the top left-hand side of the design canvas.
- From the Activities panel, type “Notebook” in the search box. Drag the Notebook activity onto the design surface. This Notebook activity will pass the image URL as a parameter to the Databricks notebook we imported previously (recall that it accepts an image URL parameter).
- Select the Notebook activity and enter the following details:
    - General > Name: GetImageMetadata
    - Azure Databricks > Databricks Linked Service: MDWDatabricks
    - Settings > Notebook path: [Click Browse and navigate to /Users/your-user-name/NYCImageMetadata-Lab]
    - Base Parameters: [Click + New] >
        - nycImageUrl: @concat(variables('ImageMetadataContainerUrl'), item().name)
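For context, inside the Databricks notebook the nycImageUrl base parameter is read through a notebook widget, roughly as in this hedged sketch (the widget name matches the base parameter above; the exact cells in NYCImageMetadata-Lab may differ):

```python
# Databricks notebook cell - dbutils is available inside the Databricks runtime.
# Declare the widget and read the value passed in by the ADF Notebook activity.
dbutils.widgets.text("nycImageUrl", "")
image_url = dbutils.widgets.get("nycImageUrl")
print(f"Generating metadata for: {image_url}")
```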
Remember, the Databricks notebook writes the metadata as JSON files in your data lake. Now we need to copy that JSON data to CosmosDB.
- Navigate back to the “Copy NYC Images” pipeline canvas.
- From the Activities panel, type “Copy Data” in the search box. Drag the Copy Data activity onto the design surface. This copy activity will copy image metadata from the JSON files sitting in the NYCImageMetadata container in MDWDataLake to the ImageMetadata collection in CosmosDB.
- Select the Copy Data activity and enter the following details:
    - General > Name: ServeImageMetadata
    - Source > Source dataset: MDWDataLake_NYCImageMetadata
    - Sink > Sink dataset: MDWCosmosDB_NYCImageMetadata
- Leave remaining fields with default values.
- Create a Success (green) precedence constraint between the ForEachImage and ServeImageMetadata activities. You can do it by dragging the green connector from ForEachImage and landing the arrow onto ServeImageMetadata.
- Publish your pipeline changes by clicking the Publish all button.
- To execute the pipeline, click the Add trigger menu and then Trigger Now.
- To monitor the execution of your pipeline, click the Monitor menu on the left-hand side panel.
- You should be able to see the status of your pipeline execution on the right-hand side panel.
- Click the View Activity Runs button for detailed information about each activity execution in the pipeline. The whole execution should take between 7 and 8 minutes.

If you see an HTTP 400 error, it is most likely caused by the access permissions on your blob container. Double-check that your JSON metadata was correctly written to ADLS.
In this section you will explore the image metadata records generated by the Azure Data Factory pipeline in CosmosDB. You will use Cosmos DB’s SQL API to write SQL-like queries and retrieve documents based on your criteria.
- In the Azure Portal, go to the lab resource group and locate the CosmosDB account MDWCosmosDB-suffix.
- On the Data Explorer panel, click the Open Full Screen button on the top right-hand side of the screen.
- On the Open Full Screen pop-up window, click Open.
- On the Azure Cosmos DB Data Explorer window, under NYC > ImageMetadata, click Documents to see the full list of documents in the collection.
- Click any document in the list to see its contents.
- Click the ellipsis (…) next to the ImageMetadata collection.
- On the pop-up menu, click New SQL Query to open a new query tab.
- On the New Query 1 window, try the two SQL commands below. Click the Execute Selection button to execute your query.
SELECT m.id, m.imageUrl
FROM ImageMetadata AS m

SELECT m.id, m.imageUrl, tags.name
FROM ImageMetadata AS m
JOIN tags IN m.tags
WHERE tags.name = 'wedding'
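The same queries can also be run programmatically; here is a minimal sketch with the azure-cosmos Python SDK (the endpoint and account key are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholders - substitute your MDWCosmosDB-suffix endpoint and account key.
client = CosmosClient(
    "https://mdwcosmosdb-suffix.documents.azure.com:443/", credential="<account-key>"
)
container = client.get_database_client("NYC").get_container_client("ImageMetadata")

# Same JOIN query as above: images whose Computer Vision tags include 'wedding'.
query = (
    "SELECT m.id, m.imageUrl, tags.name "
    "FROM ImageMetadata AS m JOIN tags IN m.tags "
    "WHERE tags.name = 'wedding'"
)
for doc in container.query_items(query, enable_cross_partition_query=True):
    print(doc["id"], doc["imageUrl"])
```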
In this section you are going to use Power BI to visualize data from Cosmos DB. The Power BI report will use an Import connection to retrieve image metadata from Cosmos DB and visualise images sitting in your data lake.
IMPORTANT: Execute these steps on your host computer.
- Navigate to the Azure Portal and retrieve the mdwcosmosdb-suffix access key.
- Save it to Notepad. You will need it in the next step.
- Wherever you have Power BI Desktop installed (MDWDesktop or your personal laptop), open MDWLab4.pbit from the git repo.
- When prompted to enter the value of the MDWCosmosDB parameter, type the full server URI: https://mdwcosmosdb-*suffix*.documents.azure.com:443/
- Click Load.
- When prompted for an Account Key, paste the MDWCosmosDB account key you retrieved in the previous step.
- Click Connect.
- Once the data finishes loading, interact with the report by clicking on the different images displayed and check the accuracy of their associated metadata.
- Save your work and close Power BI Desktop.
- Was it absolutely necessary to use Databricks to run the Cognitive Services code? Is there a better solution?
- Could we simply put the entire workflow in Databricks and avoid using ADF entirely?
- How could we design this solution so that we could ingest new images every hour (or minute, day, etc) to a separate area of the data lake?