In this demo, a solution named Databoss is used to connect and integrate several Azure data services.
This is the high-level design with the main components and the data flow:
This project is implemented almost entirely within a private network architecture, using Private Link and Service Endpoints to connect securely to resources.
Copy the .auto.tfvars template:
cp templates/template.tf .auto.tfvars
Check your public IP address so you can add it to the firewall allow rules:
dig +short myip.opendns.com @resolver1.opendns.com
Add your public IP address to the public_ip_address_to_allow variable.
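Alternatively, Terraform also reads TF_VAR_-prefixed environment variables, so you can skip editing the file and set the value straight from the dig output (a small convenience, not part of the original template):
export TF_VAR_public_ip_address_to_allow="$(dig +short myip.opendns.com @resolver1.opendns.com)"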
Initialize Terraform and apply the configuration to create the Azure infrastructure:
terraform init
terraform apply -auto-approve
Pause the Synapse SQL pool to avoid costs while setting up the infrastructure:
az synapse sql pool pause -n pool1 --workspace-name synw-databoss -g rg-databoss
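If you want to confirm the pool state at any point, the same CLI can report it (an optional check, not one of the original steps):
az synapse sql pool show -n pool1 --workspace-name synw-databoss -g rg-databoss --query status -o tsv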
Once the apply phase is complete, approve the managed private endpoints for ADF:
bash scripts/approveManagedPrivateEndpoints.sh
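The script automates the approval step; done by hand it would look roughly like this for a single target resource (both IDs below are placeholders):
az network private-endpoint-connection list --id <target-resource-id>
az network private-endpoint-connection approve --id <connection-id> --description "Approved for ADF"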
💡 A single connection to Databricks is required to create the access policies on Azure Key Vault.
If everything is OK, proceed to the next section.
Upload some test data:
bash scripts/uploadFilesToDataLake.sh
bash scripts/uploadFilesToExternalStorage.sh
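For reference, a single file can also be uploaded to an ADLS Gen2 file system directly with the storage CLI (a sketch; account, file-system, and paths are illustrative):
az storage fs file upload --account-name <datalake-account> -f <filesystem> -s ./sample.csv -p raw/sample.csv --auth-mode login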
Run the ADF pipeline to import data from the external storage into the data lake:
az datafactory pipeline create-run \
--resource-group rg-databoss \
--name Adfv2CopyExternalFileToLake \
--factory-name adf-databoss
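The create-run command prints a runId; if you want to poll the pipeline instead of watching the portal, pass it to pipeline-run show (the run ID below is a placeholder):
az datafactory pipeline-run show -g rg-databoss --factory-name adf-databoss --run-id <run-id> --query status -o tsv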
If you've stopped the Synapse pool, resume it:
az synapse sql pool resume -n pool1 --workspace-name synw-databoss -g rg-databoss
Create the template scripts in Synapse:
bash scripts/createSynapseSQLScripts.sh
Now, connect to the Synapse web UI or directly to the SQL endpoint and execute the scripts.
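If you prefer a terminal over the web UI, the dedicated SQL endpoint follows the <workspace>.sql.azuresynapse.net pattern and can be reached with sqlcmd (a sketch using Azure AD interactive authentication; adjust to your login method):
sqlcmd -S synw-databoss.sql.azuresynapse.net -d pool1 -G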
The previous Azure run should have created the databricks/.auto.tfvars file to configure Databricks.
Apply the Databricks configuration:
💡 If you haven't yet, you need to log in to Databricks, which will create the Key Vault access policies.
terraform -chdir="databricks" init
terraform -chdir="databricks" apply -auto-approve
Check the workspace files, run the test notebooks, and make sure that connectivity works end to end.
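One quick check is listing the imported files with the Databricks CLI, assuming it is installed and configured against the workspace:
databricks workspace list /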
Deployment command:
func azure functionapp publish <FunctionAppName>
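After publishing, you can optionally tail the live logs with the same Core Tools to confirm the app starts cleanly:
func azure functionapp logstream <FunctionAppName>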
Create the virtual environment:
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
deactivate
Start the function locally:
func start
Get the Service Bus connection string:
az servicebus namespace authorization-rule keys list -n RootManageSharedAccessKey --namespace-name bus-databoss -g rg-databoss
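To grab only the value needed for the settings file below, filter the same command with a JMESPath query:
az servicebus namespace authorization-rule keys list -n RootManageSharedAccessKey --namespace-name bus-databoss -g rg-databoss --query primaryConnectionString -o tsv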
Create the local.settings.json file:
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsFeatureFlags": "EnableWorkerIndexing",
    "AzureWebJobsStorage": "",
    "AzureWebJobsServiceBusConnectionString": ""
  }
}
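Fill AzureWebJobsServiceBusConnectionString with the value retrieved above, and AzureWebJobsStorage with a storage account connection string, for example (the account name is a placeholder):
az storage account show-connection-string -n <storage-account> -g rg-databoss --query connectionString -o tsv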
- Consume IP addresses
- Internal runtime
- Code repository
- AD permissions
- Azure Monitor (Logs, Insights)
- Enable IR interactive authoring
Delete the Databricks configuration:
terraform -chdir="databricks" destroy -auto-approve
Delete the Azure infrastructure:
terraform destroy -auto-approve
- Tutorial: ADLSv2, Azure Databricks & Spark
- ADF Private Endpoints
- Integration runtime in Azure Data Factory
- Connect to Azure Data Lake Storage Gen2 and Blob Storage
- Azure Databricks: Manage service principals
- Azure Databricks: Query data in Azure Synapse Analytics
- Azure Synapse: Azure Private Link Hubs