This repo can be used to run Spark jobs and Spark Streaming on Azure Kubernetes Service (AKS). It provides the following key capabilities:
- Uses Spark 3.0.1 and Hadoop 3.2.1
- Integrates Azure Data Lake Storage Gen2 with Azure Kubernetes Service
- Spark jobs, Spark Streaming and Delta can run on AKS and read and write data in Azure Data Lake Storage Gen2
- Multiple node pools can be used in AKS
- Spot instance node pools can be used in AKS to run Spark executors
- A Livy endpoint can be used to submit jobs
This repo includes a sample Java application that can be used to test spark-submit. The application reads the available NYC taxi data files from Azure Data Lake Storage, aggregates them, and writes a summary back to Azure Data Lake Storage.
Prerequisites:
- Azure subscription
- Azure service principal (create one with az ad sp create-for-rbac)
- Azure CLI
- kubectl CLI
- A clone of this repository
- Azure Data Lake Storage Gen2 account with NYC taxi data (TLC Trip Record Data) uploaded, for testing spark-submit
- A build of this repo, uploaded to an ADLS Gen2 container
Update the Azure parameters under aks/. You can also modify other parameters, such as VM size and VNet configuration.
SSH_PUBLIC_KEY="your ssh public key"
SUBCRIPTION_ID="your subscription id"
RESOURCE_GROUP="your resource group"
VNET_NAME="your vnet name"
SUBNET_NAME="your subnet name"
SERVICE_PRINCIPAL="your service principal guid"
SERVICE_PRINCIPAL_SECRET="your service principal secret"
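Because the provisioning steps depend on every one of these parameters, it can help to fail early when one is missing. The following is a minimal sketch, not part of the repo: `check_required_vars` is a hypothetical helper, and the variable names are copied as they are listed above.

```shell
# Hypothetical helper (not part of this repo): report which of the
# parameters listed above are unset or empty before provisioning starts.
check_required_vars() {
  missing=""
  for name in SSH_PUBLIC_KEY SUBCRIPTION_ID RESOURCE_GROUP VNET_NAME \
              SUBNET_NAME SERVICE_PRINCIPAL SERVICE_PRINCIPAL_SECRET; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing required variables:$missing" >&2
    return 1
  fi
  return 0
}
```

Calling `check_required_vars || exit 1` at the top of a provisioning script stops the run before any Azure resource is created.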
Provision the AKS cluster.
Add a node pool with spot instances.
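As an illustration of the spot node pool step, the sketch below wraps `az aks nodepool add` in a function with a dry-run switch. It is not the repo's script: the pool name `spotpool`, the node counts and the `run` and `add_spot_nodepool` helpers are assumptions made for the example. `AKS_CLUSTER_NAME` and `RESOURCE_GROUP` are the variables used elsewhere in this README.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: "spotpool" and the node counts are placeholder values.
# --spot-max-price -1 means "pay up to the current on-demand price".
add_spot_nodepool() {
  run az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$AKS_CLUSTER_NAME" \
    --name spotpool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-count 1 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
}
```

Running `DRY_RUN=1 add_spot_nodepool` prints the command without contacting Azure. AKS taints spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so executor pods presumably need a matching toleration in their pod template before they can be scheduled on this pool.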
Configure kubectl and attach Azure Container Registry (ACR) to the AKS cluster. Attaching ACR lets AKS authenticate to ACR and pull container images.
az aks get-credentials --name $AKS_CLUSTER_NAME -g $RESOURCE_GROUP --admin
kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard
az aks update -n $AKS_CLUSTER_NAME -g $RESOURCE_GROUP --attach-acr $ACR_NAME
Build the Spark Docker image and push it to ACR.
Create the spark namespace, role and role binding.
kubectl apply -f spark/spark-rbac.yaml
Modify parameters such as ADLS Gen2 container names and jar file names under spark/, then submit the Spark job.
kubectl proxy
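With `kubectl proxy` running on its default port (8001), spark-submit can reach the Kubernetes API server at `k8s://http://127.0.0.1:8001`. The sketch below assembles such a command so it can be reviewed before it is run. It is an illustration, not the repo's submit script: `build_spark_submit_cmd`, `SPARK_IMAGE`, `APP_JAR`, `INPUT_PATH`, `OUTPUT_PATH` and the job name are hypothetical, while the class name, namespace and service account are taken from the Livy payload in this README. The argument order (input, then output) is an assumption, and the storage account key setting needed for ADLS access is left out.

```shell
# Hypothetical helper: print a spark-submit command that targets the
# API server exposed by `kubectl proxy` on its default port 8001.
# SPARK_IMAGE, APP_JAR, INPUT_PATH and OUTPUT_PATH are placeholder variables.
build_spark_submit_cmd() {
  echo spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name nyc-taxi-summary \
    --class org.anildwa.spark.NycTaxiData \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
    --conf "spark.kubernetes.container.image=$SPARK_IMAGE" \
    "$APP_JAR" "$INPUT_PATH" "$OUTPUT_PATH"
}
```

Printing the command first makes it easy to check the image name and paths; once they look right, the same arguments can be passed to spark-submit directly.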
Livy can be used to submit Spark jobs. Update the parameters under livy/, then build the Livy Docker image and push it to Azure Container Registry.
Deploy the Livy server to AKS. This creates a publicly accessible endpoint that can be used to post Spark jobs. To deploy with an internal IP address instead, modify deploy.yaml.
kubectl apply -f livy/deploy.yaml
Use Postman to post Spark jobs through the Livy REST API. Modify the JSON payload as required.
HTTP POST http://livy-ipaddress:8998/batches
{
  "name": "NycTaxiData18",
  "className": "org.anildwa.spark.NycTaxiData",
  "numExecutors": 2,
  "driverMemory": "4g",
  "executorMemory": "20g",
  "executorCores": 6,
  "conf": {
    "spark.executor.instances": 2,
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "abfss://spark-event-logs@<storageaccount>",
    "spark.kubernetes.driver.volumes.hostPath.aksvm.mount.path": "/mnt",
    "spark.kubernetes.driver.volumes.hostPath.aksvm.options.path": "/tmp",
    "spark.kubernetes.namespace": "spark",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark-sa",
    "spark.kubernetes.executor.podTemplateFile": "/opt/livy/work-dir/executor-pod-template.yaml",
    "spark.kubernetes.container.image": "<youracr><yoursparkimage:tag>",
    "spark.kubernetes.container.image.pullPolicy": "Always",
    "<storageaccount>": "<storage access key>"
  },
  "file": "abfss://jars@<storageaccount>",
  "args": ["abfss://nytaxidata@<storageaccount>", "abfss://nytaxidata@<storageaccount>"]
}
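The same request can be sent from a terminal instead of Postman. The sketch below posts a payload file to Livy's `/batches` endpoint and reads the batch state from `/batches/{batchId}/state`. The function names and the `payload.json` file name are hypothetical, and the `sed` expression is a deliberately simple way to pull the `state` field out of a small JSON response, not a general JSON parser.

```shell
# Submit a batch job; prints Livy's JSON response.
# $1 = Livy base URL (for example http://livy-ipaddress:8998)
# $2 = path to a JSON payload file (hypothetical name: payload.json)
livy_submit_batch() {
  curl -sS -X POST -H "Content-Type: application/json" \
    --data @"$2" "$1/batches"
}

# Extract the first "state" value from a Livy JSON response on stdin.
livy_parse_state() {
  sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Print the current state of a batch.
# $1 = Livy base URL, $2 = batch id returned by the submit call
livy_batch_state() {
  curl -sS "$1/batches/$2/state" | livy_parse_state
}
```

For example, `livy_submit_batch http://livy-ipaddress:8998 payload.json` returns a response containing the batch `id`, which can then be passed to `livy_batch_state` to check whether the job is still running.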