Orchestrate Spark jobs using Kubeflow, a modern machine learning orchestration framework. Read the related blog post.
Prerequisites:

- Kubernetes cluster (1.17+)
- Kubeflow Pipelines (1.7.0+)
- Spark Operator (1.1.0+)
- Python (3.6+)
- kubectl
- helm3
Run `make all` to start everything and skip to step 6, or follow the steps below:
1. Start your local cluster:

   ```sh
   ./scripts/start-minikube.sh
   ```

2. Install Kubeflow Pipelines:

   ```sh
   ./scripts/install-kubeflow.sh
   ```

3. Install the Spark Operator:

   ```sh
   ./scripts/install-spark-operator.sh
   ```

4. Create the Spark Service Account and add permissions:

   ```sh
   ./scripts/add-spark-rbac.sh
   ```
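The RBAC script presumably creates a `spark-sa` service account and grants it the permissions the Spark Operator needs to manage driver and executor pods. A minimal sketch of what such a manifest could look like (the role name, namespace, and rules are assumptions, not copied from the script):

```yaml
# Hypothetical sketch of the RBAC objects add-spark-rbac.sh might create.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role          # assumed name
  namespace: default
rules:
  # Spark drivers need to create and manage executor pods and services.
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding  # assumed name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```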
5. Make the Kubeflow UI reachable, either by:

   a. (Optional) adding a Kubeflow UI Ingress:

      ```sh
      ./scripts/add-kubeflow-ui-ingress.sh
      ```

   b. (Optional) forwarding the service port, e.g.:

      ```sh
      kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8005:80
      ```

6. Create the Kubeflow pipeline definition file:

   ```sh
   python kubeflow_pipeline.py
   ```
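A pipeline like this typically submits a `SparkApplication` custom resource for the Spark Operator to run. As a hedged sketch of the manifest `kubeflow_pipeline.py` might build (the image, main class, and jar path below are illustrative assumptions, not taken from this repository):

```python
# Hypothetical sketch: build the SparkApplication custom resource that a
# Kubeflow pipeline step could submit to the Spark Operator.
# Image, class, and jar names are assumptions for illustration only.
def spark_application(name="spark-pi", namespace="default",
                      service_account="spark-sa"):
    """Return a SparkApplication manifest as a plain dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": "gcr.io/spark-operator/spark:v3.1.1",  # assumed image
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": (
                "local:///opt/spark/examples/jars/"
                "spark-examples_2.12-3.1.1.jar"  # assumed jar path
            ),
            "sparkVersion": "3.1.1",
            # The driver must run as the service account created in step 4
            # so it has permission to spawn executor pods.
            "driver": {"cores": 1, "serviceAccount": service_account},
            "executor": {"cores": 1, "instances": 2},
        },
    }
```

In the pipeline itself, a dict like this would be handed to a Kubeflow resource step that creates the object in the cluster; the important detail is that `driver.serviceAccount` matches the `spark-sa` account from step 4.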
7. Navigate to the Pipelines UI and upload the newly created pipeline from the file `spark_job_pipeline.yaml`.
8. Trigger a pipeline run. Make sure to set `spark-sa` as the Service Account for the execution.

9. Enjoy your orchestrated Spark job execution!