Kubeflow Spark

Orchestrate Spark jobs using Kubeflow, a modern machine learning orchestration framework, and poll for their status. Read the related blog post.

Requirements

  1. Kubernetes cluster (1.17+)
  2. Kubeflow pipelines (1.7.0+)
  3. Spark Operator (1.1.0+)
  4. Python (3.6+)
  5. kubectl
  6. helm3

Getting started

Run make all to start everything and skip to step 6, or:

  1. Start your local cluster:
     ./scripts/start-minikube.sh
  2. Install Kubeflow Pipelines:
     ./scripts/install-kubeflow.sh
  3. Install the Spark Operator:
     ./scripts/install-spark-operator.sh
  4. Create the Spark Service Account and add permissions:
     ./scripts/add-spark-rbac.sh
  5. Make the Kubeflow UI reachable, either by:
     a. (Optional) adding a Kubeflow UI Ingress:
        ./scripts/add-kubeflow-ui-ingress.sh
     b. (Optional) forwarding the service port, e.g.:
        kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8005:80
  6. Create the Kubeflow pipeline definition file (a sketch of what this script defines is shown after this list):
     python kubeflow_pipeline.py
  7. Navigate to the Pipelines UI and upload the newly created pipeline from the file spark_job_pipeline.yaml.
  8. Trigger a pipeline run. Make sure to set spark-sa as the Service Account for the execution.
  9. Enjoy your orchestrated Spark job execution!
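For reference, here is a minimal sketch of the kind of pipeline kubeflow_pipeline.py defines: a single kfp ResourceOp that submits a SparkApplication custom resource and polls it until the Spark Operator reports success or failure. The manifest values below (application name, image, main class, jar path) are illustrative assumptions, not the repository's actual settings.

```python
import kfp
import kfp.dsl as dsl

# Illustrative SparkApplication manifest; image, class, and jar are assumptions.
SPARK_APPLICATION = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "kubeflow"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "gcr.io/spark-operator/spark:v3.1.1",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar",
        "sparkVersion": "3.1.1",
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark-sa"},
        "executor": {"cores": 1, "instances": 1, "memory": "512m"},
    },
}

@dsl.pipeline(
    name="spark-job-pipeline",
    description="Submit a Spark job and poll for its status",
)
def spark_job_pipeline():
    # ResourceOp creates the SparkApplication and keeps polling the resource
    # until either the success or the failure condition below is satisfied.
    dsl.ResourceOp(
        name="submit-spark-job",
        k8s_resource=SPARK_APPLICATION,
        action="create",
        success_condition="status.applicationState.state == COMPLETED",
        failure_condition="status.applicationState.state == FAILED",
    )

if __name__ == "__main__":
    # Produces the spark_job_pipeline.yaml uploaded in step 7.
    kfp.compiler.Compiler().compile(spark_job_pipeline, "spark_job_pipeline.yaml")
```

The success_condition/failure_condition pair is what implements the polling: the pipeline step stays running until the SparkApplication's status matches one of the two conditions.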
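Steps 7 and 8 can also be done programmatically with the kfp SDK client instead of the UI. This is a hedged sketch: the host URL assumes the port-forward from step 5b, the run and pipeline names are illustrative, and the service_account argument requires a reasonably recent kfp 1.x SDK.

```python
import kfp

# Assumes the port-forward from step 5b is active.
client = kfp.Client(host="http://localhost:8005")

# Upload the compiled pipeline definition (same file as in step 7).
client.upload_pipeline(
    pipeline_package_path="spark_job_pipeline.yaml",
    pipeline_name="spark-job-pipeline",
)

# Trigger a run directly; spark-sa is the Service Account created in step 4.
client.create_run_from_pipeline_package(
    "spark_job_pipeline.yaml",
    arguments={},
    run_name="spark-job-run",
    service_account="spark-sa",
)
```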
