Spark Operator is an experimental project aiming to make it easier to run Spark-on-Kubernetes applications on a Kubernetes cluster by potentially automating certain tasks such as the following:
- Submitting applications on behalf of users so they don't need to deal with the submission process and the `spark-submit` command.
- Mounting user-specified secrets into the driver and executor Pods.
- Mounting user-specified ConfigMaps into the driver and executor Pods.
- Mounting ConfigMaps carrying Spark or Hadoop configuration files that are to be put into a directory referred to by the environment variable `SPARK_CONF_DIR` or `HADOOP_CONF_DIR` into the driver and executor Pods. Example use cases include shipping a `log4j.properties` file for configuring logging and a `core-site.xml` file for configuring Hadoop and/or HDFS access.
- Creating a `NodePort` service for the Spark UI running on the driver so the UI can be accessed from outside the Kubernetes cluster, without needing to use API server proxy or port forwarding.
- Copying application logs to a central place, e.g., a GCS bucket, for bookkeeping, post-run checking, and analysis.
- Automatically creating namespaces and setting up RBAC roles and quotas, and running users' applications in separate namespaces for better resource isolation and quota management.
To make such automation possible, Spark Operator uses the Kubernetes CustomResourceDefinition (CRD) and a corresponding CRD controller as well as an initializer. The CRD controller sets up the environment for an application and submits the application to run on behalf of the user, whereas the initializer handles customization of the Spark Pods.
This approach is completely different from the one in which the submission client creates a CRD object. Having externally created and managed CRD objects offers the following benefits:
- Things like creating namespaces and setting up RBAC roles and resource quotas represent a separate concern and are better done before applications get submitted.
- With the external CRD controller submitting applications on behalf of users, they don't need to deal with the submission process and the `spark-submit` command. Instead, the focus is shifted from thinking about commands to thinking about declarative YAML files describing Spark applications that can be easily version controlled.
- Externally created CRD objects make it easier to integrate Spark application submission and monitoring with users' existing pipelines and tooling on Kubernetes.
- Internally created CRD objects are good for capturing and communicating application/executor status to the users, but not for driver/executor pod configuration/customization, which very likely needs external input. Such external input most likely requires additional command-line options to get passed in.
Additionally, keeping the CRD implementation outside the Spark repository gives us a lot of flexibility in terms of the functionality to add to the CRD controller, as well as full control over the code review and release process.
Currently the `SparkApplication` CRD controller and the pod initializer work.
Before building the Spark operator for the first time, run the following commands to get the required Kubernetes code generators:

```console
go get -u k8s.io/code-generator/cmd/deepcopy-gen
go get -u k8s.io/code-generator/cmd/defaulter-gen
```
To build the Spark operator, run the following command:

```console
make build
```
To additionally build a Docker image of the Spark operator, run the following command:

```console
make image-tag=<image tag> image
```
To further push the Docker image to Docker Hub, run the following command:

```console
make image-tag=<image tag> push
```
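For example, with a hypothetical Docker Hub repository `myuser/spark-operator` (an illustrative placeholder, not part of this project), building and pushing the image would look like:

```console
# The image tag below is only an illustrative placeholder.
make image-tag=myuser/spark-operator:latest image
make image-tag=myuser/spark-operator:latest push
```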
To deploy the Spark operator, run the following command:

```console
kubectl create -f manifest/spark-operator.yaml
```
Spark Operator is typically deployed and run using `manifest/spark-operator.yaml` through a Kubernetes `Deployment`. However, users can still run it outside a Kubernetes cluster and have it talk to the Kubernetes API server of a cluster by specifying the path to a `kubeconfig` file, which can be done using the `--kubeconfig` flag.
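As a minimal sketch of the out-of-cluster mode, assuming `make build` produces a local binary named `spark-operator` (the actual binary name and output path depend on the Makefile), the operator could be pointed at a cluster as follows:

```console
# Run the operator outside the cluster, using the kubeconfig of the target cluster.
# The binary name/path here is an assumption, not taken from the Makefile.
./spark-operator --kubeconfig=$HOME/.kube/config
```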
Spark Operator uses multiple workers in the initializer controller and the submission runner. The number of worker threads used in each is controlled by the command-line flags `--initializer-threads` and `--submission-threads`, respectively, with default values of 10 and 3.
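If the defaults need tuning, these flags can be passed as container arguments in the operator's `Deployment`. The snippet below is a sketch; the container name and image placeholder are assumptions and do not necessarily match `manifest/spark-operator.yaml`:

```yaml
# Excerpt of the operator Deployment pod template (container name and image are placeholders).
containers:
  - name: spark-operator
    image: <spark-operator-image>   # image built and pushed via "make image" / "make push"
    args:
      - --initializer-threads=20    # raise the initializer worker count from the default of 10
      - --submission-threads=5      # raise the submission worker count from the default of 3
```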
To run the example Spark application, run the following command:

```console
kubectl create -f manifest/spark-application-example.yaml
```
This will create a `SparkApplication` object named `spark-app-example`. Check the object by running the following command:

```console
kubectl get sparkapplications spark-app-example -o=yaml
```
This will show something similar to the following:
```yaml
apiVersion: spark-operator.k8s.io/v1alpha1
kind: SparkApplication
metadata:
  ...
spec:
  deps: {}
  driver:
    cores: "0.1"
    image: kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
    memory: null
  executor:
    cores: "0.1"
    image: kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
    instances: 1
    memory: 512m
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
  mainClass: org.apache.spark.examples.SparkPi
  mode: cluster
  submissionByUser: false
  type: Scala
status:
  appId: spark-app-example-1553506518
  applicationState:
    errorMessage: ""
    state: COMPLETED
  driverInfo:
    podName: spark-app-example-1510603652649-driver
    webUIPort: 4040
    webUIServiceName: spark-app-example-ui-1553506518
    webUIURL: 35.193.197.127:4040
  executorState:
    spark-app-example-1510603652649-exec-1: COMPLETED
```
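The `driverInfo` section points at the driver pod and the `NodePort` service created for the Spark UI, so both can be inspected with standard `kubectl` commands, for example:

```console
# Inspect the UI service and driver pod reported in the status above.
kubectl get service spark-app-example-ui-1553506518
kubectl logs spark-app-example-1510603652649-driver
```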
The initializer works independently, with or without the CRD controller. It looks for certain custom annotations on Spark driver and executor Pods to perform its tasks. To use the initializer without leveraging the CRD, simply add the needed annotations to the driver and/or executor Pods using the following Spark configuration properties when submitting your Spark applications through the `spark-submit` script.
```
--conf spark.kubernetes.driver.annotations.[AnnotationName]=value
--conf spark.kubernetes.executor.annotations.[AnnotationName]=value
```
Currently the following annotations are supported:
| Annotation | Value |
| ---------- | ----- |
| `spark-operator.k8s.io/sparkConfigMap` | Name of the Kubernetes ConfigMap storing Spark configuration files (to which `SPARK_CONF_DIR` applies) |
| `spark-operator.k8s.io/hadoopConfigMap` | Name of the Kubernetes ConfigMap storing Hadoop configuration files (to which `HADOOP_CONF_DIR` applies) |
| `spark-operator.k8s.io/configMap.[ConfigMapName]` | Mount path of the ConfigMap named `ConfigMapName` |
| `spark-operator.k8s.io/GCPServiceAccount.[ServiceAccountSecretName]` | Mount path of the secret storing GCP service account credentials (typically a JSON key file) named `ServiceAccountSecretName` |
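As an illustration, the sketch below submits the SparkPi example from earlier while asking the initializer to mount a Spark ConfigMap named `my-spark-conf` (a hypothetical name) into both the driver and executor Pods; the API server address is a placeholder:

```console
# Hypothetical spark-submit invocation; only the annotation-related --conf flags
# come from this document, the rest is standard Spark-on-Kubernetes usage.
bin/spark-submit \
  --master k8s://https://<api-server-host>:<port> \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.driver.annotations.spark-operator.k8s.io/sparkConfigMap=my-spark-conf \
  --conf spark.kubernetes.executor.annotations.spark-operator.k8s.io/sparkConfigMap=my-spark-conf \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
```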
When using the `SparkApplication` CRD to describe a Spark application, the annotations are automatically added by the CRD controller. `manifest/spark-application-example-gcp-service-account.yaml` shows an example `SparkApplication` with a user-specified GCP service account credential secret named `gcp-service-account` that is to be mounted into both the driver and executor containers. The `SparkApplication` controller automatically adds the annotation `spark-operator.k8s.io/GCPServiceAccount.gcp-service-account=/mnt/secrets` to the driver and executor Pods. The initializer sees the annotation and automatically adds a secret volume into the Pods.
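For such a setup, the referenced secret needs to exist before the application is submitted; it could, for instance, be created from a downloaded service account JSON key (the key file name and path below are assumptions):

```console
# Create the gcp-service-account secret from a service account key file.
kubectl create secret generic gcp-service-account \
  --from-file=key.json=/path/to/service-account-key.json
```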