[SPARK-18278] [Scheduler] Support native submission of spark jobs to a kubernetes cluster #16061
Conversation
* Use images with spark pre-installed
* Simplify staging for client.jar
* Remove some tarball-uri code. Fix kube client URI in scheduler backend. Number of executors defaults to 1
* Tweak client again; works across my testing environments
* Use executor.sh shim
* Allow configuration of service account name for driver pod
* Spark image as a configuration setting instead of env var
* Namespace from spark.kubernetes.namespace
* Configure client with namespace; smooths out cases when not logged in as admin
* Assume a downloaded jar in /opt/spark/kubernetes to avoid dropping protections on /opt
* Add support for dynamic executors
* Fill in some sane logic for doKillExecutors
* doRequestTotalExecutors signals graceful executor shutdown, and favors idle executors
Test build #69335 has finished for PR 16061 at commit
@@ -596,6 +599,26 @@ object SparkSubmit extends CommandLineUtils {
}
}

if (isKubernetesCluster) {
What if in Kubernetes and client mode?
We should throw an error for now if in client mode with Kubernetes. There are still open questions about how we can manage the networking there, so we can revisit the support later.
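For illustration, a minimal sketch of what that guard could look like, following the pattern of the existing cluster-manager checks in SparkSubmit; the `KUBERNETES` and `CLIENT` constants and the exact message are assumptions, not taken from this diff:

```scala
// Sketch only: reject client mode with Kubernetes for now, mirroring the
// existing (clusterManager, deployMode) validation in SparkSubmit.
(clusterManager, deployMode) match {
  case (KUBERNETES, CLIENT) =>
    printErrorAndExit("Client mode is currently not supported with Kubernetes.")
  case _ => // other combinations fall through to the existing checks
}
```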
Hey Guys, really excited about this work by the way. :)
Just wondering if client mode from within a kubernetes cluster will be supported? Not looking to add work, just curious.
I'm wondering whether we could check if the client IP is within the kubernetes container CIDR (this info should be available via the kubernetes API), instead of just blocking client mode altogether. This would support Zeppelin/Jupyter instances running within kube connecting to Spark, which is a big use case for us.
We could take the node list and check whether the client IP is in the v1.NodeSpec.PodCIDR of any node. Although as k8s clusters get large (nodes > 1000), this may become a costly operation.
This would allow for client mode support, provided the client is on one of the kube nodes.
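A rough sketch of that check, assuming the fabric8 Kubernetes client used elsewhere in this PR and Apache Commons Net's SubnetUtils for the CIDR test; the helper name and wiring are illustrative, not part of this diff:

```scala
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import org.apache.commons.net.util.SubnetUtils
import scala.collection.JavaConverters._

// Sketch only: true if clientIp falls inside any node's pod CIDR.
// Listing every node is O(number of nodes), which is the cost concern above.
def clientIsInPodCidr(client: DefaultKubernetesClient, clientIp: String): Boolean = {
  client.nodes().list().getItems.asScala.exists { node =>
    Option(node.getSpec.getPodCIDR).exists { cidr =>
      new SubnetUtils(cidr).getInfo.isInRange(clientIp)
    }
  }
}
```

For large clusters the node list could be cached or filtered by label rather than fetched on every submission.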
BUILD_COMMAND=("$MVN" -T 1C clean package -DskipTests $@) | ||
# BUILD_COMMAND=("$MVN" -T 1C clean package -DskipTests $@) | ||
|
||
BUILD_COMMAND=("$MVN" -T 2C package -DskipTests $@) |
?
This should be reverted.
logInfo(s"Adding $delta new executors") | ||
createExecutorPods(delta) | ||
} else if (delta < 0) { | ||
val d = -delta |
This shouldn't happen, assert instead
Delta (as currently computed) can become negative whenever the requested total starts to decrease.
However, this logic warrants a total overhaul, since it was designed around the assumption that doRequestTotalExecutors
is symmetric, in the sense of having to handle a decreasing total analogously to an increasing total. But the responsibilities for executor shutdown vs. spin-up aren't symmetric. I might file a small doc PR to clarify that point.
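For illustration only, a rough sketch of the asymmetric shape this logic could take; `requestedTotal` and `currentExecutorCount` are placeholder names, and this is not the PR's final code:

```scala
// Sketch: only scale up here, and leave scale-down to doKillExecutors and
// idle-executor reaping rather than mirroring the scale-up path.
val delta = requestedTotal - currentExecutorCount
if (delta > 0) {
  logInfo(s"Adding $delta new executors")
  createExecutorPods(delta)
}
// A negative delta simply means the requested total shrank; it is
// intentionally not handled symmetrically here.
```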
We also need to add a method to find the not-ready pods left over from the earlier scaling event before we decide to create new ones. We had a discussion about this here. I think relying on people to set resource requests/limits, or to write admission controllers correctly, is one part of the solution, but we should also have an additional safeguard to ensure we're not flooding the system with pod creation requests.
I had started in on keeping track of new pods until they stop saying "pending"; however, I think this may be a good use case for a PodWatcher.
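A sketch of what a fabric8 watch on the executor pods could look like; the "spark-executor" label and the `client`/`nameSpace` fields are assumptions based on the rest of this diff, not the PR's actual code:

```scala
import java.util.concurrent.ConcurrentHashMap

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Sketch: track executor pods that are still Pending via a watch instead of polling.
val pendingPods = ConcurrentHashMap.newKeySet[String]()

val watch = client.pods()
  .inNamespace(nameSpace)
  .withLabel("spark-executor")
  .watch(new Watcher[Pod] {
    override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
      val name = pod.getMetadata.getName
      if (pod.getStatus.getPhase == "Pending") pendingPods.add(name)
      else pendingPods.remove(name)
    }
    override def onClose(cause: KubernetesClientException): Unit = {}
  })

// Before creating more executor pods, pendingPods.size gives a cheap count of
// pods from earlier scaling events that have not become ready yet.
```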
import scala.concurrent.Future

private[spark] class KubernetesClusterSchedulerBackend(
    scheduler: TaskSchedulerImpl,
Fix the formatting to conform with Spark style
.withName(svcName)
.create(svc)

// try {
Remove
This is pretty cool.
One thing - can we submit a separate PR to move all resource managers into resource-managers/yarn?
@erikerlandson For the RAT failure, you may either add the Apache license header to newly added files or add the file to
@rxin, when you say "move all resource managers", does that mean "move scheduler back-ends for mesos, yarn, etc., into some
Another external scheduler backend I'm aware of is Two Sigma's scheduler backend for the system they've created called Cook. See CoarseCookSchedulerBackend.scala
* ./build/mvn -Pkubernetes -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests package
* Ensure that you are pointing to a k8s cluster (kubectl config current-context), which you want to use with spark.
* Launch a spark-submit job:
* `./bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master k8s://default --conf spark.executor.instances=5 --conf spark.kubernetes.sparkImage=manyangled/kube-spark:dynamic http://storage.googleapis.com/foxish-spark-distro/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar 10000`
Do we need to prepare an official image for this?
There will need to be some official Apache Spark repository for images, which I presume will be up to Apache Spark to create. The exact nature of the images to be produced is still being discussed.
I did create a "semi-official" org up on Docker Hub called "k8s4spark":
https://hub.docker.com/u/k8s4spark/dashboard/
I haven't actually pushed any images to it, but we could start using it as an interim repo if people think that would be useful.
# Steps to compile

* Clone the fork of spark: https://github.com/foxish/spark/ and switch to the k8s-support branch.
I think this is not correct now?
Cross-posting a link to the proposal here.
.addNewContainer().withName("spark-executor").withImage(sparkImage)
.withImagePullPolicy("IfNotPresent")
.withCommand("/opt/executor.sh")
Another approach that we could take is to put the command in the Dockerfile and supply the container with environment variables to configure the runtime behavior. A few benefits to this:
- Transparency: Instead of needing to inspect the shim scripts to discover the specific execution behavior, the behavior is well-defined in the Dockerfile.
- Immutability: The run command does not need to be assembled by the client code every time.
- Decoupling configuration from logic: The Docker containers are only configured with properties, as opposed to needing to specify both logic (running the shim script) and configuration.
I think if we can just execute the container and pass config via something along the lines of SPARK_MASTER_OPTS (is there a SPARK_DRIVER_OPTS?), that could also implicitly handle the case of running a custom user container. It might shift the documentation of conventions toward which env vars to expect, but if we can use standard Spark env vars, that would be idiomatic.
That makes sense - essentially the Dockerfile has this line:
CMD exec ${JAVA_HOME}/bin/java -Xmx$SPARK_EXECUTOR_MEMORY org.apache.spark... --executor-id $SPARK_EXECUTOR_ID ...
etc.
And then we set these environment variables on the container.
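For illustration, a sketch of the client-side counterpart using the fabric8 ContainerBuilder already used in this diff; the variable names and the exact set of env vars are assumptions, not the PR's code:

```scala
import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder}

// Sketch: configure the executor container purely via environment variables
// and let the image's CMD assemble the actual java command line.
def buildExecutorContainer(sparkImage: String,
                           executorMemory: String,
                           executorNum: Int,
                           driverURL: String): Container = {
  new ContainerBuilder()
    .withName("spark-executor")
    .withImage(sparkImage)
    .withImagePullPolicy("IfNotPresent")
    .addNewEnv().withName("SPARK_EXECUTOR_MEMORY").withValue(executorMemory).endEnv()
    .addNewEnv().withName("SPARK_EXECUTOR_ID").withValue(executorNum.toString).endEnv()
    .addNewEnv().withName("SPARK_DRIVER_URL").withValue(driverURL).endEnv()
    .build()
}
```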
IIUC, you don't even need to throw in the -Xmx mem flags; the driver, executor, etc. are supposed to honor those (and the *_CORES), aren't they? We've been using spark-class to run the executor backends, and I think just spark-submit (client mode) for the driver. Although I think it all eventually bakes down to a JVM call.
spark-class wasn't built to handle starting executors, so it won't pick up memory flags for them. We have to execute the Java binary directly. I believe spark-class was built for running long-lived daemons like the history server and standalone components.
As for the driver - we can run spark-submit for now, but when we move to supporting uploading files from the submitter's local disk, we're going to need an in-between process in the pod to upload the local resources to it.
Actually, I think I was mistaken here - spark-class eventually calls into this class, which then forks the actual process running the passed-in class. We can probably then indeed take advantage of SPARK_JAVA_OPTS, SPARK_EXECUTOR_MEMORY, etc. Arguments like executor-id, which are passed as command line arguments and not JVM options, should probably still be set as environment variables on the container, though, with the Dockerfile handling how the command line arguments are formatted and sent to spark-class.
submitArgs ++= Vector("org.apache.spark.executor.CoarseGrainedExecutorBackend", | ||
"--driver-url", s"$driverURL", | ||
"--executor-id", s"$executorNum", | ||
"--hostname", "localhost", |
Should we use the HOSTNAME environment variable here?
I'm not sure how that plays out in a pod context. "localhost" has been working, but it's worth testing alternatives.
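If we do test alternatives, one low-risk sketch would be to prefer HOSTNAME when the pod exposes it and keep the current behavior as the fallback (illustrative only):

```scala
// Sketch: use the pod's HOSTNAME environment variable when present,
// falling back to "localhost" as today.
val executorHostname = sys.env.getOrElse("HOSTNAME", "localhost")
submitArgs ++= Vector("--hostname", executorHostname)
```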
@@ -98,7 +98,7 @@ jersey-client-2.22.2.jar
jersey-common-2.22.2.jar
jersey-container-servlet-2.22.2.jar
jersey-container-servlet-core-2.22.2.jar
jersey-guava-2.22.2.jar
jersey-guava-2.22.2.jarshaded-proto
Any particular reason why this changed? Would be curious to trace how this dependency changed.
var executorID = 0

val sparkImage = conf.get("spark.kubernetes.sparkImage")
val clientJarUri = conf.get("spark.executor.jar")
Is this used anywhere in this class?
"--driver-url", s"$driverURL", | ||
"--executor-id", s"$executorNum", | ||
"--hostname", "localhost", | ||
"--app-id", "1", // TODO: change app-id per application and pass from driver. |
We can fill this in with the applicationId() method defined in SchedulerBackend.
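For example, a sketch of the argument list with that change; applicationId() comes from SchedulerBackend, and the rest mirrors the snippet above:

```scala
// Sketch: pass the real application id instead of the hard-coded "1".
submitArgs ++= Vector(
  "--driver-url", s"$driverURL",
  "--executor-id", s"$executorNum",
  "--hostname", "localhost",
  "--app-id", applicationId())
```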
}

def stop(): Unit = {
client.pods().inNamespace(nameSpace).withName(driverName).delete()
I think if we do this here in cluster mode, we will shut ourselves down and risk not actually finishing the stop() method. It's also possible for the SparkContext object and its associated components like the scheduler to be stopped while the JVM is still doing other things afterwards. Therefore I don't think we should be deleting the driver pod here.
Since we have at least four people with an interest in working on the code, I have a workflow proposal: we all submit candidate updates to #16061 in the form of PRs against erikerlandson:spark-18278-k8s-native (recursive PRs against a PR!). I did this with Anirudh against his
At the risk of redundant announcements, I am going to present this spark-on-kube topic at the Kubernetes SIG-Apps next Monday (Dec 5). It will be similar to my earlier OpenShift briefing but the focus will be on kube specifically, and I'll update it to cover all the latest developments.
@erikerlandson are there any new commits to this branch? Or is it at the same level as @foxish's k8s-support? In case I need to rebase...
I would think that it would be easier if we continue to keep PRs and issues in foxish/spark since we have the previous PRs and issues over there. It should be easy enough to pick those commits up and update this branch once we review and merge them. It doesn't matter where it lives really, except that it seems like most of it is already in one place.
@iyanuobidele the current head of this branch should be equivalent, but it can't hurt to rebase just in case.
@foxish yesterday it occurred to me that it would've been sensible to set up a GH organization for this little consortium and run the PR from there. Presumably we still could. The drawback to making any change now is losing the dialog that's already happened here. The problem I'm interested in solving even more is ensuring that everybody who contributes to this PR gets logged somehow in the upstream commit history when this PR eventually gets merged. I don't think that would happen in the case of a typical squash-merge, although I assume it would with a rebase merge. @rxin do you have any thoughts on that?
PS: I created a GH org https://github.com/apache-spark-on-k8s in the event that there is interest.
We should acknowledge the people who have contributed to this in the merged commit, but I don't think it'd make a lot of sense to do a merge rather than a squash. The initial Spark SQL commit was a single commit even though it was a much larger project: #146
## What changes were proposed in this pull request?
* Moves yarn and mesos scheduler backends to resource-managers/ sub-directory (in preparation for https://issues.apache.org/jira/browse/SPARK-18278)
* Corresponding change in top-level pom.xml. Ref: apache#16061 (comment)

## How was this patch tested?
* Manual tests

/cc rxin
Author: Anirudh <ramanathana@google.com>
Closes apache#16092 from foxish/fix-scheduler-structure-2.
* Clone the fork of spark: https://github.com/foxish/spark/ and switch to the k8s-support branch.
* Build the project
* ./build/mvn -Pkubernetes -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests package
I think hadoop-2.4 is gone in master.
<parent>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-parent_2.11</artifactId>
  <version>2.1.0-SNAPSHOT</version>
2.1.0-SNAPSHOT???
  <relativePath>../pom.xml</relativePath>
</parent>

<artifactId>spark-kubernetes_2.11</artifactId>
${scala.binary.version}
Work on this feature has moved to https://github.com/apache-spark-on-k8s/spark. The exact diff we are working with is this: apache-spark-on-k8s#1. Feel free to provide feedback on the fork's PR. @erikerlandson can you close this PR?
+1 I think we should close this PR to avoid confusion.
The diff showing our progress so far has moved to apache-spark-on-k8s#200 for anyone following along.
What changes were proposed in this pull request?
Add support for native submission of spark jobs to a kubernetes cluster, as a first-class scheduler back-end.
How was this patch tested?
Currently, testing has been mostly informal. Integration tests will be added before completion.
Notes
The initial branch on this PR is intended to serve as an "MVP" base for integrating various contributions to a "k8s-native" effort being conducted by several interested community members, including:
* @foxish, @erikerlandson, @iyanuobidele: https://github.com/foxish/spark/tree/k8s-support/kubernetes
* @mccheah: foxish#7
Goals for integrating features include:
Instructions
Current instructions for building at the time of this submission are here:
https://github.com/foxish/spark/tree/k8s-support/kubernetes