Documentation for the current state of the world #16
Conversation
There's a bunch of things in these docs that might not be ideal states of the world even for the MVP, mostly little things. We can clean up the docs as the implementation evolves, though.
For example, if the registry host is `registry-host` and the registry is listening on port 5000:

```
cd $SPARK_HOME
```
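For context, the build-and-push steps this example presumably leads into would look roughly like the sketch below; the `Dockerfile` locations are an assumption on my part, while the image names match the ones used elsewhere in this PR:

```bash
# Build driver and executor images from the unpacked Spark distribution,
# tagged for the registry at registry-host:5000.
docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .

# Push both images so that Kubernetes nodes can pull them.
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
```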
`cd $SPARK_HOME/dist`?
This is documentation under the assumption that Spark was unpacked from a tarball, such as when it is downloaded from the Spark website.
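For what it's worth, that assumed starting point would be something like the following; the version and archive name here are illustrative, not tied to this PR:

```bash
# Download and unpack a Spark binary distribution from the Spark website,
# then treat the unpacked directory as SPARK_HOME.
tar xzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
export SPARK_HOME=$(pwd)
```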
We should also record dev-workflow docs somewhere; these aren't included in this PR just yet.
I totally agree.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
We probably shouldn't have this; I don't know how common using this will be.
I believe this is fine.
`spark.kubernetes.driver.uploads.jars` in the application's configuration, will be treated as jars that are located on
the *disk of the submitting machine*. These jars are uploaded to the driver docker container before executing the
application.
* A main application resource path that does not have a scheme or that has the scheme `file://` is assumed to be on the
It would be nice to be able to specify a main application resource on the container's disk as well. The main trouble here is how to specify that: do we create a new custom file scheme, like `docker://`, to denote that the file is in the docker image?
For the sake of usability, I believe we should support the `docker://` scheme, since 'no scheme' == `file://`.
Yep - it's not particularly great that the API expects a magical prefix, but I don't see a better option.
Docker isn't the only runtime that is supported (although it is the most common), so we could opt for something neutral like `container://` or `pod://`.
Good point @foxish -- I like `container://` better, since in my understanding a k8s pod can have multiple containers, each with independent filesystems, so `pod://` isn't precise enough.
If we were to include `container://` URLs, we could also use this kind of URL scheme for the uploaded jars and remove `spark.kubernetes.driver.uploads.jars`. I kind of like having the partitioning into two settings in this case, though, since `spark.jars` has preconceived expectations in all of the cluster managers, and there is some dissonance in making Kubernetes handle `spark.jars` in a special way. However, this then makes specifying the main resource jar inconsistent with specifying the other uploaded jars.
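To make the proposal concrete: a submission using the hypothetical `container://` scheme (not implemented in this PR) might look like the sketch below, where the main jar path points inside the driver's image; the path itself is invented for illustration:

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.SampleApplication \
  --master k8s://https://192.168.99.100 \
  container:///opt/spark/examples/main.jar
```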
Dev workflow docs probably belong under the

LGTM, great job @mccheah
```
--deploy-mode cluster \
--class com.example.applications.PluggableApplication \
--master k8s://https://192.168.99.100 \
--kubernetes-namespace spark.kubernetes.namespace=default
```
Specifying `spark.kubernetes.namespace` along with the `--kubernetes-namespace` flag seems a little redundant.
Yep, this is a typo.
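Presumably the corrected line just passes the namespace value directly, e.g. (main resource omitted, as in the quoted fragment):

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.PluggableApplication \
  --master k8s://https://192.168.99.100 \
  --kubernetes-namespace default
```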
lifted in the future include:
* Applications can only use a fixed number of executors. Dynamic allocation is not supported.
* Applications can only run in cluster mode.
* The external shuffle service cannot be used.
Dynamic allocation being unsupported also implies this, I think.
Looks very good overall @mccheah! Thanks for taking the effort.
Excellent work @mccheah! I think this is already at a really high level of quality, which is awesome. Docs like these go a long way towards bringing folks along with the platform.
I also think distributing just the small `Dockerfile` text files themselves and letting the user do with those as they wish sidesteps a lot of issues around Apache publishing, though we'll be discussing that more shortly, I think.
Excited to see the progress here!
---

Support for running on [Kubernetes](https://kubernetes.io/) is available in experimental status. The feature set is
currently limited and not well-tested.
Let's lay it on a little stronger and make a note about this not being recommended for running in production.
## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on the Docker registry. Spark
"and available in an accessible Docker registry" (thinking about on-prem, internet-isolated use cases)
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
Do you mean *may* also be specified, with a mind towards non-SSL API servers?
It must be specified as a full URL currently; this is not ideal, though, and we should update when we change the default to https.
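Concretely, that means both of the following forms are accepted today (the host is the example address used elsewhere in this PR), and the protocol portion cannot be omitted:

```bash
--master k8s://https://192.168.99.100
--master k8s://http://192.168.99.100
```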
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.

Note that applications can currently only be executed in cluster mode.
"in cluster mode, where the Spark driver and its executors are running in the cluster"
### Adding Other JARs

Spark allows users to provide dependencies that live on the driver's docker image, or that are on the local disk of the
capitalize Docker
"that live on" -> "that are bundled into"
<td><code>spark.kubernetes.submit.caCertFile</code></td>
<td>(none)</td>
<td>
CA Cert file for connecting to Kubernetes over HTTPs.
Lowercase "cert", uppercase "HTTPS".
<td><code>spark.kubernetes.submit.clientKeyFile</code></td>
<td>(none)</td>
<td>
Client key file for authenticating against the Kubernetes API server.
Are these paths? Just to local files, no https/http supported here? Let's make that explicit.
These are files living on the local disk of the submitting machine.
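For example, the settings would be passed as plain local paths; the file names here are illustrative:

```bash
# Passed as local filesystem paths on the submitting machine:
--conf spark.kubernetes.submit.caCertFile=/home/exampleuser/ca.pem
--conf spark.kubernetes.submit.clientKeyFile=/home/exampleuser/client.key
```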
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
<td>(none)</td>
<td>
Comma-separated list of jars to be sent to the driver only when submitting the application in cluster mode.
Is that bit about "only when submitting the application in cluster mode" relevant? k8s only supports cluster mode now.
When we do end up supporting client mode, this won't really apply there. I think this is fine even if redundant, as we won't have to change this part of the docs down the road.
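For illustration, the setting would be used like this; the jar paths reuse the example paths from elsewhere in this PR:

```bash
--conf spark.kubernetes.driver.uploads.driverExtraClasspath=/home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
```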
</td>
</tr>
</table>
Put an example here where we try to simplify the spark-submit command as much as possible. Something like:

For example, the first example above can be rewritten from:

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.SampleApplication \
  --master k8s://https://192.168.99.100 \
  --kubernetes-namespace default \
  --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
  --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
  --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
  /home/exampleuser/exampleapplication/main.jar
```

to

```bash
bin/spark-submit \
  --class com.example.applications.SampleApplication \
  --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
  /home/exampleuser/exampleapplication/main.jar
```

with these contents in `spark-defaults.conf`:

```
spark.master                            k8s://https://192.168.99.100
spark.submit.deployMode                 cluster
spark.kubernetes.namespace              default
spark.kubernetes.driver.docker.image    registry-host:5000/spark-driver:latest
spark.kubernetes.executor.docker.image  registry-host:5000/spark-executor:latest
```
Hm, I'm not sure what this extra example gets us that would be helpful for usage we don't have already - can you elaborate on what the goal is here? Having to include `spark-defaults.conf` as well only makes it seem even more confusing. I mainly modeled the examples off of here: https://spark.apache.org/docs/latest/submitting-applications.html
## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on an accessible Docker registry. Spark
There is no direct dependency on Docker, so s/Docker image/container image/. We can say that in the common case, users need to build Docker images and push them to their image registry.
Hm, much of the Spark code assumes that Docker is the virtualization framework being used. For example, the image is specified via the configuration setting `spark.kubernetes.driver.docker.image`. Should we also be making all of these configurations more generic?
I think it's okay for now. I'll discuss this and send a separate PR if we do need to change the wording and the command line args.
<!-- TODO master should default to https if no scheme is specified -->
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
`k8s://<api_server_url` is missing the closing `>`.
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
"at the appropriate inner URL" seems a bit unclear. We can omit that perhaps.
@foxish made the images section a little more generic and gave special note to Docker as what we provide out of the box. It's very likely that the wording and terminology could be improved there, however.

@mccheah, sg. Thanks! The rest LGTM.
```
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
```
Btw is the https necessary?
It is for now; later we want to remove it and make https the default, but optionally the user can fill in `http`.
Right now these are supported:

```
k8s://https://<host>:<port>
k8s://http://<host>:<port>
```

In #19 we want to additionally make this supported:

```
k8s://<host>:<port>
```

which would be equivalent to `k8s://https://<host>:<port>`.
Looks great to me! @foxish want to merge it?

One last nit: Virtual runtime -> Container runtime in the beginning. Will merge once that is fixed.

@foxish done
* Documentation for the current state of the world.
* Adding navigation links from other pages
* Address comments, add TODO for things that should be fixed
* Address comments, mostly making images section clearer
* Virtual runtime -> container runtime
Added a Kubernetes section to the docs. It follows how the YARN section was laid out.
Partially closes #3, but developer workflow build docs are still needed.