
Documentation for the current state of the world #16

Merged 6 commits into k8s-support-alternate-incremental from k8s-docs on Jan 13, 2017

Conversation

@mccheah commented Jan 13, 2017

Added a Kubernetes section to the docs. It follows the layout of the YARN section.

Partially closes #3, but developer workflow/build docs are still needed.

@mccheah (Author) commented Jan 13, 2017

There are a bunch of things in these docs that might not be the ideal state of the world even for the MVP, mostly little things. We can clean up the docs as the implementation evolves, though.


For example, if the registry host is `registry-host` and the registry is listening on port 5000:

cd $SPARK_HOME

Reviewer:

cd $SPARK_HOME/dist ?

Author:

This is documentation under the assumption that Spark was unpacked from a tarball, such as when it is downloaded from the Spark website.
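
As a concrete sketch of that registry example, building and pushing the driver and executor images might look like the following; the dockerfiles/ paths are illustrative assumptions about the unpacked distribution's layout, not paths mandated by this PR:

    cd $SPARK_HOME
    # Build the driver and executor images against the unpacked distribution and push them
    # to the registry at registry-host:5000. The Dockerfile locations below are assumed.
    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    docker push registry-host:5000/spark-driver:latest
    docker push registry-host:5000/spark-executor:latest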

Author:

We should also record dev-workflow docs somewhere; they aren't included in this PR just yet.

Reviewer:

I totally agree.

</td>
</tr>
<tr>
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
Author:

We probably shouldn't have this; I don't know how commonly it will be used.

Reviewer:

I believe this is fine.

`spark.kubernetes.driver.uploads.jars` in the application's configuration, will be treated as jars that are located on
the *disk of the submitting machine*. These jars are uploaded to the driver docker container before executing the
application.
* A main application resource path that does not have a scheme or that has the scheme `file://` is assumed to be on the
Author:

It would be nice to be able to specify a main application resource on the container's disk as well. The main trouble here is how to specify that: do we create a new custom file scheme, like docker:// to denote that the file is in the docker image?

Reviewer:

For the sake of usability, I believe we should support the docker:// scheme, since 'no scheme' == file://.

Author:

Yep - it's not particularly great that the API expects a magical prefix, but I don't see a better option.

Member:

Docker isn't the only runtime that is supported (although it is the most common), so, we could opt for something neutral like container://, or pod://.

Reviewer:

Good point @foxish -- I like container:// better, since in my understanding a k8s pod can have multiple containers each with independent filesystems so pod:// isn't precise enough

Author:

If we were to include container:// URLs, we could also use this kind of URL scheme for the uploaded jars and remove spark.kubernetes.driver.uploads.jars. I kind of like having the partitioning into two settings in this case, though, since spark.jars has preconceived expectations in all of the cluster managers, and there is some dissonance in making Kubernetes handle spark.jars in a special way. However, this then makes specifying the main resource jar inconsistent with specifying the other uploaded jars.
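
For illustration only, a submission using the proposed scheme might look like the sketch below; the container:// prefix and the in-image jar path are hypothetical and not implemented by this PR:

    # The container:// URI below is a hypothetical scheme pointing at a jar baked into the driver image.
    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.SampleApplication \
      --master k8s://https://192.168.99.100 \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      container:///opt/spark/examples/example-app.jar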

@mccheah (Author) commented Jan 13, 2017

Dev workflow docs probably belong under the resource-managers/kubernetes directory. I will follow up there in a separate PR.

@iyanuobidele

LGTM, great job @mccheah


--deploy-mode cluster
--class com.example.applications.PluggableApplication
--master k8s://https://192.168.99.100
--kubernetes-namespace spark.kubernetes.namespace=default
Member:

Specifying the spark.kubernetes.namespace along with the --kubernetes-namespace flag seems a little redundant.

Author:

Yep, this is a typo.
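
A hedged sketch of what the corrected excerpt might look like, assuming the flag takes just the namespace name; the image and jar paths are carried over from the other examples in this PR:

    # Corrected form of the excerpt above: the flag takes only the namespace name.
    bin/spark-submit \
      --deploy-mode cluster \
      --class com.example.applications.PluggableApplication \
      --master k8s://https://192.168.99.100 \
      --kubernetes-namespace default \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar
    # Equivalent configuration-property form, without the dedicated flag:
    #   --conf spark.kubernetes.namespace=default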

lifted in the future include:
* Applications can only use a fixed number of executors. Dynamic allocation is not supported.
* Applications can only run in cluster mode.
* The external shuffle service cannot be used.
Member:

Dynamic allocation being unsupported also implies this I think.

@foxish (Member) commented Jan 13, 2017

Looks very good overall @mccheah! Thanks for taking the effort.

@ash211 left a comment

Excellent work @mccheah! I think this is already at a really high level of quality, which is awesome. Docs like these go a long way toward bringing folks along with the platform.

I also think distributing just the small Dockerfile text files themselves, and letting the user do with those as they wish, sidesteps a lot of issues around Apache publishing, though I think we'll be discussing that more shortly.

Excited to see the progress here!

---

Support for running on [Kubernetes](https://kubernetes.io/) is available in experimental status. The feature set is
currently limited and not well-tested.
Reviewer:

let's lay it on a little stronger and make a note about this not being recommended for running in production


## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on the Docker registry. Spark
Reviewer:

"and available in an accessible Docker registry" (thinking about on-prem, internet-isolated use cases)

The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
Reviewer:

do you mean may also be specified, with a mind towards non-SSL API servers?

Author:

It must be specified as a full URL currently; this is not ideal, though, and we should update it when we change the default to https.

master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.

Note that applications can currently only be executed in cluster mode.
Reviewer:

"in cluster mode, where the Spark driver and its executors are running in the cluster"


### Adding Other JARs

Spark allows users to provide dependencies that live on the driver's docker image, or that are on the local disk of the
Reviewer:

capitalize Docker

Reviewer:

"that live on" -> "that are bundled into"

<td><code>spark.kubernetes.submit.caCertFile</code></td>
<td>(none)</td>
<td>
CA Cert file for connecting to Kubernetes over HTTPs.
Reviewer:

lowercase cert. uppercase HTTPS

<td><code>spark.kubernetes.submit.clientKeyFile</code></td>
<td>(none)</td>
<td>
Client key file for authenticating against the Kubernetes API server.
Reviewer:

Are these paths? Just to local files, with no https/http supported here? Let's make that explicit.

Author:

These are files living on the local disk of the submitting machine.
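
A sketch of passing those files on the submission command line; only the two properties that appear in this excerpt are shown, and the certificate paths are illustrative local files on the submitting machine:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://192.168.99.100 \
      --conf spark.kubernetes.submit.caCertFile=/home/exampleuser/certs/ca.pem \
      --conf spark.kubernetes.submit.clientKeyFile=/home/exampleuser/certs/client.key \
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
      /home/exampleuser/exampleapplication/main.jar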

<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
<td>(none)</td>
<td>
Comma-separated list of jars to be sent to the driver only when submitting the application in cluster mode.
Reviewer:

Is the bit about 'only when submitting the application in cluster mode' relevant? k8s only supports cluster mode now.

Author:

When we do end up supporting client mode, this won't really apply there. I think this is fine even if redundant, since we won't have to change this part of the docs down the road.

</td>
</tr>
</table>

Reviewer:

Put an example here where we try to simplify the spark-submit command as much as possible. Something like:


For example, the first example above can be rewritten from:

   bin/spark-submit
      --deploy-mode cluster
      --class com.example.applications.SampleApplication
      --master k8s://https://192.168.99.100
      --kubernetes-namespace spark.kubernetes.namespace=default 
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
      --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest
      --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest
      /home/exampleuser/exampleapplication/main.jar

to

   bin/spark-submit
      --class com.example.applications.SampleApplication
      --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
      /home/exampleuser/exampleapplication/main.jar

with these contents in spark-defaults.conf:

spark.master k8s://https://192.168.99.100
spark.submit.deployMode cluster
spark.kubernetes.namespace default
spark.kubernetes.driver.docker.image registry-host:5000/spark-driver:latest
spark.kubernetes.executor.docker.image registry-host:5000/spark-executor:latest

Author:

Hm, I'm not sure what this extra example gets us that we don't already have - can you elaborate on what the goal is here? Having to include spark-defaults.conf as well only makes it seem even more confusing. I mainly modeled the examples on https://spark.apache.org/docs/latest/submitting-applications.html


@mccheah (Author) commented Jan 13, 2017

@ash211 @foxish addressed comments. I also scattered TODOs in the docs for things that are documented as the current state of the world but that we want to change in the very near future.


## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on an accessible Docker registry. Spark
Member:

There is no direct dependency on Docker, so s/Docker image/container image/. We can say that in the common case, users need to build Docker images and push them to their image registry.

Author:

Hm, much of the Spark code assumes that Docker is the virtualization framework being used. For example, the image is specified via the configuration setting spark.kubernetes.driver.docker.image. Should we also be making all of these configurations more generic?

Member:

I think it's okay for now. I'll discuss this and send a separate PR if we do need to change the wording and the command line args.


<!-- TODO master should default to https if no scheme is specified -->
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
Member:

`k8s://<api_server_url` is missing the closing `>`.

The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
Member:

"at the appropriate inner URL" seems a bit unclear. We can omit that perhaps.

@mccheah (Author) commented Jan 13, 2017

@foxish made the images section a little more generic and gave special note to Docker as what we provide out of the box. It's very likely that the wording and terminology could be improved there, however.

@foxish (Member) commented Jan 13, 2017

@mccheah, sg. Thanks! The rest LGTM.

bin/spark-submit
--deploy-mode cluster
--class org.apache.spark.examples.SparkPi
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
Reviewer:

Btw is the https necessary?

Author:

It is for now; later we want to remove it and make https the default, but the user can optionally fill in http.

Reviewer:

Right now these are supported:

  • k8s://https://<host>:<port>
  • k8s://http://<host>:<port>

In #19 we want to additionally make this supported:

  • k8s://<host>:<port> which would be equivalent to k8s://https://<host>:<port>

@ash211 commented Jan 13, 2017

Looks great to me! @foxish want to merge it?

@foxish (Member) commented Jan 13, 2017

One last nit. Virtual runtime -> Container runtime in the beginning. Will merge once that is fixed.

@mccheah (Author) commented Jan 13, 2017

@foxish done

@foxish foxish merged commit 5c6650d into k8s-support-alternate-incremental Jan 13, 2017
@foxish foxish deleted the k8s-docs branch January 13, 2017 22:11
ash211 pushed a commit that referenced this pull request Feb 8, 2017
* Documentation for the current state of the world.

* Adding navigation links from other pages

* Address comments, add TODO for things that should be fixed

* Address comments, mostly making images section clearer

* Virtual runtime -> container runtime
ash211 pushed a commit that referenced this pull request Mar 8, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
ifilonenko referenced this pull request in bloomberg/apache-spark-on-k8s Mar 13, 2019