Documentation for the current state of the world #16
Conversation
There's a bunch of things in these docs that might not be ideal states of the world even for the MVP, mostly little things. We can clean up the docs as the implementation evolves, though.
For example, if the registry host is `registry-host` and the registry is listening on port 5000:

```
cd $SPARK_HOME
```
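For context, the build-and-push steps this example presumably leads into would look roughly like the sketch below; the `Dockerfile` locations are an assumption on my part, while the image names match the ones used elsewhere in this PR:

```bash
# Build driver and executor images from the unpacked Spark distribution,
# tagged for the registry at registry-host:5000.
docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .

# Push both images so that Kubernetes nodes can pull them.
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
```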
`cd $SPARK_HOME/dist`?
This is documentation under the assumption that Spark was unpacked from a tarball, such as when it is downloaded from the Spark website.
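For what it's worth, that assumed starting point would be something like the following; the version and archive name here are illustrative, not tied to this PR:

```bash
# Download and unpack a Spark binary distribution from the Spark website,
# then treat the unpacked directory as SPARK_HOME.
tar xzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
export SPARK_HOME=$(pwd)
```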
We should also record dev-workflow docs somewhere; these aren't included in this PR just yet.
I totally agree.
</td>
</tr>
<tr>
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
We probably shouldn't have this; I don't know how common using this will be.
I believe this is fine.
`spark.kubernetes.driver.uploads.jars` in the application's configuration, will be treated as jars that are located on
the *disk of the submitting machine*. These jars are uploaded to the driver docker container before executing the
application.
* A main application resource path that does not have a scheme or that has the scheme `file://` is assumed to be on the
It would be nice to be able to specify a main application resource on the container's disk as well. The main trouble here is how to specify that: do we create a new custom file scheme, like `docker://`, to denote that the file is in the docker image?
For the sake of usability, I believe we should support the `docker://` scheme, since 'no scheme' == `file://`.
Yep - it's not particularly great that the API expects a magical prefix, but I don't see a better option.
Docker isn't the only runtime that is supported (although it is the most common), so we could opt for something neutral like `container://` or `pod://`.
Good point @foxish -- I like `container://` better, since in my understanding a k8s pod can have multiple containers, each with independent filesystems, so `pod://` isn't precise enough.
If we were to include `container://` URLs, we could also use this kind of URL scheme for the uploaded jars and remove `spark.kubernetes.driver.uploads.jars`. I kind of like having the partitioning into two settings in this case, though, since `spark.jars` has preconceived expectations in all of the cluster managers, and there is some dissonance in making Kubernetes handle `spark.jars` in a special way. However, this then makes specifying the main resource jar inconsistent with specifying the other uploaded jars.
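To make the proposal concrete: a submission using the hypothetical `container://` scheme (not implemented in this PR) might look like the sketch below, where the main jar path points inside the driver's image; the path itself is invented for illustration:

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.SampleApplication \
  --master k8s://https://192.168.99.100 \
  container:///opt/spark/examples/main.jar
```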
Dev workflow docs probably belong under the

LGTM, great job @mccheah
```
--deploy-mode cluster \
--class com.example.applications.PluggableApplication \
--master k8s://https://192.168.99.100 \
--kubernetes-namespace spark.kubernetes.namespace=default
```
Specifying `spark.kubernetes.namespace` along with the `--kubernetes-namespace` flag seems a little redundant.
Yep, this is a typo.
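Presumably the corrected line just passes the namespace value directly, e.g. (main resource omitted, as in the quoted fragment):

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.PluggableApplication \
  --master k8s://https://192.168.99.100 \
  --kubernetes-namespace default
```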
lifted in the future include:
* Applications can only use a fixed number of executors. Dynamic allocation is not supported.
* Applications can only run in cluster mode.
* The external shuffle service cannot be used.
Dynamic allocation being unsupported also implies this, I think.
Looks very good overall @mccheah! Thanks for taking the effort.
Excellent work @mccheah! I think this is already at a really high level of quality, which is awesome. Docs like these go a long way towards bringing folks along with the platform.
I also think distributing just the small `Dockerfile` text files themselves and letting the user do with those as they wish sidesteps a lot of issues around Apache publishing, though we'll be discussing that more shortly, I think.
Excited to see the progress here!
---

Support for running on [Kubernetes](https://kubernetes.io/) is available in experimental status. The feature set is
currently limited and not well-tested.
Let's lay it on a little stronger and make a note about this not being recommended for running in production.
## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on the Docker registry. Spark
"and available in an accessible Docker registry" (thinking about on-prem, internet-isolated use cases)
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
Do you mean *may* also be specified, with a mind towards non-SSL API servers?
It must be specified as a full URL currently; this is not ideal, though, and we should update when we change the default to https.
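Concretely, that means both of the following forms are accepted today (the host is the example address used elsewhere in this PR), and the protocol portion cannot be omitted:

```bash
--master k8s://https://192.168.99.100
--master k8s://http://192.168.99.100
```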
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.

Note that applications can currently only be executed in cluster mode.
"in cluster mode, where the Spark driver and its executors are running in the cluster"
### Adding Other JARs

Spark allows users to provide dependencies that live on the driver's docker image, or that are on the local disk of the
capitalize Docker
"that live on" -> "that are bundled into"
<td><code>spark.kubernetes.submit.caCertFile</code></td>
<td>(none)</td>
<td>
CA Cert file for connecting to Kubernetes over HTTPs.
Lowercase "cert", uppercase "HTTPS".
<td><code>spark.kubernetes.submit.clientKeyFile</code></td>
<td>(none)</td>
<td>
Client key file for authenticating against the Kubernetes API server.
Are these paths? Just to local files, no https/http supported here? Let's make that explicit.
These are files living on the local disk of the submitting machine.
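For example, the settings would be passed as plain local paths; the file names here are illustrative:

```bash
# Passed as local filesystem paths on the submitting machine:
--conf spark.kubernetes.submit.caCertFile=/home/exampleuser/ca.pem
--conf spark.kubernetes.submit.clientKeyFile=/home/exampleuser/client.key
```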
<td><code>spark.kubernetes.driver.uploads.driverExtraClasspath</code></td>
<td>(none)</td>
<td>
Comma-separated list of jars to be sent to the driver only when submitting the application in cluster mode.
Is that bit about "only when submitting the application in cluster mode" relevant? k8s only supports cluster mode now.
When we do end up supporting client mode, this won't really apply there. I think this is fine even if redundant, as we won't have to change this part of the docs down the road.
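For illustration, the setting would be used like this; the jar paths reuse the example paths from elsewhere in this PR:

```bash
--conf spark.kubernetes.driver.uploads.driverExtraClasspath=/home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar
```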
</td>
</tr>
</table>
Put an example here where we try to simplify the spark-submit command as much as possible. Something like:

For example, the first example above can be rewritten from:

```bash
bin/spark-submit \
  --deploy-mode cluster \
  --class com.example.applications.SampleApplication \
  --master k8s://https://192.168.99.100 \
  --kubernetes-namespace default \
  --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
  --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
  --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
  /home/exampleuser/exampleapplication/main.jar
```

to

```bash
bin/spark-submit \
  --class com.example.applications.SampleApplication \
  --upload-jars /home/exampleuser/exampleapplication/dep1.jar,/home/exampleuser/exampleapplication/dep2.jar \
  /home/exampleuser/exampleapplication/main.jar
```

with these contents in `spark-defaults.conf`:

```
spark.master                            k8s://https://192.168.99.100
spark.submit.deployMode                 cluster
spark.kubernetes.namespace              default
spark.kubernetes.driver.docker.image    registry-host:5000/spark-driver:latest
spark.kubernetes.executor.docker.image  registry-host:5000/spark-executor:latest
```
Hm, I'm not sure what this extra example gets us that would be helpful for usage we don't have already - can you elaborate on what the goal is here? Having to include `spark-defaults.conf` as well only makes it seem even more confusing. I mainly modeled the examples off of here: https://spark.apache.org/docs/latest/submitting-applications.html
## Setting Up Docker Images

In order to run Spark on Kubernetes, a Docker image must be built and available on an accessible Docker registry. Spark
There is no direct dependency on Docker, so s/Docker image/container image/. We can say that in the common case, users need to build Docker images and push them to their image registry.
Hm, much of the Spark code assumes that Docker is the virtualization framework being used. For example, the image is specified via the configuration setting `spark.kubernetes.driver.docker.image`. Should we also be making all of these configurations more generic?
I think it's okay for now. I'll discuss this and send a separate PR if we do need to change the wording and the command line args.
<!-- TODO master should default to https if no scheme is specified -->
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
`k8s://<api_server_url` is missing the closing `>`.
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
`spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_url`. Prefixing the
master string with `k8s://` will cause the Spark application to launch on a Kubernetes cluster, where the API server is
contacted at the appropriate inner URL. The HTTP protocol must also be specified.
"at the appropriate inner URL" seems a bit unclear. We can omit that perhaps.
@foxish made the images section a little more generic and gave special note to Docker as what we provide out of the box. It's very likely that the wording and terminology could be improved there, however.

@mccheah, sg. Thanks! The rest LGTM.
```
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
```
Btw is the https necessary?
It is for now; later we want to remove it and make https the default, but optionally the user can fill in `http`.
Right now these are supported:

```
k8s://https://<host>:<port>
k8s://http://<host>:<port>
```

In #19 we want to additionally make this supported:

```
k8s://<host>:<port>
```

which would be equivalent to `k8s://https://<host>:<port>`.
Looks great to me! @foxish want to merge it?

One last nit: Virtual runtime -> Container runtime in the beginning. Will merge once that is fixed.

@foxish done
* Documentation for the current state of the world.
* Adding navigation links from other pages
* Address comments, add TODO for things that should be fixed
* Address comments, mostly making images section clearer
* Virtual runtime -> container runtime
Added a Kubernetes section to the docs. It follows how the YARN section was laid out.
Partially closes #3, but developer workflow build docs are still needed.