This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Reset many spark.kubernetes.* configurations when restarting from a Checkpoint #516

Closed
wants to merge 2 commits

Conversation

ssaavedra

@ssaavedra ssaavedra commented Sep 28, 2017

What changes were proposed in this pull request?

Due to the changes made in spawning the service attached to the driver pod, the spark.driver.bindAddress property now needs to be reset when restarting a workload. Many spark.kubernetes.* properties also change on each spark-submit, because of how the ConfigMap, secrets and other related resources get uploaded and resolved. This change enables checkpoint restoration for streaming workloads.
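
For illustration only, here is a minimal sketch of the idea behind the change, not the actual patch: drop the checkpointed values of submit-time properties and re-read them from the environment set up by the new spark-submit. The prefix list and the helper name are assumptions made up for this example.

```scala
import org.apache.spark.SparkConf

// Illustrative prefixes only: keys regenerated by each spark-submit and
// therefore stale if taken from the checkpointed configuration.
val submitTimePrefixes = Seq("spark.kubernetes.", "spark.driver.bindAddress")

// Hypothetical helper: rebuild a usable SparkConf from the checkpointed one
// by dropping stale submit-time keys and letting the current JVM's system
// properties (set by the new spark-submit) supply fresh values.
def reloadSubmitTimeProperties(checkpointed: SparkConf): SparkConf = {
  val conf = new SparkConf(loadDefaults = false).setAll(checkpointed.getAll)
  conf.getAll.foreach { case (key, _) =>
    if (submitTimePrefixes.exists(p => key.startsWith(p))) conf.remove(key)
  }
  sys.props.foreach { case (key, value) =>
    if (submitTimePrefixes.exists(p => key.startsWith(p))) conf.set(key, value)
  }
  conf
}
```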

How was this patch tested?

This patch was tested with the twitter-streaming example in AWS, using checkpoints stored in S3 via the s3a:// protocol, as supported by Hadoop.

@mccheah

mccheah commented Sep 28, 2017

How does this impact non-K8s jobs?

@ifilonenko
Member

Shouldn't this be a PR for Spark Core in the main repo? How is this k8s specific?

@ssaavedra
Author

This was not an issue in Kubernetes before because there were no services involved. When spark.driver.bindAddress is not set, it defaults to the value of spark.driver.host. I am trying to run a streaming example reliably, and this is a regression I found from v0.3.0 to v0.4.0.
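
A minimal sketch of that fallback, using the standard config keys mentioned above; the helper function itself is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch of the fallback: with no explicit bind address, the driver binds
// to whatever spark.driver.host resolves to, so a stale checkpointed value
// for either key breaks binding after a restart.
def effectiveBindAddress(conf: SparkConf): Option[String] =
  conf.getOption("spark.driver.bindAddress")
    .orElse(conf.getOption("spark.driver.host"))
```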

@mccheah

mccheah commented Sep 29, 2017

Can this be fixed in a way that is specific to Kubernetes?

Several configuration parameters related to Kubernetes need to be
reset, as they are changed with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.
@ssaavedra
Author

The first version of this PR was indeed related to code in Spark itself, not to the Kubernetes integration. However, after digging deeper, these Kubernetes-related properties also need to be reloaded.

The upstream issue with bindAddress: https://issues.apache.org/jira/browse/SPARK-22294
There is also the spark of a discussion on whether to propose a more general framework at this other issue: apache#19469

@ssaavedra ssaavedra changed the title Reset spark.driver.bindAddress when starting a Checkpoint Reset many spark.kubernetes.* configurations when restarting from a Checkpoint Oct 17, 2017
@ssaavedra
Author

Is anyone reviewing this?

@foxish
Member

foxish commented Dec 12, 2017

Sorry @ssaavedra, most of us are busy with the ongoing upstreaming effort. @mccheah @ifilonenko do you have some cycles to review this?


@mccheah mccheah left a comment


I'm not too familiar with streaming or with why we would have to do this reset - but if there's precedent for doing that with other cluster-manager-specific settings, then that's probably OK.

But some of these configurations are only used in the init container, and those configurations should be provided automatically by the ConfigMap that the driver puts into the executor pods. We should remove those extraneous configurations. I think they are:

+      "spark.kubernetes.initcontainer.downloadJarsResourceIdentifier",
 +      "spark.kubernetes.initcontainer.downloadJarsSecretLocation",
 +      "spark.kubernetes.initcontainer.downloadFilesResourceIdentifier",
 +      "spark.kubernetes.initcontainer.downloadFilesSecretLocation",
 +      "spark.kubernetes.initcontainer.remoteJars",
 +      "spark.kubernetes.initcontainer.remoteFiles"

@ifilonenko
Member

Can an integration test be added?

@ssaavedra
Author

Sorry, I haven't had time to spin up my testing environment lately.

@mccheah that will not be enough. I tried a build with those changes and, as I expected, it did not work. The reason is that when spark-submit runs, a new ConfigMap gets published in Kubernetes (with a different name than the first one). When the driver pod then starts in the cluster, it first restores the data from the checkpoint, erasing the SparkConf properties set by the latest spark-submit and restoring the original ones (except those explicitly discarded by this pull request), and only then starts the Spark context. With the new Spark context spun up, executors are brought up, but they are configured with the saved SparkConf properties and therefore look for the wrong ConfigMap name.
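
To make the failure mode concrete, here is a hypothetical sketch; the configuration key and ConfigMap names are invented for illustration and are not the actual ones generated by spark-submit:

```scala
import org.apache.spark.SparkConf

// Hypothetical key and ConfigMap names, for illustration only.
val configMapKey = "spark.kubernetes.executor.configMapName"

// The second spark-submit publishes a fresh ConfigMap and records its name
// in the driver's system properties:
sys.props(configMapKey) = "spark-conf-map-run2"

// Checkpoint restoration then overwrites the running configuration with the
// value saved during the first run:
val restored = new SparkConf(loadDefaults = false)
  .set(configMapKey, "spark-conf-map-run1")

// Executors launched from `restored` would reference a ConfigMap that no
// longer exists, unless spark.kubernetes.* keys are reset on restore.
```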

So we do need to make this change. I'm attaching a screenshot of what happens with your proposed changes.

As for an integration test of this caliber, I don't know how to set up the required machinery. I could take a look at it, but I think this is relevant to the upstreaming process, so that streaming contexts can be resumed when using Kubernetes as the cluster manager.

[screenshot from 2018-01-02 11:07:55]

@ssaavedra
Author

Ping?

@foxish
Member

foxish commented Jan 24, 2018

@ssaavedra, this looks good in general - but we're planning a major rebase of this fork onto the upstream apache/spark work, after which we can merge. For now, we could propose it upstream - I just filed https://issues.apache.org/jira/browse/SPARK-23200; can you open a PR there against apache/spark? We'll get some eyes on it that way.

@ssaavedra
Author

I'm closing this, since it's already upstream.

@ssaavedra ssaavedra closed this Jan 26, 2018
asfgit pushed a commit to apache/spark that referenced this pull request Sep 19, 2018
Several configuration parameters related to Kubernetes need to be
reset, as they are changed with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.

## What changes were proposed in this pull request?

When using the Kubernetes cluster manager and spawning a Streaming workload, it is important to reset many spark.kubernetes.* properties that are generated by spark-submit but would get overwritten when restoring a Checkpoint. This is because the spark-submit codepath creates Kubernetes resources, such as a ConfigMap and a Secret, with autogenerated names, and the previously saved names will no longer resolve.

In short, this change enables checkpoint restoration for streaming workloads, and thus enables Spark Streaming workloads on Kubernetes, which previously could not be restored from a checkpoint if the workload went down.

## How was this patch tested?

This patch would benefit from testing in different k8s clusters.

This is similar to the YARN-related code for resetting a Spark Streaming workload, but for the Kubernetes scheduler. This PR removes the init-container properties that existed before, because they have since been removed in master.

For a previous discussion, see the non-rebased work at: apache-spark-on-k8s#516

Closes #22392 from ssaavedra/fix-checkpointing-master.

Authored-by: Santiago Saavedra <santiagosaavedra@gmail.com>
Signed-off-by: Yinan Li <ynli@google.com>
(cherry picked from commit 497f00f)
Signed-off-by: Yinan Li <ynli@google.com>
asfgit pushed a commit to apache/spark that referenced this pull request Sep 19, 2018