This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Reset many spark.kubernetes.* configurations when restarting from a Checkpoint #516

Closed
wants to merge 2 commits

Conversation

ssaavedra

@ssaavedra ssaavedra commented Sep 28, 2017

What changes were proposed in this pull request?

Due to the changes made in spawning the service attached to the driver pod, the spark.driver.bindAddress property now needs to be reset when restarting a workload. Many spark.kubernetes.* properties also change on each spark-submit, because of how the ConfigMap, secrets and other related resources get uploaded and resolved. This change enables checkpoint restoration for streaming workloads.
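
For illustration only, here is a minimal sketch of the idea behind the change, not the actual patch: drop the checkpointed values of submit-time properties and re-read them from the environment set up by the new spark-submit. The prefix list and the helper name are assumptions made up for this example.

```scala
import org.apache.spark.SparkConf

// Illustrative prefixes only: keys regenerated by each spark-submit and
// therefore stale if taken from the checkpointed configuration.
val submitTimePrefixes = Seq("spark.kubernetes.", "spark.driver.bindAddress")

// Hypothetical helper: rebuild a usable SparkConf from the checkpointed one
// by dropping stale submit-time keys and letting the current JVM's system
// properties (set by the new spark-submit) supply fresh values.
def reloadSubmitTimeProperties(checkpointed: SparkConf): SparkConf = {
  val conf = new SparkConf(loadDefaults = false).setAll(checkpointed.getAll)
  conf.getAll.foreach { case (key, _) =>
    if (submitTimePrefixes.exists(p => key.startsWith(p))) conf.remove(key)
  }
  sys.props.foreach { case (key, value) =>
    if (submitTimePrefixes.exists(p => key.startsWith(p))) conf.set(key, value)
  }
  conf
}
```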

How was this patch tested?

This patch was tested with the twitter-streaming example in AWS, using checkpoints stored in S3 via the s3a:// protocol, as supported by Hadoop.

@mccheah

mccheah commented Sep 28, 2017

How does this impact non-K8s jobs?

@ifilonenko
Member

Shouldn't this be a PR for Spark Core in the main repo? How is this k8s specific?

@ssaavedra
Author

This was not an issue in Kubernetes before because there were no services involved. When spark.driver.bindAddress is not set, it defaults to the value of spark.driver.host. I am trying to run a streaming example reliably, and this is a regression I found from v0.3.0 to v0.4.0.
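
A minimal sketch of that fallback, using the standard config keys mentioned above; the helper function itself is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch of the fallback: with no explicit bind address, the driver binds
// to whatever spark.driver.host resolves to, so a stale checkpointed value
// for either key breaks binding after a restart.
def effectiveBindAddress(conf: SparkConf): Option[String] =
  conf.getOption("spark.driver.bindAddress")
    .orElse(conf.getOption("spark.driver.host"))
```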

@mccheah

mccheah commented Sep 29, 2017

Can this be fixed in a way that is specific to Kubernetes?

Several configuration parameters related to Kubernetes need to be
reset, as they are changed with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.
@ssaavedra
Author

The first version of this PR was indeed related to code in Spark itself, not to the Kubernetes integration. However, after digging deeper, these Kubernetes-related properties also need to be reloaded.

The upstream issue with bindAddress: https://issues.apache.org/jira/browse/SPARK-22294
There is also the spark of a discussion on whether to propose a more general framework at this other issue: apache#19469

@ssaavedra ssaavedra changed the title Reset spark.driver.bindAddress when starting a Checkpoint Reset many spark.kubernetes.* configurations when restarting from a Checkpoint Oct 17, 2017
@ssaavedra
Author

Is anyone reviewing this?

@foxish
Member

foxish commented Dec 12, 2017

Sorry @ssaavedra, most of us are busy with the ongoing upstreaming effort. @mccheah @ifilonenko do you have some cycles to review this?


@mccheah mccheah left a comment


I'm not too familiar with streaming or with why we would have to do this reset - but if there's precedent for doing that with other cluster-manager-specific settings, then that's probably OK.

But some of these configurations are only used in the init container, and those configurations should be provided automatically by the ConfigMap that the driver puts into the executor pods. We should remove those extraneous configurations. I think they are:

+      "spark.kubernetes.initcontainer.downloadJarsResourceIdentifier",
 +      "spark.kubernetes.initcontainer.downloadJarsSecretLocation",
 +      "spark.kubernetes.initcontainer.downloadFilesResourceIdentifier",
 +      "spark.kubernetes.initcontainer.downloadFilesSecretLocation",
 +      "spark.kubernetes.initcontainer.remoteJars",
 +      "spark.kubernetes.initcontainer.remoteFiles"

@ifilonenko
Member

Can an integration test be added?

@ssaavedra
Author

Sorry, I haven't had time to spin up my testing environment lately.

@mccheah that will not be enough. I tried a build with those changes and, as I expected, it did not work. The reason is that when spark-submit runs, a new ConfigMap gets published in Kubernetes (with a different name than the first one). When the driver pod then starts in the cluster, it first restores the data from the checkpoint, erasing the SparkConf properties set by the latest spark-submit and restoring the original ones (except those explicitly discarded by this pull request), and only then starts the Spark context. With the new Spark context spun up, executors are brought up, but they are configured with the saved SparkConf properties and therefore look for the wrong ConfigMap name.
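
To make the failure mode concrete, here is a hypothetical sketch; the configuration key and ConfigMap names are invented for illustration and are not the actual ones generated by spark-submit:

```scala
import org.apache.spark.SparkConf

// Hypothetical key and ConfigMap names, for illustration only.
val configMapKey = "spark.kubernetes.executor.configMapName"

// The second spark-submit publishes a fresh ConfigMap and records its name
// in the driver's system properties:
sys.props(configMapKey) = "spark-conf-map-run2"

// Checkpoint restoration then overwrites the running configuration with the
// value saved during the first run:
val restored = new SparkConf(loadDefaults = false)
  .set(configMapKey, "spark-conf-map-run1")

// Executors launched from `restored` would reference a ConfigMap that no
// longer exists, unless spark.kubernetes.* keys are reset on restore.
```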

So we do need to make this change. I'm attaching a screenshot of what happens with your proposed changes.

As for an integration test of this caliber, I don't know how to set up the required machinery. I could take a look at it, but I think this is relevant to the upstreaming process, so that streaming contexts can be resumed when using Kubernetes as the cluster manager.

[screenshot from 2018-01-02 11:07:55]

@ssaavedra
Author

Ping?

@foxish
Member

foxish commented Jan 24, 2018

@ssaavedra, this looks good in general - but we're planning a major rebase of this fork onto the upstream apache/spark work, after which we can merge. For now, we could propose it upstream - I just filed https://issues.apache.org/jira/browse/SPARK-23200; can you open a PR there against apache/spark? We'll get some eyes on it that way.

@ssaavedra
Author

I'm closing this, since it's already upstream.

@ssaavedra ssaavedra closed this Jan 26, 2018
asfgit pushed a commit to apache/spark that referenced this pull request Sep 19, 2018
Several configuration parameters related to Kubernetes need to be
reset, as they are changed with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.

## What changes were proposed in this pull request?

When using the Kubernetes cluster manager and spawning a Streaming workload, it is important to reset many spark.kubernetes.* properties that are generated by spark-submit but would get overwritten when restoring a Checkpoint. This is because the spark-submit codepath creates Kubernetes resources, such as a ConfigMap and a Secret, with autogenerated names, and the previously saved names will no longer resolve.

In short, this change enables checkpoint restoration for streaming workloads, and thus enables Spark Streaming workloads on Kubernetes, which previously could not be restored from a checkpoint if the workload went down.

## How was this patch tested?

This patch would benefit from testing in different k8s clusters.

This is similar to the YARN-related code for resetting a Spark Streaming workload, but for the Kubernetes scheduler. This PR removes the init-container properties that existed before, because they have since been removed in master.

For a previous discussion, see the non-rebased work at: apache-spark-on-k8s#516

Closes #22392 from ssaavedra/fix-checkpointing-master.

Authored-by: Santiago Saavedra <santiagosaavedra@gmail.com>
Signed-off-by: Yinan Li <ynli@google.com>
(cherry picked from commit 497f00f)
Signed-off-by: Yinan Li <ynli@google.com>
asfgit pushed a commit to apache/spark that referenced this pull request Sep 19, 2018