[SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI. #1566

Closed
wants to merge 1 commit into from

Conversation

kayousterhout
Contributor

Due to problems with when we update runningStages (in DAGScheduler.scala)
and how we decide to send a SparkListenerStageCompleted message to
SparkListeners, sometimes stages can be shown as "running" in the UI forever
(even after they have failed). This issue can manifest when stages are
resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable
tasks. The problem also resulted in a (small) memory leak in the DAGScheduler,
where stages can stay in runningStages forever. This commit fixes
that problem and adds a unit test.

Thanks @tsudukim for helping to look into this issue!

cc @markhamstra @rxin

@SparkQA

SparkQA commented Jul 24, 2014

QA tests have started for PR 1566. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17096/consoleFull

@@ -710,7 +710,6 @@ class DAGScheduler(
         if (missing == Nil) {
           logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
           submitMissingTasks(stage, jobId.get)
-          runningStages += stage
Contributor Author

So just to clarify what's going on here: prior to my change, we added a stage to runningStages here, after calling submitMissingTasks (so after the code I modified below gets executed). This could lead to a memory leak: if the stage needed to be aborted in submitMissingTasks (due to a NotSerializableException, for example), it would never be removed from runningStages. It also meant that the DAGScheduler sent a SparkListenerStageSubmitted event to the UI but never a SparkListenerStageCompleted event, because on line 1072 we only send SparkListenerStageCompleted if the stage is in runningStages.
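
The ordering problem described above can be reproduced in a few lines. What follows is only a toy sketch, not the DAGScheduler code: `ToyScheduler`, `StageSubmitted`, `StageCompleted`, and the `addBeforeSubmit` flag are hypothetical stand-ins, and stages are reduced to integer IDs. It assumes, as described in the comment, that a completion event is only emitted for stages already present in `runningStages`.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the listener events; not Spark's classes.
sealed trait Event
case class StageSubmitted(id: Int) extends Event
case class StageCompleted(id: Int) extends Event

// addBeforeSubmit = false mimics the old ordering (add after submitMissingTasks);
// addBeforeSubmit = true registers the stage before any abort can happen.
class ToyScheduler(addBeforeSubmit: Boolean) {
  val runningStages = mutable.HashSet[Int]()
  val events = mutable.ArrayBuffer[Event]()

  // Mirrors the guard described above: a completion event is sent
  // only if the stage is currently in runningStages.
  private def abortStage(id: Int): Unit = {
    if (runningStages.contains(id)) {
      events += StageCompleted(id)
      runningStages -= id
    }
  }

  // Announces the stage, then aborts it if its tasks cannot be serialized.
  private def submitMissingTasks(id: Int, serializable: Boolean): Unit = {
    events += StageSubmitted(id)
    if (!serializable) abortStage(id)
  }

  def submitStage(id: Int, serializable: Boolean): Unit = {
    if (addBeforeSubmit) runningStages += id
    submitMissingTasks(id, serializable)
    if (!addBeforeSubmit) runningStages += id
  }
}

object ZombieStageDemo {
  def main(args: Array[String]): Unit = {
    val oldOrdering = new ToyScheduler(addBeforeSubmit = false)
    oldOrdering.submitStage(1, serializable = false)
    println(s"old ordering: running=${oldOrdering.runningStages}, events=${oldOrdering.events}")

    val newOrdering = new ToyScheduler(addBeforeSubmit = true)
    newOrdering.submitStage(1, serializable = false)
    println(s"new ordering: running=${newOrdering.runningStages}, events=${newOrdering.events}")
  }
}
```

With the old ordering, the aborted stage is added to `runningStages` after the abort has already been skipped, so it stays there and the listener never hears a completion event. With the stage registered first, the abort finds it, emits the completion event, and removes it.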

@markhamstra
Contributor

Makes sense. LGTM

@kayousterhout
Contributor Author

Thanks for the quick review @markhamstra !

@SparkQA

SparkQA commented Jul 24, 2014

QA results for PR 1566:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17096/consoleFull

@asfgit asfgit closed this in 37ad3b7 Jul 25, 2014
@mateiz
Contributor

mateiz commented Jul 25, 2014

Looks good to me too. I've merged this.

@mateiz
Contributor

mateiz commented Jul 25, 2014

BTW I've merged this only into 1.1 because the patch didn't apply cleanly on 1.0. If you think it's important, we can also add it to 1.0.x, but it doesn't seem like that big of a showstopper.

@kayousterhout
Contributor Author

Yeah that seems fine to me -- thanks Matei!

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Due to problems with when we update runningStages (in DAGScheduler.scala)
and how we decide to send a SparkListenerStageCompleted message to
SparkListeners, sometimes stages can be shown as "running" in the UI forever
(even after they have failed).  This issue can manifest when stages are
resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable
tasks. The problem also resulted in a (small) memory leak in the DAGScheduler,
where stages can stay in runningStages forever. This commit fixes
that problem and adds a unit test.

Thanks tsudukim for helping to look into this issue!

cc markhamstra rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#1566 from kayousterhout/dag_fix and squashes the following commits:

217d74b [Kay Ousterhout] [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.