[SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI. #1566
Due to problems with when we update runningStages (in DAGScheduler.scala) and how we decide to send a SparkListenerStageCompleted message to SparkListeners, sometimes stages can be shown as "running" in the UI forever (even after they have failed). This issue can manifest when stages are resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable tasks. The problem also resulted in a (small) memory leak in the DAGScheduler, where stages can stay in runningStages forever. This commit fixes that problem and adds a unit test.
QA tests have started for PR 1566. This patch merges cleanly.
@@ -710,7 +710,6 @@ class DAGScheduler(
       if (missing == Nil) {
         logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
         submitMissingTasks(stage, jobId.get)
-        runningStages += stage
So just to clarify what's going on here: prior to my change, we added a stage to runningStages here, after calling submitMissingTasks (so after the code I modified below gets executed). This could lead to a memory leak (if the stage needed to be aborted in submitMissingTasks, due to a NotSerializableException for example, because then it would never be removed from runningStages). It also meant that the DAGScheduler sent a SparkListenerStageSubmitted event to the UI, but never a SparkListenerStageCompleted (because, on line 1072, we only send a SparkListenerStageCompleted event if the stage is in runningStages).
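The ordering problem described above can be illustrated with a small, self-contained sketch. This is a toy model and not the real DAGScheduler: `MiniScheduler`, its `Stage` case class, the `addBeforeSubmit` flag, and the string events are all hypothetical names made up for illustration. It assumes only what the comment states: a completion event is sent only for stages found in runningStages, and the old code added the stage after submitMissingTasks returned.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a stage; not Spark's Stage class.
final case class Stage(id: Int, serializable: Boolean)

// Toy scheduler. addBeforeSubmit = false models the old ordering,
// addBeforeSubmit = true models registering the stage before any abort can run.
class MiniScheduler(addBeforeSubmit: Boolean) {
  val runningStages = mutable.HashSet[Stage]()
  val events = mutable.ArrayBuffer[String]()

  // Mirrors the guard described in the comment: a completion event is
  // only posted for stages currently tracked in runningStages.
  private def abortStage(stage: Stage): Unit = {
    if (runningStages.contains(stage)) {
      runningStages -= stage
      events += s"StageCompleted(${stage.id})"
    }
  }

  private def submitMissingTasks(stage: Stage): Unit = {
    events += s"StageSubmitted(${stage.id})"
    // Stand-in for the NotSerializableException path.
    if (!stage.serializable) abortStage(stage)
  }

  def submitStage(stage: Stage): Unit = {
    if (addBeforeSubmit) {
      runningStages += stage
      submitMissingTasks(stage)
    } else {
      submitMissingTasks(stage)
      runningStages += stage // old ordering: added even if already aborted
    }
  }
}
```

With the old ordering, an aborted stage ends up in `runningStages` with a submitted event and no completed event, which is the "zombie" behavior; with the stage registered first, the abort path removes it and posts the completion event.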
Makes sense. LGTM
Thanks for the quick review @markhamstra !
QA results for PR 1566:
Looks good to me too. I've merged this.
BTW I've merged this only into 1.1 because the patch didn't apply cleanly on 1.0. If you think it's important, we can also add it to 1.0.x, but it doesn't seem like that big of a showstopper.
Yeah that seems fine to me -- thanks Matei!
[SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.

Due to problems with when we update runningStages (in DAGScheduler.scala) and how we decide to send a SparkListenerStageCompleted message to SparkListeners, sometimes stages can be shown as "running" in the UI forever (even after they have failed). This issue can manifest when stages are resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable tasks. The problem also resulted in a (small) memory leak in the DAGScheduler, where stages can stay in runningStages forever. This commit fixes that problem and adds a unit test.

Thanks tsudukim for helping to look into this issue!

cc markhamstra rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes apache#1566 from kayousterhout/dag_fix and squashes the following commits:

217d74b [Kay Ousterhout] [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.