[SPARK-2567] Resubmitted stage sometimes remains as active stage in the web UI #1516

Closed
wants to merge 1 commit into from

Conversation

tsudukim
Contributor

Moved the line that posts SparkListenerStageSubmitted to after the checks of tasks size and serializability.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Jul 23, 2014

How does this affect the final UI?

@tsudukim
Contributor Author

The JIRA has a screenshot of what the original code produces:
https://issues.apache.org/jira/browse/SPARK-2567
The screenshot was taken after the job completed, but one stage remained listed as an Active Stage forever.
It shouldn't be displayed in the web UI at all, because the corresponding new TaskSet is not submitted and even stage.newAttemptId() isn't called.
This ghost stage sometimes appears when a stage is resubmitted, so this PR changes the code to prevent the web UI from showing it.
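
To make the ghost-stage behavior concrete, here is a minimal toy model in Python. It is not Spark's DAGScheduler code: `ToyListener`, `submit_missing_tasks`, and the `post_before_check` flag are hypothetical names, and the sketch assumes the resubmitted stage ends up with no tasks to run. It only illustrates why posting the submitted event before the check can leave a stage listed as active.

```python
class ToyListener:
    """Hypothetical stand-in for a UI listener that tracks active stages."""

    def __init__(self):
        self.active = set()

    def on_stage_submitted(self, stage_id):
        self.active.add(stage_id)

    def on_stage_completed(self, stage_id):
        self.active.discard(stage_id)


def submit_missing_tasks(stage_id, tasks, listener, post_before_check):
    """Toy submission path; returns True if a task set was actually submitted."""
    if post_before_check:
        # Original ordering in this sketch: the event is posted unconditionally.
        listener.on_stage_submitted(stage_id)
    if not tasks:
        # Assumption: nothing to run, so no task set is submitted and
        # no completion event will ever follow for this attempt.
        return False
    if not post_before_check:
        # Reordered: the event is posted only once the checks have passed.
        listener.on_stage_submitted(stage_id)
    return True


before = ToyListener()
submit_missing_tasks(1, [], before, post_before_check=True)

after = ToyListener()
submit_missing_tasks(1, [], after, post_before_check=False)

print(before.active)  # {1}: the stage stays "active" with nothing running
print(after.active)   # set(): no ghost stage
```

When the task list is non-empty, both orderings record the stage as active, so the reordering only changes the case where nothing is submitted.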

@markhamstra
Contributor

This appears to be a reversion of d58502a that ignores and misapplies the comment regarding ordering (which I don't completely understand).

@xiajunluan ?

@tsudukim
Contributor Author

Hmm... I hadn't noticed that.
I'm going to rerun the test mentioned in @xiajunluan's commit comment to confirm.

@rxin
Contributor

rxin commented Jul 23, 2014

Actually I aim to fix this in #1545

@tsudukim
Contributor Author

The tests all passed again.
If @xiajunluan's commit only aimed to avoid the unit test error, I think it should be reverted as in this PR. But I wonder whether it had another aim.
@xiajunluan, do you remember?

@tsudukim
Contributor Author

Hi @rxin, thank you for following this ticket, but couldn't we separate those problems into different PRs? SPARK-2298 is not about this problem.
Otherwise I think it will be hard to trace later why the code was modified and what discussion took place (just as we are now wondering about the intent of last year's commit).

@rxin
Contributor

rxin commented Jul 23, 2014

Actually the fix looks good.

@rxin
Contributor

rxin commented Jul 23, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Jul 23, 2014

QA tests have started for PR 1516. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17063/consoleFull

@SparkQA

SparkQA commented Jul 23, 2014

QA results for PR 1516:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17063/consoleFull

@kayousterhout
Contributor

I looked into this a bit more, and this change doesn't quite fix the problem in the right way. That's partially because of what @markhamstra pointed out (we shouldn't send a SparkListenerStageCompleted event before sending a corresponding SparkListenerStageSubmitted event, which I think is what the mysterious comment was getting at), and partially because the bigger underlying problem is that runningStages isn't updated at the right time (which also leads to a memory leak). I submitted an alternate fix in #1566. Let me know what you all think.
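
The ordering constraint described above can be sketched as a small check over an event log. This is a hypothetical Python illustration, not Spark's listener API: the `(kind, stage_id)` tuples and the function name `find_unmatched_completions` are assumptions made for the example. It flags completions that arrive without a prior submission, and reports stages that were submitted but never completed, which is roughly how a stage could linger as active.

```python
def find_unmatched_completions(events):
    """Scan (kind, stage_id) events, where kind is 'submitted' or 'completed'.

    Returns a pair:
      - stage ids whose 'completed' event had no earlier matching 'submitted'
      - stage ids that were submitted but never completed
    """
    open_stages = set()
    violations = []
    for kind, stage_id in events:
        if kind == "submitted":
            open_stages.add(stage_id)
        elif kind == "completed":
            if stage_id in open_stages:
                open_stages.remove(stage_id)
            else:
                # Completed with no corresponding submitted event before it.
                violations.append(stage_id)
    return violations, open_stages


# Hypothetical event log: stage 1 completes without being submitted,
# and stage 2 is submitted but never completes.
log = [
    ("submitted", 0),
    ("completed", 0),
    ("completed", 1),
    ("submitted", 2),
]
violations, still_open = find_unmatched_completions(log)
print(violations)  # [1]
print(still_open)  # {2}
```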

@tsudukim
Contributor Author

SPARK-2567 is resolved by #1566.

@tsudukim tsudukim closed this Jul 28, 2014

6 participants