[SPARK-20163] Kill all running tasks in a stage in case of fetch failure by sitalkedia · Pull Request #17485 · apache/spark

sitalkedia · 2017-03-30T20:40:51Z

What changes were proposed in this pull request?

Currently, the scheduler does not kill the running tasks in a stage when it encounters a fetch failure. As a result, we might end up running many duplicate tasks in the cluster. There is already a TODO in TaskSetManager to kill all running tasks, which has not been implemented.

How was this patch tested?

Unit tests.

sitalkedia · 2017-03-30T20:52:56Z

cc - @kayousterhout, @squito, @tgravescs, @markhamstra

tgravescs · 2017-03-30T20:59:09Z

See the discussion on the mailing list. We now have 4 different JIRAs for handling fetch failures. I think we should get a design for the entire thing first.

Personally, I don't want to kill the running ones, as they have done useful work.

markhamstra · 2017-03-30T21:12:48Z

core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala

+              sched.backend.killTask(
+                attemptInfo.taskId,
+                attemptInfo.executorId,
+                interruptThread = true,


That's not valid. We don't know that this can be done safely, which is why spark.job.interruptOnCancel defaults to false. See SPARK-17064.

sitalkedia · 2017-03-31

I see, @markhamstra. Does it make sense to do it only if spark.job.interruptOnCancel is enabled?

markhamstra · 2017-03-31

We can do it then, but there is still the question of whether we should do it. That discussion belongs in SPARK-20178.
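For readers following the thread, the idea floated above (interrupt the task thread only when spark.job.interruptOnCancel is enabled for the job) can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: `AttemptInfo`, `KillRequest`, `RecordingBackend`, and `KillOnFetchFailure` are hypothetical stand-ins rather than Spark classes, and the three-argument `killTask` only mirrors the arguments visible in the diff fragment. It does not address the open question of whether running tasks should be killed at all.

```scala
import java.util.Properties
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for illustration; these are not Spark's classes.
final case class AttemptInfo(taskId: Long, executorId: String, running: Boolean)
final case class KillRequest(taskId: Long, executorId: String, interruptThread: Boolean)

// Records kill requests instead of contacting executors.
final class RecordingBackend {
  val requests: ArrayBuffer[KillRequest] = ArrayBuffer.empty[KillRequest]

  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    requests += KillRequest(taskId, executorId, interruptThread)
}

object KillOnFetchFailure {
  // Property name taken from the review comment above.
  val InterruptOnCancel = "spark.job.interruptOnCancel"

  // A missing property set or a missing key is treated as "false",
  // matching the default mentioned in the review.
  def shouldInterrupt(jobProps: Properties): Boolean =
    jobProps != null && jobProps.getProperty(InterruptOnCancel, "false").toBoolean

  // Ask the backend to kill every still-running attempt.
  // Returns the number of kill requests issued.
  def killRunningAttempts(
      attempts: Seq[AttemptInfo],
      backend: RecordingBackend,
      jobProps: Properties): Int = {
    val interrupt = shouldInterrupt(jobProps)
    val running = attempts.filter(_.running)
    running.foreach { a =>
      backend.killTask(a.taskId, a.executorId, interruptThread = interrupt)
    }
    running.size
  }
}
```

With this shape, a job that never opted in still has its running attempts asked to stop, but without a thread interrupt; only jobs that set the property get `interruptThread = true`.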

SparkQA · 2017-03-30T23:33:44Z

Test build #75402 has finished for PR 17485 at commit ec2ac34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sitalkedia · 2017-03-31T00:24:43Z

> See the discussion on the mailing list. We now have 4 different JIRAs for handling fetch failures. I think we should get a design for the entire thing first.

Sure @tgravescs, let me put out a design doc with my initial thoughts on it.

## What changes were proposed in this pull request?

This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and there are some stale PRs whose JIRA tickets have already been closed or whose JIRA tickets do not exist (also, they seem not to be minor issues).

// Open PRs whose JIRA tickets have already been closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets do not exist and that are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

## How was this patch tested?

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

[SPARK-20163] Kill all running tasks in a stage in case of fetch failure

ec2ac34

sitalkedia mentioned this pull request Mar 30, 2017

[SPARK-14649][CORE] DagScheduler should not run duplicate tasks on fe… #17297

Closed

markhamstra reviewed Mar 30, 2017

maropu mentioned this pull request Apr 23, 2017

[BUILD] Close stale PRs #17734

Closed

asfgit closed this in e9f9715 Apr 24, 2017

sitalkedia deleted the kill_tasks_on_stage_failure branch April 25, 2017 10:18

sitalkedia wants to merge 1 commit into apache:master from sitalkedia:kill_tasks_on_stage_failure

4 participants
