[SPARK-20163] Kill all running tasks in a stage in case of fetch failure#17485
sitalkedia wants to merge 1 commit into apache:master
cc @kayousterhout, @squito, @tgravescs, @markhamstra
See the discussion on the mailing list. We now have four different JIRAs for handling fetch failures. I think we should get a design for the entire thing first. Personally, I don't want to kill the running tasks, as they have done useful work.
    sched.backend.killTask(
      attemptInfo.taskId,
      attemptInfo.executorId,
      interruptThread = true,
That's not valid. We don't know that this can be done safely, which is why spark.job.interruptOnCancel defaults to false. See SPARK-17064.
I see. @markhamstra, does it make sense to do it only if spark.job.interruptOnCancel is enabled?
We can do it then, but there is still the question of whether we should do it. That discussion belongs in SPARK-20178.
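The gating suggested in this thread could look roughly like the sketch below. It is not the patch's code: `AttemptInfo`, `Backend`, `RecordingBackend` and `FetchFailureKill` are hypothetical stand-ins for Spark's internal types, and the three-argument `killTask` only mirrors the arguments visible in the diff excerpt above. The sketch assumes the flag is read from the job's properties and treats an unset value as false, the default mentioned above.

```scala
import java.util.Properties
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for Spark's internal types; not the real classes.
final case class AttemptInfo(taskId: Long, executorId: String, running: Boolean)

trait Backend {
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit
}

// Records kill requests instead of contacting executors, so the result can be inspected.
final class RecordingBackend extends Backend {
  val kills = ArrayBuffer.empty[(Long, String, Boolean)]
  override def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    kills += ((taskId, executorId, interruptThread))
}

object FetchFailureKill {
  val InterruptOnCancel = "spark.job.interruptOnCancel"

  // An unset property (or missing Properties object) means "do not interrupt".
  def shouldInterrupt(props: Properties): Boolean =
    Option(props)
      .flatMap(p => Option(p.getProperty(InterruptOnCancel)))
      .exists(_.toBoolean)

  // Ask the backend to kill every attempt that is still running,
  // interrupting the task thread only when the job opted in.
  def killRunning(attempts: Seq[AttemptInfo], backend: Backend, props: Properties): Unit = {
    val interrupt = shouldInterrupt(props)
    attempts.filter(_.running).foreach { a =>
      backend.killTask(a.taskId, a.executorId, interruptThread = interrupt)
    }
  }
}
```

With this shape, a job that never sets the property still has its running attempts killed, but without a thread interrupt. Whether killing should happen at all is the open question deferred to SPARK-20178.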
Test build #75402 has finished for PR 17485 at commit
Sure @tgravescs, let me put out a design doc with my initial thoughts on it.
This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and there are some stale PRs whose JIRA tickets have already been closed or whose JIRA tickets do not exist (also, they seem not to be minor issues).

// Open PRs whose JIRA tickets have already been closed
Closes apache#11785 Closes apache#13027 Closes apache#13614 Closes apache#13761 Closes apache#15197 Closes apache#14006 Closes apache#12576 Closes apache#15447 Closes apache#13259 Closes apache#15616 Closes apache#14473 Closes apache#16638 Closes apache#16146 Closes apache#17269 Closes apache#17313 Closes apache#17418 Closes apache#17485 Closes apache#17551 Closes apache#17463 Closes apache#17625

// Open PRs whose JIRA tickets do not exist and that are not minor issues
Closes apache#10739 Closes apache#15193 Closes apache#15344 Closes apache#14804 Closes apache#16993 Closes apache#17040 Closes apache#15180 Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5
What changes were proposed in this pull request?
Currently, the scheduler does not kill the running tasks in a stage when it encounters a fetch failure. As a result, we might end up running many duplicate tasks in the cluster. There is already a TODO in TaskSetManager to kill all running tasks, which has not been implemented.
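To make the duplicate-task concern concrete, here is a deliberately simplified model. It is not Spark's scheduler code, and `StageAttempt` and `DuplicateTasks` are hypothetical names. It assumes that a retried stage reruns every partition that has not finished, so any partition still running in the failed attempt ends up with two live tasks unless the old one is killed.

```scala
// Simplified, hypothetical model of one stage attempt; not Spark's TaskSetManager.
final case class StageAttempt(
    attemptId: Int,
    finished: Set[Int], // partitions whose tasks completed successfully
    running: Set[Int])  // partitions whose tasks are still executing

object DuplicateTasks {
  // Partitions a retry has to run: everything that has not finished yet.
  def missingPartitions(numPartitions: Int, failed: StageAttempt): Set[Int] =
    (0 until numPartitions).toSet -- failed.finished

  // Partitions that would run twice if the failed attempt's tasks are left alone.
  def duplicated(numPartitions: Int, failed: StageAttempt): Set[Int] =
    missingPartitions(numPartitions, failed) intersect failed.running

  // Model of the proposed change: after a fetch failure nothing from the
  // failed attempt keeps running, so the retry has nothing to overlap with.
  def killRunning(failed: StageAttempt): StageAttempt =
    failed.copy(running = Set.empty)
}
```

For example, with 5 partitions where 0 and 1 have finished and 2 and 3 are still running, the retry must run partitions 2, 3 and 4, and partitions 2 and 3 would be duplicated. The model ignores the objection raised in the conversation that the running tasks may already have done useful work.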
How was this patch tested?
Unit tests.