[SPARK-27065][CORE] avoid more than one active task set managers for a stage #23927
Changes from all commits: 424a3c8, 53c6ed8, f94809d, 0ca733d, 07d7de9, 58f646e
@@ -201,28 +201,39 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
     // Even if one of the task sets has not-serializable tasks, the other task set should
     // still be processed without error
     taskScheduler.submitTasks(FakeTask.createTaskSet(1))
-    taskScheduler.submitTasks(taskSet)
Review comment: we can't have 2 active task set managers at the same time.

Review comment: Maybe we shall just give it another stageId?
     val taskSet2 = new TaskSet(
       Array(new NotSerializableFakeTask(1, 0), new NotSerializableFakeTask(0, 1)), 1, 0, 0, null)
     taskScheduler.submitTasks(taskSet2)
     taskDescriptions = taskScheduler.resourceOffers(multiCoreWorkerOffers).flatten
     assert(taskDescriptions.map(_.executorId) === Seq("executor0"))
   }

-  test("refuse to schedule concurrent attempts for the same stage (SPARK-8103)") {
Review comment: this part of the code is reverted in this PR, so remove the test as well.

Review comment: This is fine, but do we also want to add a test case to ensure the new behavior will not break?
+  test("concurrent attempts for the same stage only have one active taskset") {
     val taskScheduler = setupScheduler()
+    def isTasksetZombie(taskset: TaskSet): Boolean = {
+      taskScheduler.taskSetManagerForAttempt(taskset.stageId, taskset.stageAttemptId).get.isZombie
+    }
+
     val attempt1 = FakeTask.createTaskSet(1, 0)
-    val attempt2 = FakeTask.createTaskSet(1, 1)
     taskScheduler.submitTasks(attempt1)
-    intercept[IllegalStateException] { taskScheduler.submitTasks(attempt2) }
+    // The first submitted taskset is active
+    assert(!isTasksetZombie(attempt1))

-    // OK to submit multiple if previous attempts are all zombie
-    taskScheduler.taskSetManagerForAttempt(attempt1.stageId, attempt1.stageAttemptId)
-      .get.isZombie = true
+    val attempt2 = FakeTask.createTaskSet(1, 1)
     taskScheduler.submitTasks(attempt2)
+    // The first submitted taskset is zombie now
+    assert(isTasksetZombie(attempt1))
+    // The newly submitted taskset is active
+    assert(!isTasksetZombie(attempt2))

     val attempt3 = FakeTask.createTaskSet(1, 2)
-    intercept[IllegalStateException] { taskScheduler.submitTasks(attempt3) }
-    taskScheduler.taskSetManagerForAttempt(attempt2.stageId, attempt2.stageAttemptId)
-      .get.isZombie = true
     taskScheduler.submitTasks(attempt3)
-    assert(!failedTaskSet)
+    // The first submitted taskset remains zombie
+    assert(isTasksetZombie(attempt1))
+    // The second submitted taskset is zombie now
+    assert(isTasksetZombie(attempt2))
+    // The newly submitted taskset is active
+    assert(!isTasksetZombie(attempt3))
   }
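The rewritten test exercises the invariant this PR enforces: a stage may have at most one active (non-zombie) TaskSetManager, and submitting a new attempt retires the older ones instead of throwing. The bookkeeping can be sketched as a minimal standalone model — the `Toy*` names below are illustrative stand-ins, not Spark's actual `TaskSchedulerImpl` classes:

```scala
import scala.collection.mutable

// Toy stand-ins for TaskSet / TaskSetManager; names are hypothetical.
case class ToyTaskSet(stageId: Int, stageAttemptId: Int)

class ToyManager(val taskSet: ToyTaskSet) {
  var isZombie: Boolean = false
}

class ToyScheduler {
  private val managersByStage = mutable.Map.empty[Int, mutable.Buffer[ToyManager]]

  def submitTasks(ts: ToyTaskSet): ToyManager = {
    val managers = managersByStage.getOrElseUpdate(ts.stageId, mutable.Buffer.empty)
    // Instead of throwing IllegalStateException when an active manager already
    // exists for the stage (the pre-PR behavior), mark every previous attempt
    // as zombie before registering the new one.
    managers.foreach(_.isZombie = true)
    val m = new ToyManager(ts)
    managers += m
    m
  }

  // Number of non-zombie managers for a stage; the invariant is <= 1.
  def activeCount(stageId: Int): Int =
    managersByStage.getOrElse(stageId, mutable.Buffer.empty).count(!_.isZombie)
}
```

Under this scheme `activeCount(stageId) <= 1` holds after every submit, which is what the test above asserts through `isTasksetZombie` for attempts 1, 2, and 3.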

   test("don't schedule more tasks after a taskset is zombie") {
Review comment: If TSM3 is created just after TSM2 finished partition 10, how does TSM3 know about the finished partition 10?
Review comment: It doesn't need to know; Spark will just waste resources running unnecessary tasks. The cluster will not crash. That's why I said
Review comment: If the TSM here is for a result stage, then when TSM2 finished partition 10 and committed its output to HDFS, TSM3 would throw TaskCommitDeniedException when launching a task for partition 10. I think this is what #22806 and #23871 try to fix.
Review comment: This PR focuses on fixing the potential occurrence of "java.lang.IllegalStateException: more than one active taskSet for stage", which is described in https://issues.apache.org/jira/browse/SPARK-23433 . https://issues.apache.org/jira/browse/SPARK-25250 remains unfixed and will be addressed in #22806 or #23871 .

Note that SPARK-23433 can crash the cluster; even though #22806 or #23871 could fix it as well, we need a simple fix that can be backported to 2.3/2.4. SPARK-25250 is just a matter of wasted resources, so we can keep that fix in master only.
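The TaskCommitDeniedException mentioned in this thread comes from Spark's first-committer-wins output commit protocol. A rough sketch of that idea, under assumed names (this is a toy model, not Spark's actual OutputCommitCoordinator API):

```scala
import scala.collection.mutable

// Hypothetical first-committer-wins coordinator: the first attempt that asks
// to commit a partition wins; any other attempt asking for the same partition
// is denied, and its task fails with a commit-denied error.
class ToyCommitCoordinator {
  // partition -> attempt id of the winning committer
  private val winners = mutable.Map.empty[Int, Int]

  def canCommit(partition: Int, attemptId: Int): Boolean = synchronized {
    winners.get(partition) match {
      case Some(winner) => winner == attemptId // only the winner may (re)commit
      case None =>
        winners(partition) = attemptId
        true
    }
  }
}
```

In the scenario discussed above, TSM2's task commits partition 10 first; if a zombie-unaware TSM3 later launches another task for partition 10, that task's commit request is denied rather than producing duplicate output, which is the resource-waste (not crash) behavior described for SPARK-25250.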
Review comment: That makes sense, and the update.
Review comment: Yep, it makes sense to fix the issue that this PR addresses along with the other PRs for SPARK-25250.