[SPARK-26431][CORE] Update availableSlots by availableCpus for Barrier TaskSet #23375
Conversation
ping @jiangxb1987, please take a look, thanks.

Can one of the admins verify this patch?
The lines under review:

```scala
 */
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {

  def availableSlots(availableCpus: Array[Int]): Int = {
```
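The helper in the diff can be read as "sum the per-executor slot counts". A minimal self-contained sketch, assuming one slot per `CPUS_PER_TASK` free cores; the object name and the constant are illustrative stand-ins (the real scheduler reads `spark.task.cpus` from the configuration), not the exact Spark source:

```scala
object SlotsSketch {
  // Hypothetical stand-in for spark.task.cpus; illustrative only.
  val CPUS_PER_TASK: Int = 1

  // Each executor contributes (free cores / CPUS_PER_TASK) slots;
  // summing over all offers gives the cluster-wide slot total.
  def availableSlots(availableCpus: Array[Int]): Int =
    availableCpus.map(_ / CPUS_PER_TASK).sum
}
```

For example, two executors with 3 and 5 free cores yield 8 slots at one CPU per task.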
The result is not always correct; please refer to test("don't schedule for a barrier taskSet if available slots are less than pending tasks").
What do you mean by "The result is not always correct"? The way of calculating the number of slots?
Yes, please refer to the test case above.
I had seen and tested that suite before submitting this PR, and I follow the original way of calculating the slots. What's the difference?
…` before checking available slots for barrier taskSet

### What changes were proposed in this pull request?

availableSlots is computed before the for loop that iterates over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as slots are taken in each one. The number of available slots checked by a barrier task set therefore has to be recomputed in every iteration from availableCpus.

### Why are the changes needed?

Bugfix. Without it, resourceOffers could attempt to start a barrier task set even though it does not have enough slots available. That would then be caught by the `require` in line 519, which throws an exception; the exception is caught and ignored by Dispatcher's MessageLoop, so nothing terrible happens, but it prevents resourceOffers from considering further TaskSets. Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and can still be rolled back by the exception thrown in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix should at least resolve the case where the set cannot launch because of insufficient slots.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #23375
Closes #25946 from juliuszsompolski/SPARK-29263.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
(cherry picked from commit 420abb4)
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
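The loop structure described above can be sketched as follows. This is a hedged, simplified model of the scheduling loop, not the actual Spark scheduler code: `TaskSet`, `schedulable`, and `CPUS_PER_TASK` are illustrative names, and the allocation is a naive stand-in. The point it demonstrates is the fix itself: the slot count is derived from `availableCpus` inside the loop, so a barrier task set sees the CPUs already consumed by earlier task sets.

```scala
object BarrierSlotsSketch {
  // Hypothetical constant standing in for spark.task.cpus.
  val CPUS_PER_TASK: Int = 1

  // Minimal stand-in for a TaskSetManager: a name, a task count, a barrier flag.
  case class TaskSet(name: String, numTasks: Int, isBarrier: Boolean)

  // Simplified resourceOffers loop over task sets. availableCpus shrinks as
  // earlier task sets launch tasks, so the barrier slot check is per-iteration.
  def schedulable(taskSets: Seq[TaskSet], availableCpus: Array[Int]): Seq[String] = {
    val launched = Seq.newBuilder[String]
    for (ts <- taskSets) {
      // The fix: recompute the slot count from availableCpus every iteration.
      // A value computed once before the loop would be stale by now.
      val availableSlots = availableCpus.map(_ / CPUS_PER_TASK).sum
      if (!ts.isBarrier || availableSlots >= ts.numTasks) {
        // Naive allocation: consume CPUS_PER_TASK cores per task, in offer order.
        var remaining = ts.numTasks * CPUS_PER_TASK
        for (i <- availableCpus.indices if remaining > 0) {
          val take = math.min(availableCpus(i), remaining)
          availableCpus(i) -= take
          remaining -= take
        }
        launched += ts.name
      }
    }
    launched.result()
  }
}
```

With a single 4-core offer, a regular 3-task set launches first and leaves one free core, so a 2-task barrier set is correctly skipped; a stale slot count computed before the loop would have reported 4 slots and let the barrier set through.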
### What changes were proposed in this pull request?

availableCpus decreases as tasks are allocated, so we should update availableSlots from availableCpus for a barrier TaskSet to avoid an unnecessary resourceOffer process.

### How was this patch tested?

Existing tests.