[core] Fix placement groups scheduling when no resources are specified #39946
Conversation
Bug fix: #40038
Can you post the ray status output when there's a pending task that requires a placement group? (Just to confirm the output doesn't regress due to this change, since we now always add the 0.0001 bundle resources.)
Here is the output with both pending and running placement group tasks:
I think the only potential change needed is in node activity, and that can be a follow-up, as it is not super noisy.
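For reference, a minimal sketch of how to produce one running and one pending placement-group task for `ray status` to report (not from the PR; the resource sizes and the task body are illustrative):

```python
import time

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=1)

# One bundle that consumes the whole node.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def busy():
    time.sleep(600)

strategy = PlacementGroupSchedulingStrategy(placement_group=pg)

# The first task fills the bundle; the second stays pending, which is
# the state `ray status` should surface without regressing.
running = busy.options(scheduling_strategy=strategy).remote()
pending = busy.options(scheduling_strategy=strategy).remote()
```

Running `ray status` from a shell should then show the pending demand alongside the running task.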
LGTM. I think we should merge this after we merge #40038.
ping me after ^
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
…rset of Resources (#40038)

Looking at the code, we use RequiredResources and RequiredPlacementResources in various parts of the codebase for scheduling. This is generally fine, but since AllocateTaskResources in LocalTaskManager uses RequiredResources to allocate, while spillback/scheduling (GetBestSchedulableNode) uses RequiredPlacementResources, there is a potential infinite loop if we don't maintain the invariant that PlacementResources is a superset of Resources. Fix it so that we do.

Note that the surface area of the bug was actually quite small, because this invariant is already maintained when num_cpus=1.

For testing: this is already covered by test_scheduling.py in #39946.
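To make the invariant concrete, here is a minimal sketch (`ensure_placement_superset` is a hypothetical helper, not the actual C++ change) of what keeping PlacementResources a superset of Resources means:

```python
def ensure_placement_superset(resources: dict, placement_resources: dict) -> dict:
    """Hypothetical sketch: return a placement-resource map that dominates
    `resources` on every key, mirroring the invariant described above."""
    if not placement_resources:
        # With no explicit placement resources, fall back to the task's own
        # resource demands (as the actor-options diff below does).
        return dict(resources)
    merged = dict(placement_resources)
    for name, amount in resources.items():
        merged[name] = max(merged.get(name, 0.0), amount)
    return merged

# The scheduler sizes nodes by the placement view, while local allocation
# uses `resources`; the superset invariant guarantees a node chosen by the
# former can always satisfy the latter, avoiding the spillback loop.
assert ensure_placement_superset({"CPU": 1}, {}) == {"CPU": 1}
assert ensure_placement_superset({"CPU": 2}, {"CPU": 1, "GPU": 1}) == {"CPU": 2, "GPU": 1}
```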
This looks ready to merge; same failures as the stacked PR
@@ -108,7 +108,8 @@ struct ActorCreationOptions {
      max_task_retries(max_task_retries),
      max_concurrency(max_concurrency),
      resources(resources),
-     placement_resources(placement_resources),
+     placement_resources(placement_resources.empty() ? resources
+                                                      : placement_resources),
This change is already merged?
Why are these changes needed?
Before, actors scheduled with no resources into a placement group would not actually be placed with the placement group's bundle. Fix this by using the phantom bundle resource that is also used for the Ready() call as an additional placement group scheduling constraint. (See #16819 for context.)
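As an illustration of the fixed behavior (not taken from the PR; the cluster layout and names are assumptions), a zero-resource actor pinned to a placement group should now land on the bundle's node:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

# An actor with no resource requirements: before this fix it could be
# scheduled anywhere; with the phantom bundle resource as an extra
# constraint it must be co-located with bundle 0.
@ray.remote(num_cpus=0)
class ZeroResourceActor:
    def node_id(self):
        return ray.get_runtime_context().get_node_id()

actor = ZeroResourceActor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_bundle_index=0,
    )
).remote()
print(ray.get(actor.node_id.remote()))
```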
Related issue number
#34866
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.