
[core] Fix placement groups scheduling when no resources are specified #39946

Merged (16 commits) on Oct 19, 2023

Conversation

vitsai (Contributor) commented Sep 28, 2023

Before, actors scheduled with no resources and placement groups would not be placed with the placement group or bundle. Fix this by using the phantom bundle resource that is also used for the Ready() call as an additional placement group constraint. (See #16819 for context)
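The idea behind the fix can be sketched in plain Python. This is a simplified model, not Ray's actual scheduler code; the resource name format is hypothetical, and the 0.0001 amount follows the "phantom" bundle resource amount discussed later in this thread:

```python
# Simplified model of the fix: even a zero-resource actor scheduled into a
# placement group must request a tiny amount of the bundle's phantom
# resource, so the scheduler is forced to place it on the bundle's node.

PHANTOM_AMOUNT = 0.0001  # same token amount used for the Ready() check


def placement_constraints(resources, pg_id, bundle_index):
    """Return the resource demand actually sent to the scheduler.

    `resources` may be empty (e.g. a num_cpus=0 actor). Before the fix, an
    empty demand matched any node; after the fix, the phantom bundle
    resource is always added as an extra placement constraint.
    """
    demand = dict(resources)
    # Hypothetical name format for illustration only; Ray uses an internal
    # "bundle_group_..." resource per placement group bundle.
    demand[f"bundle_group_{bundle_index}_{pg_id}"] = PHANTOM_AMOUNT
    return demand


# A zero-resource actor still carries a placement-group constraint,
# while a CPU actor keeps its CPU demand plus the constraint.
assert placement_constraints({}, "abc123", 0) == {"bundle_group_0_abc123": 0.0001}
assert placement_constraints({"CPU": 1.0}, "abc123", 0)["CPU"] == 1.0
```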

Why are these changes needed?

Related issue number

#34866

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
vitsai (Contributor, Author) commented Oct 3, 2023

Bug fix: #40038

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai vitsai requested a review from a team as a code owner October 3, 2023 16:45
rkooo567 (Contributor) left a comment:

Can you post the `ray status` output when there's a pending task that requires a placement group? (Just to check that the output doesn't regress, since with this change we always add the 0.0001 bundle resource.)

Review threads (resolved):
python/ray/tests/test_placement_group.py (4 outdated threads)
src/ray/common/bundle_spec.cc
@rkooo567 added the @author-action-required label Oct 4, 2023
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai removed the @author-action-required label Oct 5, 2023
vitsai (Contributor, Author) commented Oct 5, 2023

Here is the output with both pending and running placement group tasks:

======== Autoscaler status: 2023-10-05 07:13:09.740863 ========
GCS request time: 0.000660s

Node status
---------------------------------------------------------------
Active:
 1 node_7dedc912c22c07df0fd49dc5b6e8450472310ad4041ddd2540e519dc
Idle:
 (no idle nodes)
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 4.0/36.0 CPU (4.0 used of 8.0 reserved in placement groups)
 0B/35.23GiB memory
 0B/17.61GiB object_store_memory

Total Demands:
 {'CPU': 1.0}: 6+ pending tasks/actors (6+ using placement groups)

Node: 7dedc912c22c07df0fd49dc5b6e8450472310ad4041ddd2540e519dc
 Usage:
  4.0/36.0 CPU (4.0 used of 8.0 reserved in placement groups)
  0B/35.23GiB memory
  0B/17.61GiB object_store_memory
 Activity:
  Resource: CPU currently in use.
  Busy workers on node.
  Resource: CPU_group_0_38c7049e01961594ddbd48ae247806000000 currently in use.
  Resource: CPU_group_38c7049e01961594ddbd48ae247806000000 currently in use.

I think the only potential change needed is in the node activity section, and that can be a follow-up since it is not super noisy.

rkooo567 (Contributor) left a comment:

LGTM. I think we should merge this after we merge #40038.

@rkooo567 added the @author-action-required label Oct 10, 2023
rkooo567 (Contributor) commented:

ping me after ^

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai removed the @author-action-required label Oct 12, 2023
vitsai added 10 commits October 12, 2023 22:26
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
rkooo567 pushed a commit that referenced this pull request Oct 19, 2023
…rset of Resources (#40038)

Looking at the code, we use RequiredResources and RequiredPlacementResources in various parts of the codebase for scheduling. This is generally fine, but since AllocateTaskResources in LocalTaskManager uses RequiredResources to allocate, and spillback/scheduling (GetBestSchedulableNode) uses RequiredPlacementResources, there is a potential infinite loop if we don't maintain the invariant of PlacementResources being a superset of Resources.

Fix it so that we do. Note that the surface area of the bug was actually quite small, because this invariant already holds in the common num_cpus=1 case.
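The invariant can be stated as a simple predicate. This is a hypothetical Python helper for illustration, not the actual C++ code in the scheduler:

```python
def is_superset(placement_resources, resources):
    """True iff every allocation demand is covered by the placement demand.

    Scheduling (GetBestSchedulableNode) consults placement_resources while
    allocation (AllocateTaskResources) consults resources. If this predicate
    is False, a node can be chosen that cannot actually allocate the task,
    and the task bounces between scheduling and allocation indefinitely.
    """
    return all(
        placement_resources.get(name, 0.0) >= amount
        for name, amount in resources.items()
    )


# Holds trivially in the common num_cpus=1 case:
assert is_superset({"CPU": 1.0}, {"CPU": 1.0})
# The bug: a resource present in `resources` but missing from
# `placement_resources` breaks the invariant.
assert not is_superset({"CPU": 1.0}, {"CPU": 1.0, "GPU": 1.0})
```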

For testing:

This is already tested by test_scheduling.py in #39946
vitsai (Contributor, Author) commented Oct 19, 2023

This looks ready to merge; the failures are the same as in the stacked PR.

@@ -108,7 +108,8 @@ struct ActorCreationOptions {
max_task_retries(max_task_retries),
max_concurrency(max_concurrency),
resources(resources),
placement_resources(placement_resources),
placement_resources(placement_resources.empty() ? resources
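In Python terms, the diff above makes the placement resources default to the task's resources when none are specified. This is an illustrative analogue of the C++ ternary, not a real Ray API:

```python
def resolve_placement_resources(resources, placement_resources):
    """Mirror `placement_resources.empty() ? resources : placement_resources`.

    An unspecified placement demand falls back to the allocation demand,
    which keeps PlacementResources a superset of Resources by construction.
    """
    return dict(placement_resources) if placement_resources else dict(resources)


# Unspecified placement resources inherit the allocation demand:
assert resolve_placement_resources({"CPU": 1.0}, {}) == {"CPU": 1.0}
# An explicit placement demand is kept as-is:
assert resolve_placement_resources({"CPU": 1.0}, {"CPU": 2.0}) == {"CPU": 2.0}
```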
A collaborator commented:
This change is already merged?

@rkooo567 rkooo567 merged commit af332f4 into ray-project:master Oct 19, 2023
vitsai added a commit that referenced this pull request Oct 19, 2023
aslonnie pushed a commit that referenced this pull request Oct 20, 2023
vitsai added a commit that referenced this pull request Oct 20, 2023
…ces are specified (#39946)" (#40506)"

This reverts commit 66cfec9.

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
jjyao pushed a commit that referenced this pull request Feb 23, 2024
…#39946 (#43269)

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Co-authored-by: vitsai <vitsai@cs.stanford.edu>
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>