
[core] Add bundle id as a label; #16819

Merged (28 commits) into ray-project:master on Jul 15, 2021

Conversation

@fishbone (Contributor) commented Jul 1, 2021

Why are these changes needed?

Right now, we have a bug where waiting jobs fail due to resource capacity.
This PR fixes it by adding the bundle id as an implicit label.

TODO: add this label to all jobs when the bundle is assigned.

Related issue number

Closes #15842
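
For context, here is a minimal sketch of the failure mode described in #15842 (an editorial illustration based on the issue title and the regression test added in this PR, not code from the PR itself; the Worker actor is a made-up name):

```python
import ray

# A single-CPU cluster whose only CPU is reserved by a one-bundle placement group.
ray.init(num_cpus=1)
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
ray.get(pg.ready())  # the internal ready task used to occupy the bundle's resources


@ray.remote
class Worker:  # hypothetical actor, for illustration only
    def ping(self):
        return "ok"


# Before this PR, this actor could stay pending because the bundle looked full;
# with the bundle id added as an implicit label, it schedules into the bundle.
a = Worker.options(num_cpus=1, placement_group=pg).remote()
print(ray.get(a.ping.remote()))
```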

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone changed the title from "Pl fix" to "[core] Add bundle id as a label;" on Jul 2, 2021
@fishbone fishbone marked this pull request as ready for review July 2, 2021 06:58
@@ -775,6 +773,9 @@ bool ClusterResourceScheduler::AllocateTaskResourceInstances(
}
}
} else {
// Allocation failed. Restore node's local resources by freeing the resources
// of the failed allocation.
FreeTaskResourceInstances(task_allocation);
Contributor Author:
@wuisawesome I think there's a bug here, but I'm not sure; could you confirm?

Contributor Author:
@wuisawesome in case you missed this one

Contributor:
@wuisawesome please verify if this makes sense. @iycheng is this related to this PR? Can we make a separate PR for the fix?

Contributor Author:
I think it's just a fix for this one, but it feels worth separating it out. I'm not sure whether it's easy to add a unit test for this as well; I lack the bandwidth for that if it's too difficult. I'm OK with rolling it back and creating an issue if you prefer.

Contributor:
So it seems the scenario you're thinking of is when the allocation only contains a subset of the custom resources, right?

e.g.,

Task: custom_1: 1.0, custom_2: 1.0

Node: custom_1: 1.0

and this won't free custom_1.

I think this should be easy to unit test. If we add this fix in this PR, I think we should at least add a unit test.

Contributor Author:
OK, then let me roll it back for now and create a P1 ticket for this.

Contributor Author:
@rkooo567 I think this code path will only be reached in this case:

  • a placement group is created
  • a task is scheduled with that pg (before running this function)
  • the placement group is returned
  • allocating the resources for the placement group fails

The case you mentioned won't trigger this error, since such a task won't even be scheduled to this node.

@@ -313,13 +316,6 @@ std::string NodeResources::DebugString(StringIdMap string_to_in_map) const {
return buffer.str();
}

const std::string format_resource(std::string resource_name, double quantity) {
Contributor Author:
Remove duplicate function in scheduling_resources.h

@rkooo567 rkooo567 self-assigned this Jul 2, 2021
@rkooo567 (Contributor) commented Jul 6, 2021

I will leave the label on until the tests are actually passing. If you'd like us to look at the PR before that, let us know.

@rkooo567 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send it back to the reviewer) on Jul 6, 2021
@fishbone (Contributor, Author) commented Jul 7, 2021

> I will leave the label on until the tests are actually passing. If you'd like us to look at the PR before that, let us know.

@rkooo567 It would be good if you could check whether this solution is a reasonable one, since it changes the behavior of the current implementation.

Before: if the resource requirement is 0, the task will be scheduled on any node that fits, even when a placement group bundle is given.

After: even when the resource requirement is 0, if a placement group is given, the task will only be scheduled on the node allocated for that PG.

Due to the special handling of actor creation jobs, I only added this for the ready task; the rest can be added later.
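
To make the before/after concrete, here is a hedged editorial sketch (zero_resource_task is a made-up example; per the comment above, this PR only wires the new constraint up for the internal ready task, so this illustrates the intended semantics rather than behavior this PR guarantees for arbitrary tasks):

```python
import ray

ray.init()

# A placement group whose single bundle reserves 1 CPU somewhere in the cluster.
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
ray.get(pg.ready())


@ray.remote(num_cpus=0)
def zero_resource_task():
    # Hypothetical zero-resource task, used only to illustrate scheduling semantics.
    return "ok"


# Before: a zero-resource task given placement_group=pg could run on any node that
# fits, ignoring where the bundle actually lives.
# After: when a placement group is given, it should only run on the node allocated
# for that PG.
ray.get(zero_resource_task.options(placement_group=pg).remote())
```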

@fishbone removed the @author-action-required label on Jul 12, 2021
@rkooo567 (Contributor) commented:
I think the general approach makes sense to me. BTW, did you verify with the following script?

import ray
import time

ray.init()

@ray.remote
class ChildActor:
    def __init__(self):
        self.val = 1


@ray.remote
class RemoteActor:
    def __init__(self):
        self.val = 1

main_actors = []
child_actors = []

for num_children in [10, 1, 20, 10, 5] * 4:
    pg = ray.util.placement_group(bundles=[{"CPU": 1}] + [{"CPU": 0.01}] * num_children)
    while not ray.get(pg.ready()):
        pass

    main_actor = RemoteActor.options(num_cpus=1, placement_group=pg, placement_group_bundle_index=0).remote()
    main_actors.append(main_actor)

    for j in range(num_children):
        actor = ChildActor.options(
            placement_group=pg,
            num_cpus=0.01).remote()
        child_actors.append(actor)
    pg.ready()

    print("Created actor with", num_children, "children")
    time.sleep(1)

@@ -28,6 +28,30 @@ def method(self, x):
return x + 2


@pytest.mark.parametrize("connect_to_client", [False, True])
def test_placement_ready(ray_start_cluster, connect_to_client):
Contributor:
Suggested change
def test_placement_ready(ray_start_cluster, connect_to_client):
def test_placement_ready(ray_start_regular_shared, connect_to_client):

Contributor:
(And remove the corresponding cluster utils stuff)

@fishbone (Contributor, Author):
@rkooo567 it runs fine with this script.

Resolved review threads: src/ray/common/task/scheduling_resources.h, src/ray/raylet/placement_group_resource_manager_test.cc
with connect_to_client_or_not(connect_to_client):
    pg = ray.util.placement_group(bundles=[{"CPU": 1}])
    ray.get(pg.ready())
    a = Actor.options(num_cpus=1, placement_group=pg).remote()
Contributor:
This isn't the right test, is it? Shouldn't we test with the default actor that requires 0 CPU?

Contributor Author:
No, this PR only fixes issue #15842 here.

Contributor:
Hmm, I see. Can you add a comment here to explain what the test is doing? I think you can say:

# Test that when the whole bundle is occupied by the placement group, the ready task is still schedulable.

Also, I'd prefer to test the 0-CPU case, because that's the behavior change this PR introduces.

Contributor:
Oh actually, I saw #16819 (comment).

Does that mean this still won't work for the 0-CPU actor case?

Contributor:
Last follow-up after our sync:

  • Comment what the test is doing.
  • Make the bundle resource private?

@fishbone (Contributor, Author):
Rolled it back; let's track the bug in a separate ticket: #17044.

@rkooo567 (Contributor) left a comment:

Can you follow up on #16819 (comment)? That's the last thing I'd like to discuss. The other parts LGTM.

@rkooo567 added the @author-action-required label on Jul 13, 2021
@rkooo567 rkooo567 self-requested a review July 14, 2021 17:42
@rkooo567 (Contributor) left a comment:
LGTM once the last comments are addressed https://github.com/ray-project/ray/pull/16819/files#r669822319

@fishbone removed the @author-action-required label on Jul 15, 2021
@rkooo567 rkooo567 merged commit 1386762 into ray-project:master Jul 15, 2021
rkooo567 pushed a commit that referenced this pull request on Oct 19, 2023 (#39946):

Before, actors scheduled with no resources and placement groups would not be placed with the placement group or bundle. Fix this by using the phantom bundle resource that is also used for the Ready() call as an additional placement group constraint. (See #16819 for context)
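
As a hedged, editorial sketch of the user-visible scenario that commit addresses (ZeroResourceActor is a made-up name, and the PlacementGroupSchedulingStrategy shown is the newer scheduling-strategy API; this is not code from that commit):

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

pg = ray.util.placement_group(bundles=[{"CPU": 1}])
ray.get(pg.ready())


@ray.remote(num_cpus=0)
class ZeroResourceActor:  # hypothetical actor, for illustration only
    def ping(self):
        return "ok"


# An actor that requests no resources but is attached to the placement group.
# Per the commit message, it previously was not guaranteed to be placed with the
# PG's bundle; with the phantom bundle resource it is constrained to that bundle.
a = ZeroResourceActor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_bundle_index=0)
).remote()
ray.get(a.ping.remote())
```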
Development

Successfully merging this pull request may close these issues.

[core/placement groups] placement group ready() occupies resources, blocks usage
3 participants