[core] Add bundle id as a label #16819
Conversation
@@ -775,6 +773,9 @@ bool ClusterResourceScheduler::AllocateTaskResourceInstances(
      }
    }
  } else {
    // Allocation failed. Restore node's local resources by freeing the resources
    // of the failed allocation.
    FreeTaskResourceInstances(task_allocation);
@wuisawesome I think there's a bug here, but I'm not sure. Could you confirm it?
@wuisawesome in case you missed this one
@wuisawesome please verify if this makes sense. @iycheng is this related to this PR? Can we make a separate PR for the fix?
I think it's just a fix for this one, but it feels worth separating out. I'm not sure whether it's easy to add a unit test for this as well; I don't have the bandwidth for that if it's too difficult. I'm OK with rolling it back and creating an issue if you prefer.
So it seems the scenario you're thinking of is one where the allocation only contains a subset of the custom resources, right?
e.g.,
Task: custom_1: 1.0, custom_2: 1.0
Node: custom_1: 1.0
and this won't free custom_1.
I think this should be easy to unit test. If we add this in this PR, we should at least add a unit test.
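To make the scenario concrete, here is a minimal Python sketch of the rollback behavior under discussion (hypothetical names only; the actual code is the C++ AllocateTaskResourceInstances / FreeTaskResourceInstances pair shown in the hunk above):

```python
# Hypothetical, simplified model of the rollback discussed above; the names
# do not match the real C++ scheduler code.
def allocate_task_resources(node_available, task_demand):
    """Try to allocate every demanded resource; roll back on any failure."""
    allocation = {}
    for resource, amount in task_demand.items():
        if node_available.get(resource, 0.0) >= amount:
            node_available[resource] -= amount
            allocation[resource] = amount
        else:
            # Partial failure (e.g. custom_1 succeeded but custom_2 is missing):
            # free what was already taken so custom_1 does not leak from the
            # node's local view.
            for taken, taken_amount in allocation.items():
                node_available[taken] += taken_amount
            return None
    return allocation


node = {"custom_1": 1.0}
assert allocate_task_resources(node, {"custom_1": 1.0, "custom_2": 1.0}) is None
assert node == {"custom_1": 1.0}  # custom_1 was restored
```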
OK, then let me roll it back for now and create a P1 ticket for this.
@rkooo567 I think this code path will only be reached in this case:
- placement group created
- schedule with that pg (before running this function)
- placement group returned
- fail to allocate the resources for the placement group

The case you mentioned won't trigger this error, since it won't even be scheduled to this node.
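For concreteness, a hedged sketch of that ordering at the Python API level (this only illustrates the sequence of events described above; it is not a verified reproduction of the allocation failure):

```python
import ray
from ray.util import placement_group, remove_placement_group

ray.init(num_cpus=1)

@ray.remote(num_cpus=1)
def work():
    return "ok"

# 1. placement group created
pg = placement_group(bundles=[{"CPU": 1}])
ray.get(pg.ready())

# 2. a task is scheduled with that pg (its lease request may still be in flight)
ref = work.options(placement_group=pg).remote()

# 3. the placement group is returned, so
# 4. the raylet can fail to allocate the bundle's resources for the queued task.
remove_placement_group(pg)
```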
@@ -313,13 +316,6 @@ std::string NodeResources::DebugString(StringIdMap string_to_in_map) const {
  return buffer.str();
}

const std::string format_resource(std::string resource_name, double quantity) {
Remove duplicate function in scheduling_resources.h
I will leave the label until tests are actually passing. If you'd like us to see the PR before that, let us know.
Due to the special handling of the actor creation job, I only add that for the ready task. The above feature can be added later.
I think the general approach makes sense to me. Btw, did you verify with

```python
import ray
import time

ray.init()

@ray.remote
class ChildActor:
    def __init__(self):
        self.val = 1

@ray.remote
class RemoteActor:
    def __init__(self):
        self.val = 1

main_actors = []
child_actors = []
for num_children in [10, 1, 20, 10, 5] * 4:
    pg = ray.util.placement_group(bundles=[{"CPU": 1}] + [{"CPU": 0.01}] * num_children)
    while not ray.get(pg.ready()):
        pass
    main_actor = RemoteActor.options(num_cpus=1, placement_group=pg, placement_group_bundle_index=0).remote()
    main_actors.append(main_actor)
    for j in range(num_children):
        actor = ChildActor.options(
            placement_group=pg,
            num_cpus=0.01).remote()
        child_actors.append(actor)
    pg.ready()
    print("Created actor with", num_children, "children")
    time.sleep(1)
```
@@ -28,6 +28,30 @@ def method(self, x):
    return x + 2


@pytest.mark.parametrize("connect_to_client", [False, True])
def test_placement_ready(ray_start_cluster, connect_to_client):
Suggested change:
- def test_placement_ready(ray_start_cluster, connect_to_client):
+ def test_placement_ready(ray_start_regular_shared, connect_to_client):
(And remove the corresponding cluster utils stuff)
@rkooo567 it's running well with this script.
with connect_to_client_or_not(connect_to_client):
    pg = ray.util.placement_group(bundles=[{"CPU": 1}])
    ray.get(pg.ready())
    a = Actor.options(num_cpus=1, placement_group=pg).remote()
This isn't the right test, is it? Shouldn't we test with the default actor, which requires 0 CPU?
No, this PR only fixes issue #15842 here.
Hmm, I see. Can you actually add a comment here to explain what the test is doing? I think you can say:
# Test that when all bundles are occupied by the placement group, the ready task is still schedulable.
Also, I'd prefer to test the 0 CPU case, because that's the behavior change this PR introduces.
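For reference, a hedged, standalone sketch of the 0 CPU case being requested (the actor class and resource shapes are my own illustration, echoing the reviewer's framing that the default actor requires 0 CPU to run; whether this case actually works is exactly what the follow-up comments question):

```python
import ray

ray.init(num_cpus=1)

@ray.remote
class Actor:
    def method(self, x):
        return x + 2

# The bundle's only CPU is fully occupied by the placement group reservation.
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
ray.get(pg.ready())

# Default actor: per the discussion above, it needs 0 CPU to run.
a = Actor.options(placement_group=pg).remote()

# Use a timeout so the script reports the outcome either way instead of hanging.
ready, _ = ray.wait([a.method.remote(1)], timeout=10)
print("Default (0 CPU) actor schedulable inside the full bundle:", bool(ready))
```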
Oh, actually, I saw #16819 (comment).
Does that mean this still won't work for the 0 CPU actor case?
Last follow-up after sync:
- Comment what the test is doing.
- Make the bundle resource private?
Roll it back, and let's track the bug in a separate ticket: #17044
Can you follow up with #16819 (comment)? That's the last thing I'd like to discuss. Other parts LGTM.
LGTM once the last comments are addressed https://github.com/ray-project/ray/pull/16819/files#r669822319
Why are these changes needed?
Right now, we have a bug where waiting jobs will fail due to resource capacity.
This PR fixes it by adding the bundle id as an implicit label.
TODO: add this label to all jobs when the bundle is assigned.
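To make the idea concrete, a hedged, simplified model of "bundle id as an implicit label" (the resource name below is illustrative, not necessarily the exact one Ray uses internally): the placement group's ready task keys on a per-bundle label resource instead of competing for the bundle's CPU, so it stays schedulable even when the bundle is fully occupied.

```python
# Illustrative only: model a bundle that exposes an implicit per-bundle label
# resource alongside its CPU capacity, so a "ready" probe task can be scheduled
# even when the CPU in the bundle is fully reserved.
bundle_resources = {
    "CPU": 1.0,                  # fully reserved by the user's actor
    "bundle_group_abc123": 1000,  # implicit label resource (hypothetical name)
}

ready_task_request = {"bundle_group_abc123": 0.001}  # keys on the label only

def fits(available, request):
    return all(available.get(r, 0.0) >= amt for r, amt in request.items())

# The user's 1-CPU actor consumes the whole CPU slot...
bundle_resources["CPU"] -= 1.0
# ...but the ready task is still schedulable because it only needs the label.
assert fits(bundle_resources, ready_task_request)
```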
Related issue number
Closes #15842
Checks
- I've run scripts/format.sh to lint the changes in this PR.