Fix releasing CPUs incorrectly when actor creation task blocked. #5271

Merged

Conversation

jovany-wang
Contributor

The following test can't pass before this PR:

    import ray

    ray.init(num_cpus=2)

    @ray.remote(num_cpus=1)
    def get_100():
        import time
        time.sleep(2)
        return 100

    @ray.remote(num_cpus=1)
    class A(object):
        def __init__(self):
            self.num = ray.get(get_100.remote())

        def get_num(self):
            return self.num

    a = A.remote()
    assert 100 == ray.get(a.get_num.remote())
    assert 1 == ray.available_resources()["CPU"]
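
For context, the failure comes from how CPUs are released and re-acquired around the blocking ray.get inside the actor constructor. The toy model below is only an illustrative sketch of that bookkeeping (the CpuPool class and the call sequence are assumptions, not the raylet implementation) showing why exactly 1 of the 2 CPUs should remain available once the actor is created:

    class CpuPool(object):
        """Toy model of available-CPU bookkeeping (illustrative only)."""

        def __init__(self, total):
            self.available = total

        def acquire(self, n):
            assert self.available >= n, "CPUs over-subscribed"
            self.available -= n

        def release(self, n):
            self.available += n

    pool = CpuPool(total=2)

    pool.acquire(1)  # actor creation task (A.__init__) starts and holds 1 CPU
    pool.release(1)  # __init__ blocks in ray.get, so its CPU is temporarily released
    pool.acquire(1)  # get_100 is scheduled and takes 1 CPU
    pool.release(1)  # get_100 finishes and returns its CPU
    pool.acquire(1)  # __init__ unblocks and re-acquires its CPU

    # The actor keeps holding 1 of the 2 CPUs after creation, so exactly 1 CPU
    # should remain available, matching the final assertion in the test above.
    assert pool.available == 1

If the release/re-acquire around the blocked actor creation task is mis-counted, the available-CPU count ends up wrong, which is the symptom the test above catches.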

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1816/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15642/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15643/

@raulchen
Contributor

@romilbhardwaj I think this bug was introduced by the dynamic resources change. Can you take a look?

@raulchen
Contributor

@robertnishihara do you know why we need local_available_resources_ when we already have cluster_resource_map_? Why not just use cluster_resource_map_[local_node_id]?

@romilbhardwaj
Member

romilbhardwaj left a comment

Thanks for submitting this PR. The change in node_manager.cc makes sense. However, I'm a bit confused about the test - please see my comment there.

@romilbhardwaj
Member

@robertnishihara do you know why we need local_available_resources_ when we already have cluster_resource_map_? Why not just use cluster_resource_map_[local_node_id]?

I think cluster_resource_map_ is a simple map of resource types and quantities used for making scheduling decisions, while local_available_resources_ handles resource ID allocation and release at the local raylet and is used for resource ID bookkeeping. Maybe we could store resource IDs in cluster_resource_map_ and use a single structure for both scheduling and bookkeeping. @robertnishihara would be able to provide better insight.

@jovany-wang
Contributor Author

jovany-wang commented Jul 26, 2019

@robertnishihara do you know why we need local_available_resources_ when we already have cluster_resource_map_? Why not just use cluster_resource_map_[local_node_id]?

I think cluster_resource_map_ is a simple map of resource types and quantities used for making scheduling decisions, while local_available_resources_ handles resource ID allocation and release at the local raylet and is used for resource ID bookkeeping. Maybe we could store resource IDs in cluster_resource_map_ and use a single structure for both scheduling and bookkeeping. @robertnishihara would be able to provide better insight.

If we can remove local_available_resources_, I believe things will get simpler and clearer.

@robertnishihara
Collaborator

@robertnishihara do you know why we need local_available_resources_ when we already have cluster_resource_map_? Why not just use cluster_resource_map_[local_node_id]?

@raulchen, the reason is that we do fine-grained bookkeeping for the local resources (we keep track of resource IDs), whereas for cluster resources, we only keep track of the aggregate resource quantities. There's probably a better way to do it though.
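
To make the distinction concrete, here is a rough Python sketch of the two kinds of bookkeeping being discussed. It is only an assumption about the shape of the data: the real raylet structures are C++, and every name here (e.g. assign_local_cpu) is a hypothetical illustration rather than Ray's API.

    # cluster_resource_map_-style view: aggregate quantities per node, enough
    # to answer scheduling questions such as "does this node have 1 free CPU?".
    cluster_resource_map = {
        "local_node": {"CPU": 1.0},
        "remote_node": {"CPU": 8.0, "GPU": 2.0},
    }

    # local_available_resources_-style view: the concrete resource IDs still
    # free on the local node, needed to tell a worker *which* CPU it was
    # assigned and to give back exactly those IDs when the task blocks or exits.
    local_available_cpu_ids = {1}  # CPU id 0 is currently assigned to a worker

    def assign_local_cpu():
        """Hand out one concrete CPU id and keep both views in sync."""
        cpu_id = local_available_cpu_ids.pop()
        cluster_resource_map["local_node"]["CPU"] -= 1
        return cpu_id

    def return_local_cpu(cpu_id):
        """Take a concrete CPU id back and keep both views in sync."""
        local_available_cpu_ids.add(cpu_id)
        cluster_resource_map["local_node"]["CPU"] += 1

Keeping the quantity view and the ID view consistent on every acquire and release is exactly the kind of bookkeeping that has to be handled carefully when a blocked task releases and later re-acquires its CPUs.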

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15724/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15723/

@jovany-wang jovany-wang merged commit 1465a30 into ray-project:master Jul 28, 2019
@jovany-wang jovany-wang deleted the fix-release-cpu-when-bolcking branch July 28, 2019 07:46
edoakes pushed a commit to edoakes/ray that referenced this pull request Aug 9, 2019