
Updating zero capacity resource semantics #4555

Merged

Conversation

romilbhardwaj
Member

What do these changes do?

This PR updates the resource quantity semantics. From this PR onwards, a resource with capacity 0 implies that the resource does not exist, and consequently no data structure should store a resource with zero capacity.
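The new invariant can be sketched as follows; `set_resource` and the dictionary-based resource map are hypothetical illustrations for this description, not the PR's actual C++ code:

```python
# Hypothetical sketch of the invariant this PR introduces: a capacity of
# zero means the resource does not exist, so updates delete the entry
# instead of storing a zero.

def set_resource(resource_map, name, capacity):
    """Update `name` in `resource_map`, never storing a zero capacity."""
    if capacity == 0:
        resource_map.pop(name, None)  # zero capacity => resource removed
    else:
        resource_map[name] = capacity
    return resource_map

print(set_resource({"CPU": 4.0}, "GPU", 2.0))  # {'CPU': 4.0, 'GPU': 2.0}
print(set_resource({"CPU": 4.0}, "CPU", 0))    # {}
```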

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

*value = resource_capacity_.at(resource_name);
return true;
double capacity = resource_capacity_.at(resource_name);
RAY_CHECK(capacity > 0);
Member Author

Should we be checking for float precision errors here? Something like `RAY_CHECK(capacity > 0 - std::numeric_limits<double>::epsilon())`?

Collaborator

I think either way is fine. If it's not causing any issues right now, then we can leave it. @williamma12 is fixing this properly in a separate PR.
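To make the precision concern concrete, here is a minimal Python illustration (the check under review is C++, so the syntax differs); it shows why a strict sign check can misclassify a value that is logically zero:

```python
import sys

# Arithmetic that should land exactly on zero can come out slightly
# negative in floating point, which a strict `capacity > 0` (or `>= 0`)
# check mishandles.

capacity = 0.3
for _ in range(3):
    capacity -= 0.1        # logically 0.3 - 0.1 - 0.1 - 0.1 == 0

print(capacity)            # a tiny negative number, not 0.0
print(capacity >= 0)       # False, even though it is "really" zero

# The tolerance-based check suggested in the comment above:
eps = sys.float_info.epsilon
print(capacity > 0 - eps)  # True: within one epsilon of zero
```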

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13487/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13504/
Test FAILed.

Contributor

@raulchen left a comment

Thanks for this PR. Looks good to me overall. But @robertnishihara knows better about this code. I'll let him give the approval.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13541/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13555/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13566/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13568/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13600/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13620/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13632/
Test FAILed.

@romilbhardwaj
Member Author

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13634/
Test FAILed.

except Exception:
    # TODO(rliaw): Remove this when local mode is fixed.
    # https://github.com/ray-project/ray/issues/4147
    logger.debug("Using resources for local machine.")
    resources = ray.services.check_and_update_resources(
        None, None, None)
if not resources:
Member Author

@richardliaw We're changing resource semantics - zero-capacity resources are no longer included in resource data structures. For instance, instead of returning {resource: 0}, we now return an empty dictionary.

To make tests work, I've made some changes to resource handling in tune - can you please check if this looks okay?
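A minimal sketch of the behavior change described above (the function name is hypothetical; the real pruning happens inside Ray's resource handling):

```python
# Zero-capacity entries are dropped from resource dicts, so callers may
# now receive an empty dict where they previously saw {resource: 0}.

def prune_zero_resources(resources):
    """Return a copy of `resources` without zero-capacity entries."""
    return {name: cap for name, cap in resources.items() if cap != 0}

print(prune_zero_resources({"CPU": 0}))            # {} instead of {'CPU': 0}
print(prune_zero_resources({"CPU": 0, "GPU": 1}))  # {'GPU': 1}
```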

Contributor

I pushed a small tweak - thanks for letting me know!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13644/
Test FAILed.

@romilbhardwaj
Member Author

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13646/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13647/
Test FAILed.

@robertnishihara
Collaborator

@richardliaw @romilbhardwaj looks like the Jenkins tune test is still failing. `resources` can be `{}`, so we are hitting the `if not resources: raise(...)` code path.
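This bites because an empty dict is falsy in Python; a minimal illustration:

```python
# Once zero-capacity entries are pruned, `resources` can be an empty
# dict, and `if not resources:` then takes the error branch because
# an empty dict is falsy.

resources = {"CPU": 0}
pruned = {k: v for k, v in resources.items() if v != 0}

print(pruned)        # {}
print(bool(pruned))  # False
if not pruned:
    print("the raise(...) branch would run here")
```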

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13653/
Test FAILed.

@romilbhardwaj force-pushed the scheduling-res-zerocap branch from 3fe4ff5 to 04d23b6 on April 10, 2019 06:33
retry_count = 0

while resources and retry_count < 5:
    time.sleep(0.1)
Collaborator

This is probably fine, but 5 retries may not be enough in an environment like Travis, so let's keep an eye on whether this test becomes flaky, and increase the count if it starts failing.
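The retry pattern under discussion can be sketched like this (hypothetical helper; the real test polls Ray's internal resource state):

```python
import time

# Poll until `get_resources()` returns an empty collection, giving up
# after a bounded number of retries.

def wait_for_empty(get_resources, retries=5, delay=0.1):
    """Return True if the resources drained within `retries` polls."""
    retry_count = 0
    while get_resources() and retry_count < retries:
        time.sleep(delay)
        retry_count += 1
    return not get_resources()

# Simulate work completing: each poll removes one pending item.
state = {"pending": [3, 2, 1]}
def get_resources():
    if state["pending"]:
        state["pending"].pop()
    return state["pending"]

print(wait_for_empty(get_resources, delay=0.01))  # True once the list drains
```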

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13687/
Test FAILed.

Fix typo

More fixes.

Updates to python functions

debug statements

Rounding error fixes, removing cpu addition in cython and test fixes.

linting

Fix worker pool test

python linting
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13724/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13725/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13744/
Test FAILed.

@jovany-wang force-pushed the scheduling-res-zerocap branch from 79d46e0 to 74f647a on April 12, 2019 04:17
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13756/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13757/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13758/
Test FAILed.


public BaseTaskOptions() {
  resources = new HashMap<>();
}

public BaseTaskOptions(Map<String, Double> resources) {
  for (Map.Entry<String, Double> entry : resources.entrySet()) {
    if (entry.getValue().compareTo(0.0) <= 0) {
      throw new RuntimeException(String.format("Resource capacity should be positive, " +
Contributor

Let's use IllegalArgumentException here
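For illustration, a Python analog of the suggested validation (the actual fix is in the Java `BaseTaskOptions` constructor, where `IllegalArgumentException` plays the role `ValueError` plays here):

```python
# Hypothetical Python sketch: validate constructor arguments and raise
# the argument-specific exception type instead of a generic one.

class BaseTaskOptions:
    def __init__(self, resources=None):
        self.resources = {}
        for name, capacity in (resources or {}).items():
            if capacity <= 0:
                raise ValueError(
                    "Resource capacity should be positive, "
                    "got {} = {}".format(name, capacity))
            self.resources[name] = capacity

try:
    BaseTaskOptions({"CPU": 0.0})
except ValueError as e:
    print("rejected:", e)
```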

CallOptions callOptions3 = new CallOptions(ImmutableMap.of("CPU", 0.0));
Assert.fail();
} catch (RuntimeException e) {
// We should receive a RuntimeException indicate that we should pass a zero capacity resource.
Contributor

Suggested change
// We should receive a RuntimeException indicate that we should pass a zero capacity resource.
// We should receive a RuntimeException indicating that we should not pass a zero-capacity resource.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13766/
Test FAILed.

@robertnishihara merged commit 0f42f87 into ray-project:master on Apr 12, 2019
@romilbhardwaj deleted the scheduling-res-zerocap branch on April 13, 2019 00:05
@ls-daniel
Contributor

I believe that this commit may be breaking autoscaling.

With a head node with num-cpus=1, this commit results in the autoscaler LoadMetrics reporting an empty dict for both static and dynamic resources once 1 job is scheduled on the head node.

This means that the head node never scales up.

@robertnishihara
Collaborator

@ls-daniel thanks for reporting this. We'll look into it.

@romilbhardwaj restored the scheduling-res-zerocap branch on April 17, 2019 22:36
romilbhardwaj added a commit to romilbhardwaj/ray that referenced this pull request Apr 17, 2019
@romilbhardwaj mentioned this pull request on Apr 18, 2019
1 task
ls-daniel added a commit to longshotsyndicate/ray that referenced this pull request Apr 18, 2019
devin-petersohn added a commit that referenced this pull request Apr 18, 2019
ls-daniel pushed a commit to longshotsyndicate/ray that referenced this pull request May 7, 2019
robertnishihara pushed a commit that referenced this pull request May 8, 2019
ls-daniel pushed a commit to longshotsyndicate/ray that referenced this pull request Jul 17, 2019
Edilmo added a commit to BonsaiAI/ray that referenced this pull request Feb 10, 2020
This PR contains changes that help with memory issues
ray-project#4555