-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating zero capacity resource semantics #4555
Updating zero capacity resource semantics #4555
Conversation
*value = resource_capacity_.at(resource_name); | ||
return true; | ||
double capacity = resource_capacity_.at(resource_name); | ||
RAY_CHECK(capacity > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we checking for float precision errors here? Something like RAY_CHECK(capacity > 0 - std::numeric_limits<double>::epsilon())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either way is fine. If it's not causing any issues right now, then we can leave it. @williamma12 is fixing this properly in a separate PR.
Can one of the admins verify this patch? |
Test FAILed. |
2aa7e80
to
b29bbcc
Compare
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR. Looks good to me overall. But @robertnishihara knows better about this code. I'll let him give the approval.
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
jenkins retest this please |
Test FAILed. |
except Exception: | ||
# TODO(rliaw): Remove this when local mode is fixed. | ||
# https://github.com/ray-project/ray/issues/4147 | ||
logger.debug("Using resources for local machine.") | ||
resources = ray.services.check_and_update_resources( | ||
None, None, None) | ||
if not resources: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardliaw We're changing resource semantics - zero capacity resources are now not included in resource datastructures. For instance instead of returning {resource: 0} we now return an empty dictionary.
To make tests work, I've made some changes to resource handling in tune - can you please check if this looks okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a small tweak - thanks for letting me know!
Test FAILed. |
jenkins retest this please |
Test FAILed. |
Test FAILed. |
@richardliaw @romilbhardwaj looks like the Jenkins tune test is still failing. |
Test FAILed. |
3fe4ff5
to
04d23b6
Compare
retry_count = 0 | ||
|
||
while resources and retry_count < 5: | ||
time.sleep(0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fine, but 5 retries may not be enough in an environment like Travis, so let's keep an eye on whether this test becomes flaky or not and if it starts failing then increase this.
Test FAILed. |
Fix typo More fixes. Updates to python functions debug statements Rounding error fixes, removing cpu addition in cython and test fixes. linting Fix worker pool test python linting
Test FAILed. |
Test FAILed. |
Test FAILed. |
79d46e0
to
74f647a
Compare
Test FAILed. |
Test FAILed. |
Test FAILed. |
|
||
public BaseTaskOptions() { | ||
resources = new HashMap<>(); | ||
} | ||
|
||
public BaseTaskOptions(Map<String, Double> resources) { | ||
for (Map.Entry<String, Double> entry : resources.entrySet()) { | ||
if (entry.getValue().compareTo(0.0) <= 0) { | ||
throw new RuntimeException(String.format("Resource capacity should be positive, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use IllegalArgumentException
here
CallOptions callOptions3 = new CallOptions(ImmutableMap.of("CPU", 0.0)); | ||
Assert.fail(); | ||
} catch (RuntimeException e) { | ||
// We should receive a RuntimeException indicate that we should pass a zero capacity resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// We should receive a RuntimeException indicate that we should pass a zero capacity resource. | |
// We should receive a RuntimeException indicates that we should not pass a zero capacity resource. |
Test FAILed. |
I believe that this commit may be breaking autoscaling. With a head node with num-cpus=1, this commit results in the autoscaler LoadMetrics reporting an empty dict for both static and dynamic resouces once 1 job is scheduled on the head node. This means that the head node never scales up. |
@ls-daniel thanks for reporting this. We'll look into it. |
This reverts commit 0f42f87.
This reverts commit 0f42f87.
This PR contains changes that help with memory issues ray-project#4555
What do these changes do?
This PR updates the resource quantity semantics. This PR onwards, a resource with capacity 0 implies it resource does not exist, and consequently no data structure must store a resource with capacity zero.
Linter
scripts/format.sh
to lint the changes in this PR.