Limit number of concurrent workers started by hardware concurrency. #2753
Conversation
Just chatted with @pcmoritz and there is some concern about …
  // The second-last argument is the worker command for Python, not needed for Java.
  String[] cmds = new String[]{filePath, rayletSocketName, storeName, ip, gcsIp,
-     gcsPort, "" + numWorkers, resourceArgument,
+     gcsPort, "" + numWorkers, "" + maximumStartupConcurrency, resourceArgument,
Suggestion: String.valueOf(maximumStartupConcurrency);
Thanks! I'll also change it for "" + numWorkers then.
Fixed.
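As a hedged sketch, the fixed construction might look like this (the variables are those from the diff above and are assumed to be in scope in the enclosing method):

```java
// Sketch of the argument construction after the review fix. All variables
// (filePath, rayletSocketName, etc.) are assumed to be in scope, as in the
// diff above.
String[] cmds = new String[]{
    filePath, rayletSocketName, storeName, ip, gcsIp, gcsPort,
    String.valueOf(numWorkers),                  // was: "" + numWorkers
    String.valueOf(maximumStartupConcurrency),   // was: "" + maximumStartupConcurrency
    resourceArgument
};
```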
  const std::unordered_map<Language, std::vector<std::string>> &worker_commands)
-     : num_workers_per_process_(num_workers_per_process), num_cpus_(num_cpus) {
+     : num_workers_per_process_(num_workers_per_process),
+       maximum_startup_concurrency_(maximum_startup_concurrency) {
We should set maximum_startup_concurrency_ to at least 1, in case the user passes 0 or a negative number by accident.
Hm, I think 0 should be allowed as a way to specify that no workers should be created.
If maximum_startup_concurrency_ is 0, no workers will ever be created. I think that's useless? Did you mean the number of initial workers to create at startup? That is another parameter, num_worker_processes.
So, it's not necessarily useless, because you can still submit tasks and get objects. It just means that the node won't execute any tasks.
However, I think the appropriate way to specify that no tasks should be executed on a node is to set the resources to 0 instead of setting maximum_startup_concurrency_ to 0. E.g., if we have 1 CPU resource and maximum_startup_concurrency_ is 0, that could lead to hanging.
Maybe we should simply do RAY_CHECK(maximum_startup_concurrency_ > 0).
I just pushed this change, let me know what you think.
We should keep the ability to create a ray cluster node that's guaranteed to accept no work. I commonly use this for testing local vs. remote task execution throughput and object store throughput.
@atumanov thanks, that's a good point.
@atumanov @raulchen I think the "correct" way to specify that no tasks should run on a node is to specify 0 resources, as opposed to limiting the worker startup concurrency.
Especially since a 0-resource node can still execute tasks that require 0 resources, so we may need to start workers even on nodes with 0 resources.
@raulchen about the RAY_CHECK: changing the value sounds like it could potentially hide errors or cause confusion. If 0 is not a valid input, then I'd prefer to fail immediately (of course, it should not be possible to trigger the fatal code path from the Python/Java side).
@robertnishihara , yeah that's right. I think of ray nodes in terms of their multi-dimensional resource capacity. I think that the only units of compute that may require zero resources are actor tasks, in which case the actor creation task must've required resources, which would not have been placed on the zero-cpu node. It's a good sanity check and probably a good time to ensure that we don't have (don't support?) cases where the sum of actor creation resreq and actor task resreq is ever zero. That just wouldn't make sense to me.
FWIW, thus far, whenever I wanted a "no-op" ray node (one that accepts no work), I specify --num-cpus 0 --num-workers 0.
Agreed that specifying 0 resources is the proper way to get a no-op node.
@robertnishihara regarding the RAY_CHECK, your reasoning makes sense to me. Previously I was concerned that users might accidentally specify 0 CPUs and have no idea what went wrong (new users may not know where to find the log file). Now that we ensure the argument is at least 1 in Python & Java, this won't happen.
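A minimal sketch of the Java-side guard being described, assuming the cap is derived from the node's CPU resource as in the diff below (the clamp-to-1 form is an assumption, not a quote from the PR):

```java
// Sketch: derive the startup-concurrency cap from the node's CPU resource
// and clamp it to at least 1, so the raylet-side
// RAY_CHECK(maximum_startup_concurrency_ > 0) cannot be triggered from Java.
int maximumStartupConcurrency =
    Math.max(1, staticResources.get("CPU").intValue());
```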
@@ -623,9 +623,11 @@ private void startRaylet(String storeName, AddressInfo info, int numWorkers,
      String resourceArgument = ResourceUtil.getResourcesStringFromMap(staticResources);
+     int maximumStartupConcurrency = staticResources.get("CPU").intValue();
Similar to psutil.cpu_count() in Python, Java has Runtime.getRuntime().availableProcessors().
I added that in to match the Python side.
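A hedged sketch of what that fallback could look like in Java (the null-check branch and the Double value type are assumptions for illustration, not necessarily what the PR does):

```java
// Use the configured CPU resource when present; otherwise fall back to the
// JVM's hardware concurrency, the Java analogue of Python's
// psutil.cpu_count(). The fallback branch is an assumption.
Double cpuResource = staticResources.get("CPU");
int maximumStartupConcurrency = (cpuResource != null)
    ? Math.max(1, cpuResource.intValue())
    : Runtime.getRuntime().availableProcessors();
```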
@raulchen I'm happy with the current state of the PR, so please let me know if there are additional changes that should be made.
When the number of workers is specified on the command line or through the ray.init API, the maximum startup cap won't be honored, due to the semantics of num_workers. We treat the explicit specification of num_workers as a request from the user to start all of them on raylet startup (the force_start boolean flag). The suggestion is to (a) deprecate num_workers or (b) break this guarantee.
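For illustration only, the cap semantics being debated could be summarized as follows (names are hypothetical, not the raylet's actual code):

```java
// Hypothetical summary of the rule under discussion: an explicit
// num_workers request (force start) bypasses the concurrency cap, while
// demand-driven worker starts are limited by it.
static boolean canStartWorker(int workersCurrentlyStarting,
                              int maximumStartupConcurrency,
                              boolean forceStart) {
  return forceStart || workersCurrentlyStarting < maximumStartupConcurrency;
}
```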
I think this PR looks good.
Thanks @raulchen, want to go ahead and merge it? Test failures seem unrelated to me.
@robertnishihara I was checking the test failure as well. It should be unrelated, but I just re-ran the failed one to make sure. I'll merge this a bit later when it finishes.
OK, I don't think I made myself clear. This PR does NOT fix #2751 when num_workers is explicitly specified.
@atumanov sounds good, I'll push a change.
@atumanov should be fixed now.
The re-triggered CI job has succeeded, and the latest 2 commits only touch comments. It should be safe to merge now.
This fixes #2751.
This is related to some discussion in #2168.