
Limit number of concurrent workers started by hardware concurrency. #2753

Merged
raulchen merged 13 commits into ray-project:master on Aug 29, 2018

Conversation

@robertnishihara (Collaborator) commented Aug 27, 2018

This fixes #2751.

This is related to some discussion in #2168.

@robertnishihara (Collaborator Author):

Just chatted with @pcmoritz and there is some concern about std::thread::hardware_concurrency being unreliable. One alternative would be to pass the concurrency in from the command line (and compute it in Python with psutil.cpu_count()).
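For context only, a minimal C++ sketch (not code from this PR; the helper name DetectStartupConcurrency is hypothetical) of why std::thread::hardware_concurrency() needs a guard: the standard allows it to return 0 when the count cannot be determined, so a fallback is needed before using it as a worker cap.

// Hypothetical helper, not part of this PR: guard against
// std::thread::hardware_concurrency() reporting 0 ("unknown").
#include <thread>

int DetectStartupConcurrency() {
  unsigned int detected = std::thread::hardware_concurrency();
  // The standard permits a return value of 0 when the count is not computable.
  return detected > 0 ? static_cast<int>(detected) : 1;
}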

@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7794/

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7796/

@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7799/

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7801/

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7803/

@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7802/

// The second-last argument is the worker command for Python, not needed for Java.
String[] cmds = new String[]{filePath, rayletSocketName, storeName, ip, gcsIp,
-        gcsPort, "" + numWorkers, resourceArgument,
+        gcsPort, "" + numWorkers, "" + maximumStartupConcurrency, resourceArgument,
Contributor:
String.valueOf(maximumStartupConcurrency);

Collaborator Author:
Thanks! I'll also change it for "" + numWorkers then.

Collaborator Author:
Fixed.

const std::unordered_map<Language, std::vector<std::string>> &worker_commands)
-    : num_workers_per_process_(num_workers_per_process), num_cpus_(num_cpus) {
+    : num_workers_per_process_(num_workers_per_process),
+      maximum_startup_concurrency_(maximum_startup_concurrency) {
Contributor:
We should set maximum_startup_concurrency_ to at least 1, in case the user passes 0 or a negative number by accident.
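For illustration, a minimal sketch of the clamping suggested above (the helper name is hypothetical; as discussed below, the discussion leaned toward a RAY_CHECK instead of silent clamping):

// Illustrative only: force the configured value to be at least 1 so an
// accidental 0 or negative input cannot silently disable worker startup.
#include <algorithm>

int ClampStartupConcurrency(int requested_concurrency) {
  return std::max(1, requested_concurrency);
}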

Collaborator Author:
Hm, I think 0 should be allowed as a way to specify that no workers should be created.

Contributor:
If maximum_startup_concurrency_ is 0, no workers will ever be created. I think that's useless?

Did you mean the number of initial workers to create when starting? That is another parameter, num_worker_processes.

Collaborator Author:
So, it's not necessarily useless, because you can still submit tasks and get objects. It just means that that node won't execute any tasks.

However, I think the appropriate way to specify that no tasks should be executed on a node is to set the resources to 0, instead of setting maximum_startup_concurrency_ to 0. E.g., if we have 1 CPU resource and maximum_startup_concurrency_ is 0, then that could lead to hanging.

Maybe we should simply do RAY_CHECK(maximum_startup_concurrency_ > 0).
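As a self-contained sketch of that fail-fast option (this is a stand-in class, not the real WorkerPool; in Ray itself the manual abort below would be the RAY_CHECK mentioned above):

// Illustrative stand-in for the real WorkerPool: validate the argument in the
// constructor and fail loudly instead of silently clamping it.
#include <cstdlib>
#include <iostream>

class WorkerPoolSketch {
 public:
  explicit WorkerPoolSketch(int maximum_startup_concurrency)
      : maximum_startup_concurrency_(maximum_startup_concurrency) {
    if (maximum_startup_concurrency_ <= 0) {
      std::cerr << "maximum_startup_concurrency must be positive, got "
                << maximum_startup_concurrency_ << std::endl;
      std::abort();  // Ray would use RAY_CHECK(maximum_startup_concurrency > 0).
    }
  }

 private:
  int maximum_startup_concurrency_;
};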

Collaborator Author:
I just pushed this change, let me know what you think.

Contributor:
We should keep the ability to create a ray cluster node that's guaranteed to accept no work. I commonly use this for testing local vs. remote task execution throughput and object store throughput.

Contributor:
@atumanov thanks, that's a good point.

Collaborator Author:
@atumanov @raulchen I think the "correct" way to specify that no tasks should run on a node is to specify 0 resources, as opposed to limiting the worker startup concurrency.

Especially since a 0-resource node can still execute tasks that require 0 resources, so we may need to start workers even on nodes with 0 resources.

@raulchen about the RAY_CHECK, changing the value sounds like it could potentially hide errors or cause confusion. If 0 is not a valid input, then I'd prefer to fail immediately (of course it should not be possible to trigger the fatal code path from the Python/Java side).

Contributor:
@robertnishihara , yeah that's right. I think of ray nodes in terms of their multi-dimensional resource capacity. I think that the only units of compute that may require zero resources are actor tasks, in which case the actor creation task must've required resources, which would not have been placed on the zero-cpu node. It's a good sanity check and probably a good time to ensure that we don't have (don't support?) cases where the sum of actor creation resreq and actor task resreq is ever zero. That just wouldn't make sense to me.

FWIW, thus far, whenever I wanted a "no-op" ray node (one that accepts no work), I specify --num-cpus 0 --num-workers 0.

Contributor:
Agreed that specifying 0 resources is the proper way to create no-op nodes.

@robertnishihara regarding the RAY_CHECK, your reasoning makes sense to me. Previously I was concerned that users might accidentally specify 0 CPUs and have no idea what went wrong (new users may not know where to find the log file). Now that we ensure the argument is at least 1 in Python & Java, this won't happen.

@@ -623,9 +623,11 @@ private void startRaylet(String storeName, AddressInfo info, int numWorkers,

String resourceArgument = ResourceUtil.getResourcesStringFromMap(staticResources);

int maximumStartupConcurrency = staticResources.get("CPU").intValue();
Contributor:
Similar to psutil.cpu_count() in Python, Java has Runtime.getRuntime().availableProcessors()

Collaborator Author:
I added that in to match the Python side.

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7807/

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7817/

@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7821/

@robertnishihara (Collaborator Author):

@raulchen I'm happy with the current state of the PR, so please let me know if there are additional changes that should be made.

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7829/

@atumanov (Contributor) left a comment:
When the number of workers is specified on the command line or through the ray.init API, the maximum startup cap won't be honored, due to the semantics of num_workers: we treat the explicit specification of num_workers as a request from the user to start all of them on raylet startup (the force_start boolean flag). The suggestion is to either (a) deprecate num_workers or (b) break this guarantee.

@raulchen (Contributor) left a comment:
I think this PR looks good.

@robertnishihara (Collaborator Author):

Thanks @raulchen, want to go ahead and merge it? Test failures seem unrelated to me.

@raulchen (Contributor):

@robertnishihara I was checking the test failure as well. It should be unrelated, but I just re-ran the failed one to make sure. I'll merge this a bit later when it finishes.

@atumanov (Contributor):

OK, I don't think I made myself clear. This PR does NOT fix #2751 when num_workers is specified. See my comment above for details. We may agree to merge this in its current form with the proviso that num_workers will be deprecated in the near future (#2759). But could we at least leave a comment to that effect in WorkerPool::StartWorkerProcess (where the force_start boolean flag overrides the effect of maximum_startup_concurrency_)?
https://github.com/ray-project/ray/pull/2753/files#diff-64c50c0cc29e8f9f96ebf5df2b668f1cR108
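To make the force_start interaction concrete, here is a self-contained sketch of the behavior being described; the class name, fields, and method signature are illustrative stand-ins and are not copied from WorkerPool::StartWorkerProcess:

// Illustrative stand-in, not the real WorkerPool: worker-process launches are
// throttled by the startup cap unless force_start is set (as it is when
// num_workers is specified explicitly).
class StartupThrottleSketch {
 public:
  explicit StartupThrottleSketch(int maximum_startup_concurrency)
      : maximum_startup_concurrency_(maximum_startup_concurrency) {}

  // Returns true if a new worker process would be launched now.
  bool StartWorkerProcess(bool force_start) {
    if (!force_start &&
        starting_worker_processes_ >= maximum_startup_concurrency_) {
      // Throttled: wait for an in-flight worker to register before starting more.
      return false;
    }
    ++starting_worker_processes_;
    return true;
  }

  // Called when a previously started worker registers with the raylet.
  void OnWorkerRegistered() {
    if (starting_worker_processes_ > 0) {
      --starting_worker_processes_;
    }
  }

 private:
  const int maximum_startup_concurrency_;
  int starting_worker_processes_ = 0;
};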

@robertnishihara (Collaborator Author):

@atumanov sounds good, I'll push a change.

@robertnishihara (Collaborator Author):

@atumanov should be fixed now.

@atumanov (Contributor) left a comment:
LGTM. Correctness is pending #2762 or #2759.

@raulchen (Contributor):

The re-triggered CI job has succeeded, and the latest 2 commits only touch comments. It should be safe to merge now.

@raulchen merged commit 132f133 into ray-project:master on Aug 29, 2018
@robertnishihara deleted the startworkersslower branch on August 29, 2018 06:55
@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7846/

@AmplabJenkins:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7849/

Successfully merging this pull request may close these issues:
test/actor_test.py::ActorsWithGPUs::testActorMultipleGPUsFromMultipleTasks failing