
[sgd] add support for additional resources per worker #18327

Merged 7 commits into ray-project:master on Sep 3, 2021

Conversation

matthewdeng (Contributor)

Extend the Trainer constructor to support a resources_per_worker argument.

Why are these changes needed?

This allows the user to input resource requests for custom resources.

This also allows the user to override the number of CPUs and GPUs that each worker reserves by specifying entries for "CPU" or "GPU" (case-sensitive). Resource amounts can be fractional (between 0 and 1) or whole numbers greater than 1.
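A minimal usage sketch, assuming the ray.util.sgd.v2 import path and a Torch backend; the custom "accelerator_memory" resource name is purely illustrative:

```python
from ray.util.sgd.v2 import Trainer  # import path assumed; may differ by Ray version

trainer = Trainer(
    backend="torch",
    num_workers=4,
    use_gpu=True,
    # Each worker reserves half a GPU plus one unit of a custom resource,
    # in addition to its default CPU allocation.
    resources_per_worker={"GPU": 0.5, "accelerator_memory": 1},
)
```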

GPU Validation:
Special validation ensures that any requested "GPU" value is consistent with the use_gpu argument (a sketch of these rules follows the list below).

  1. By default (if "GPU" is not a key in resources_per_worker), 1 GPU will be requested if use_gpu is True and 0 if False.
  2. If use_gpu is True and "GPU" is present in resources_per_worker, its value must be greater than 0.
  3. If use_gpu is False and "GPU" is present in resources_per_worker, its value must be exactly 0. Note that this is equivalent to the default behavior.
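The three rules above can be summarized in a small helper. This is only a sketch of the described behavior, not the PR's actual implementation; the function name is hypothetical.

```python
def validate_gpu_request(use_gpu: bool, resources_per_worker: dict) -> float:
    """Return the number of GPUs each worker should request.

    Sketch of the three validation rules described above (hypothetical helper).
    """
    if "GPU" not in resources_per_worker:
        # Rule 1: default to 1 GPU when use_gpu is True, otherwise 0.
        return 1 if use_gpu else 0
    num_gpus = resources_per_worker["GPU"]
    if use_gpu and num_gpus <= 0:
        # Rule 2: an explicit "GPU" entry must be positive when use_gpu=True.
        raise ValueError(
            "`use_gpu` is True but `resources_per_worker['GPU']` is not positive.")
    if not use_gpu and num_gpus != 0:
        # Rule 3: "GPU" must be exactly 0 when use_gpu=False.
        raise ValueError(
            "`use_gpu` is False but `resources_per_worker['GPU']` is nonzero.")
    return num_gpus
```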

Implementation

  1. Trainer handles separating CPU and GPU requests from additional (custom) resource requests, and performs GPU request validation.
  2. BackendExecutor is a simple passthrough from Trainer to WorkerGroup.
  3. Currently WorkerGroup does a simple passthrough of the resources to ray.remote. Validation should be added there to ensure that the requested resources can be fulfilled, raising an error if they cannot (see the sketch after this list).
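As an illustration of step 3, the sketch below shows how per-worker CPU, GPU, and custom resource requests could be forwarded to a Ray actor via .options(). The Worker class and create_worker helper are hypothetical and not the actual WorkerGroup code.

```python
import ray

ray.init()

# Hypothetical worker actor; the real WorkerGroup wraps user training
# functions rather than this toy class.
@ray.remote
class Worker:
    def ping(self):
        return "ok"

def create_worker(num_cpus, num_gpus, additional_resources):
    # CPU and GPU counts map to num_cpus/num_gpus; any remaining entries
    # are forwarded as Ray custom resources.
    return Worker.options(
        num_cpus=num_cpus,
        num_gpus=num_gpus,
        resources=additional_resources,
    ).remote()

# The actor stays pending unless the cluster advertises the custom resource,
# which is exactly the hang described in the next section.
worker = create_worker(
    num_cpus=1, num_gpus=0, additional_resources={"accelerator_memory": 1})
```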

Resource Fulfillment Validation

The current implementation solves for the happy path in which the aggregate amount of resources requested across all the workers can be successfully fulfilled. When the resource request cannot be fulfilled, the user script will hang indefinitely as it waits for resources that may never be available.

In an ideal world, the WorkerGroup should detect when the resource request cannot be fulfilled and raise an error. A simple (incomplete) solution would be to compare the requested resources with ray.available_resources() (a sketch follows the list below). For a more robust solution, the following must be taken into consideration:

  1. If autoscaling is enabled, the amount of available resources may change to support the request.
  2. If there are multiple processes in the same Ray cluster, they may contend for available resources.
  3. When placement group support is added, resources will be requested earlier when the placement group is created, rather than when the workers are created.
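A minimal sketch of the naive check mentioned above, based on ray.available_resources(); the function name is hypothetical, and the check inherits all three limitations listed above.

```python
import ray

def check_resources_available(num_workers: int, resources_per_worker: dict) -> None:
    """Naive feasibility check: compare the aggregate request against
    ray.available_resources(). Incomplete by design; it ignores autoscaling,
    concurrent jobs, and future placement group support.
    """
    available = ray.available_resources()
    for resource, amount in resources_per_worker.items():
        needed = amount * num_workers
        if available.get(resource, 0) < needed:
            raise RuntimeError(
                f"Requested {needed} of resource '{resource}' but only "
                f"{available.get(resource, 0)} is currently available.")

# Example (requires a running Ray instance):
# ray.init()
# check_resources_available(num_workers=4, resources_per_worker={"CPU": 1, "GPU": 0.5})
```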

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment

LGTM! Can you check lint?

@xwjiang2010 (Contributor)

LGTM!
Just a general comment: I see that SGDv2, RLlib, and RayTune all have different ways of specifying resources. I put some logic in trial_executor to give the user more information. In the long run, we can put together a help page with more detailed instructions.

@amogkam amogkam merged commit 26f73eb into ray-project:master Sep 3, 2021