
[sgd] add support for additional resources per worker #18327

Merged 7 commits into ray-project:master on Sep 3, 2021

Conversation

matthewdeng (Contributor)

Extend the Trainer constructor to support a resources_per_worker argument.

Why are these changes needed?

This allows the user to input resource requests for custom resources.

This also allows the user to override the number of CPUs and GPUs that each worker reserves by specifying entries for "CPU" or "GPU" (case-sensitive). Resource amounts can be fractional (between 0 and 1) or whole numbers greater than 1.
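A minimal usage sketch, assuming the ray.util.sgd.v2 import path and a Torch backend; the custom "accelerator_memory" resource name is purely illustrative:

```python
from ray.util.sgd.v2 import Trainer  # import path assumed; may differ by Ray version

trainer = Trainer(
    backend="torch",
    num_workers=4,
    use_gpu=True,
    # Each worker reserves half a GPU plus one unit of a custom resource,
    # in addition to its default CPU allocation.
    resources_per_worker={"GPU": 0.5, "accelerator_memory": 1},
)
```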

GPU Validation:
Special validation ensures that any requested "GPU" value is consistent with the use_gpu argument (a sketch of these rules follows the list below).

  1. By default (if "GPU" is not a key in resources_per_worker), 1 GPU will be requested if use_gpu is True and 0 if False.
  2. If use_gpu is True and "GPU" is present in resources_per_worker, its value must be greater than 0.
  3. If use_gpu is False and "GPU" is present in resources_per_worker, its value must be exactly 0. Note that this is equivalent to the default behavior.
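The three rules above can be summarized in a small helper. This is only a sketch of the described behavior, not the PR's actual implementation; the function name is hypothetical.

```python
def validate_gpu_request(use_gpu: bool, resources_per_worker: dict) -> float:
    """Return the number of GPUs each worker should request.

    Sketch of the three validation rules described above (hypothetical helper).
    """
    if "GPU" not in resources_per_worker:
        # Rule 1: default to 1 GPU when use_gpu is True, otherwise 0.
        return 1 if use_gpu else 0
    num_gpus = resources_per_worker["GPU"]
    if use_gpu and num_gpus <= 0:
        # Rule 2: an explicit "GPU" entry must be positive when use_gpu=True.
        raise ValueError(
            "`use_gpu` is True but `resources_per_worker['GPU']` is not positive.")
    if not use_gpu and num_gpus != 0:
        # Rule 3: "GPU" must be exactly 0 when use_gpu=False.
        raise ValueError(
            "`use_gpu` is False but `resources_per_worker['GPU']` is nonzero.")
    return num_gpus
```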

Implementation

  1. Trainer handles separating CPU and GPU requests from additional (custom) resource requests, and performs GPU request validation.
  2. BackendExecutor is a simple passthrough from Trainer to WorkerGroup.
  3. Currently WorkerGroup does a simple passthrough of the resources to ray.remote. Validation should be added there to ensure that the requested resources can be fulfilled, raising an error if they cannot (see the sketch after this list).
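As an illustration of step 3, the sketch below shows how per-worker CPU, GPU, and custom resource requests could be forwarded to a Ray actor via .options(). The Worker class and create_worker helper are hypothetical and not the actual WorkerGroup code.

```python
import ray

ray.init()

# Hypothetical worker actor; the real WorkerGroup wraps user training
# functions rather than this toy class.
@ray.remote
class Worker:
    def ping(self):
        return "ok"

def create_worker(num_cpus, num_gpus, additional_resources):
    # CPU and GPU counts map to num_cpus/num_gpus; any remaining entries
    # are forwarded as Ray custom resources.
    return Worker.options(
        num_cpus=num_cpus,
        num_gpus=num_gpus,
        resources=additional_resources,
    ).remote()

# The actor stays pending unless the cluster advertises the custom resource,
# which is exactly the hang described in the next section.
worker = create_worker(
    num_cpus=1, num_gpus=0, additional_resources={"accelerator_memory": 1})
```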

Resource Fulfillment Validation

The current implementation solves for the happy path in which the aggregate amount of resources requested across all the workers can be successfully fulfilled. When the resource request cannot be fulfilled, the user script will hang indefinitely as it waits for resources that may never be available.

In an ideal world, the WorkerGroup should detect when the resource request cannot be fulfilled and raise an error. A simple (incomplete) solution would be to compare the requested resources with ray.available_resources() (a sketch follows the list below). For a more robust solution, the following must be taken into consideration:

  1. If autoscaling is enabled, the amount of available resources may change to support the request.
  2. If there are multiple processes in the same Ray cluster, they may contend for available resources.
  3. When placement group support is added, resources will be requested earlier when the placement group is created, rather than when the workers are created.
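A minimal sketch of the naive check mentioned above, based on ray.available_resources(); the function name is hypothetical, and the check inherits all three limitations listed above.

```python
import ray

def check_resources_available(num_workers: int, resources_per_worker: dict) -> None:
    """Naive feasibility check: compare the aggregate request against
    ray.available_resources(). Incomplete by design; it ignores autoscaling,
    concurrent jobs, and future placement group support.
    """
    available = ray.available_resources()
    for resource, amount in resources_per_worker.items():
        needed = amount * num_workers
        if available.get(resource, 0) < needed:
            raise RuntimeError(
                f"Requested {needed} of resource '{resource}' but only "
                f"{available.get(resource, 0)} is currently available.")

# Example (requires a running Ray instance):
# ray.init()
# check_resources_available(num_workers=4, resources_per_worker={"CPU": 1, "GPU": 0.5})
```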

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment

LGTM! Can you check lint?

@xwjiang2010 (Contributor)

LGTM!
Just a general comment: I see that SGDv2, RLlib, and RayTune all have different ways of specifying resources. I put some logic in trial_executor to give the user more information. In the long run, we can put together a help page with more detailed instructions.

@amogkam amogkam merged commit 26f73eb into ray-project:master Sep 3, 2021