[sgd] add support for additional resources per worker #18327
Merged
Extend the `Trainer` constructor to support a `resources_per_worker` argument.

Why are these changes needed?

This allows the user to input resource requests for custom resources.

This also allows the user to override the number of CPUs and GPUs that each worker will reserve by specifying entries for `"CPU"` or `"GPU"` (case-sensitive). The number of resources can be fractional (between 0 and 1) or multiple (integers greater than 1).

GPU Validation:
Special validation is done to ensure that any requested `"GPU"` value is consistent with the `use_gpu` argument.

- By default (`"GPU"` is not a key in `resources_per_worker`), 1 GPU will be requested if `use_gpu` is `True` and 0 if `False`.
- If `use_gpu` is `True` and `"GPU"` is present in `resources_per_worker`, its value must be greater than 0.
- If `use_gpu` is `False` and `"GPU"` is present in `resources_per_worker`, its value must be exactly 0. Note that this is equivalent to the default behavior.

Implementation
- `Trainer` handles separating CPU and GPU requests from additional (custom) resource requests, and performs GPU request validation.
- `BackendExecutor` is a simple passthrough from `Trainer` to `WorkerGroup`.
- `WorkerGroup` does a simple passthrough of the resources to `ray.remote`. Validation should be added here to ensure that the requested resources can be fulfilled, and to raise an error if they cannot.

Resource Fulfillment Validation
The current implementation solves for the happy path in which the aggregate amount of resources requested across all the workers can be successfully fulfilled. When the resource request cannot be fulfilled, the user script will hang indefinitely as it waits for resources that may never be available.
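To make the failure mode concrete, here is a minimal sketch of how the aggregate request could be computed and compared against a snapshot of resources. The helper names `aggregate_request` and `find_shortfalls` are hypothetical and not part of this PR. The snapshot is assumed to be a plain dict of resource name to amount, such as the one returned by `ray.available_resources()`.

```python
from typing import Dict


def aggregate_request(
    num_workers: int, resources_per_worker: Dict[str, float]
) -> Dict[str, float]:
    """Total amount of each resource requested across all workers."""
    return {
        name: amount * num_workers
        for name, amount in resources_per_worker.items()
    }


def find_shortfalls(
    requested: Dict[str, float], available: Dict[str, float]
) -> Dict[str, float]:
    """Return how much of each resource is missing from the snapshot.

    A resource that is absent from the snapshot counts as 0 available.
    """
    shortfalls = {}
    for name, amount in requested.items():
        missing = amount - available.get(name, 0.0)
        if missing > 0:
            shortfalls[name] = missing
    return shortfalls


# Hypothetical example: 4 workers, each asking for 2 CPUs, half a GPU,
# and one unit of a custom resource named "custom".
requested = aggregate_request(4, {"CPU": 2, "GPU": 0.5, "custom": 1})

# Hypothetical snapshot with no "custom" resource at all.
snapshot = {"CPU": 16.0, "GPU": 1.0}
shortfalls = find_shortfalls(requested, snapshot)
```

With this snapshot the request falls short on `"GPU"` and `"custom"`, which is the situation in which the user script would currently wait indefinitely. A point-in-time snapshot is only an approximation, since resources may be freed or added later.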
In an ideal world, the `WorkerGroup` should detect when the resource request cannot be fulfilled, and raise an error. A simple (incomplete) solution would be to compare the number of resources requested with `ray.available_resources()`. For a more robust solution the following must be taken into consideration:

Checks
- I've run `scripts/format.sh` to lint the changes in this PR.
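For reference, the GPU validation rules and the CPU/GPU/custom split described in this PR could be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `split_worker_resources` is hypothetical, and the default of 1 CPU per worker is an assumption that the description does not state.

```python
from typing import Dict, Optional, Tuple


def split_worker_resources(
    use_gpu: bool,
    resources_per_worker: Optional[Dict[str, float]] = None,
) -> Tuple[float, float, Dict[str, float]]:
    """Split a request into (num_cpus, num_gpus, additional_resources).

    Keys are case-sensitive: only "CPU" and "GPU" are treated specially,
    everything else is passed through as a custom resource.
    """
    resources = dict(resources_per_worker or {})

    # Assumption: 1 CPU per worker unless "CPU" is given explicitly.
    num_cpus = resources.pop("CPU", 1)
    # Default: 1 GPU if use_gpu is True, 0 otherwise.
    num_gpus = resources.pop("GPU", int(use_gpu))

    if use_gpu and num_gpus <= 0:
        raise ValueError(
            '"GPU" must be greater than 0 when use_gpu is True, '
            f"got {num_gpus}."
        )
    if not use_gpu and num_gpus != 0:
        raise ValueError(
            '"GPU" must be exactly 0 when use_gpu is False, '
            f"got {num_gpus}."
        )
    return num_cpus, num_gpus, resources


# Fractional GPU plus a custom resource (hypothetical name "custom").
num_cpus, num_gpus, extra = split_worker_resources(
    use_gpu=True, resources_per_worker={"GPU": 0.5, "custom": 2}
)
```

The three returned values line up with the `num_cpus`, `num_gpus`, and `resources` options accepted by `ray.remote`, which is where the `WorkerGroup` passthrough ends up.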