
[sgd] Replaced class Resources in sgd with use_gpu #5252

Merged · 22 commits · Aug 1, 2019

Conversation

@jichan3751 (Contributor) commented on Jul 23, 2019

What do these changes do?

Removes the duplicate class Resources in python/ray/experimental/sgd/pytorch/utils.py and replaces it with the class Resources from python/ray/tune/trial.py.
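
For context, a before/after paraphrase of the constructor calls involved (import paths inferred from the file paths in the diffs below, so they are illustrative):

```python
# Before: SGD carried its own copy of Resources in
# python/ray/experimental/sgd/pytorch/utils.py.
from ray.experimental.sgd.pytorch import utils
resources = utils.Resources(num_cpus=1, num_gpus=0, resources={})

# After: the Tune class from python/ray/tune/trial.py is reused instead.
from ray.tune.trial import Resources
resources = Resources(cpu=1, gpu=0)
```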

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins: Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15592/

```diff
@@ -65,8 +66,7 @@ def __init__(self,
         self.optimizer_timer = utils.TimerStat(window_size=1)

         if resources_per_replica is None:
-            resources_per_replica = utils.Resources(
-                num_cpus=1, num_gpus=0, resources={})
+            resources_per_replica = Resources(cpu=1, gpu=0)

         if backend == "auto":
             backend = "nccl" if resources_per_replica.num_gpus > 0 else "gloo"
```

A Contributor comment (marked as resolved):

everywhere using `num_gpus` -> `gpu`

A Contributor comment (marked as resolved):

`int(use_gpu)`
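
Taken together, the two resolved comments sketch the direction of the change (an illustrative sketch, not the merged code; the helper names are hypothetical):

```python
from ray.tune.trial import Resources


def resolve_backend(backend, resources_per_replica):
    # First comment: the Tune Resources field is `gpu`, not `num_gpus`.
    if backend == "auto":
        backend = "nccl" if resources_per_replica.gpu > 0 else "gloo"
    return backend


def per_replica_resources(use_gpu):
    # Second comment: with a boolean use_gpu flag, the per-replica
    # GPU count is simply int(use_gpu).
    return Resources(cpu=1, gpu=int(use_gpu))
```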

```diff
@@ -3,6 +3,6 @@
 from __future__ import print_function

 from ray.experimental.sgd.pytorch.pytorch_trainer import PyTorchTrainer
-from ray.experimental.sgd.pytorch.utils import Resources
+from ray.tune.trial import Resources
```
@richardliaw (Contributor) commented on Jul 23, 2019:

let's move `Resources` into a new file, `ray.tune.resources`

@richardliaw (Contributor) commented:

OK, actually I got it; this change is fine so far. Let's make the SGD API simpler in this PR; the work you've done so far is progress towards that goal.

Can you implement the following:

```python
class PyTorchTrainer(object):
    ...
    def __init__(self,
                 model_creator,
                 data_creator,
                 optimizer_creator=utils.sgd_mse_optimizer,
                 config=None,
                 num_replicas=1,
                 use_gpu=False,  # Make this easier
                 batch_size=16,
                 backend="auto"):

    @classmethod
    def default_resource_request(cls, resource):
        # currently wrong syntax
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=cf["num_workers"],
            extra_gpu=int(use_gpu) * cf["num_workers"])
```
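
The snippet flags its own syntax as wrong; a corrected sketch (assuming the classmethod receives the trial config dict, and that the num_workers and use_gpu keys from the suggestion live in it) could look like:

```python
from ray.tune.trial import Resources  # import path used elsewhere in this PR


class PyTorchTrainer(object):
    @classmethod
    def default_resource_request(cls, config):
        # The head process reserves nothing for itself; each of the
        # num_workers replicas needs 1 CPU and, when use_gpu is set, 1 GPU.
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=config["num_workers"],
            extra_gpu=int(config["use_gpu"]) * config["num_workers"])
```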

@jichan3751 (Contributor, author) commented on Jul 26, 2019:

In order to apply use_gpu and default_resource_request, I think I need to know what extra_gpu and extra_cpus mean in class Resources.
Also, does removing resources_per_replica mean that the user cannot specify multiple GPUs for each replica?

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15678/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15676/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15679/

```diff
@@ -300,7 +162,7 @@ def __init__(self,
         if resources:
             raise ValueError(
                 "Resources for {} have been automatically set to {} "
-                "by its `default_resource_request()` method. Please "
+                "by its `default_resource_request()` method. Please "
```

A Contributor commented:

(revert?)

@richardliaw (Contributor) commented:

> In order to apply use_gpu and default_resource_request, I think I need to know what extra_gpu and extra_cpus mean in class Resources.

Ah, ok: extra_cpu and extra_gpu are the resources that will be needed to run the training but will not be used by the head process. For example, say I have 4 processes holding model replicas for distributed training, plus 1 process that does coordination. The coordination process may be the main process, and it launches the 4 replica processes. Here, I would want 1 CPU and 0 GPUs for the coordinator, and 1 CPU and 1 GPU for each of its 4 child processes.

This would correspond to Resources(cpu=1, gpu=0, extra_cpu=4, extra_gpu=4).

> Also, does removing resources_per_replica mean that the user cannot specify multiple GPUs for each replica?

Yes, I think that's ok for now.
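
For concreteness, a minimal self-contained sketch of how that example request adds up (using a plain namedtuple as a stand-in for the real ray.tune class, which has more fields):

```python
from collections import namedtuple

# Illustrative stand-in for ray.tune's Resources class.
Resources = namedtuple("Resources", ["cpu", "gpu", "extra_cpu", "extra_gpu"])

# 1 coordinator (1 CPU, 0 GPUs) plus 4 replicas (1 CPU, 1 GPU each).
request = Resources(cpu=1, gpu=0, extra_cpu=4, extra_gpu=4)

# Total reservation the scheduler would need to satisfy:
total_cpus = request.cpu + request.extra_cpu  # 5
total_gpus = request.gpu + request.extra_gpu  # 4
```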

@jichan3751 (Contributor, author) commented:

Can we merge this PR with only the change to Resources in experimental.sgd and the move of Resources to the new file ray.tune.resources, and work on default_resource_request and int(use_gpu) with the PyTorch trainable class in a separate PR?

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15800/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15801/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15803/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15806/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15814/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15816/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15815/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15818/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15817/

@richardliaw changed the title from "[sgd] Replaced class Resources in sgd to the one in ray.tune.trial" to "[sgd] Replaced class Resources in sgd with use_gpu" on Jul 31, 2019
@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15842/

@richardliaw merged commit bd6dfc9 into ray-project:master on Aug 1, 2019.
edoakes pushed a commit to edoakes/ray that referenced this pull request on Aug 9, 2019.