
[sgd] Replaced class Resources in sgd with use_gpu #5252

Merged · 22 commits · Aug 1, 2019

Conversation

@jichan3751 (Contributor) commented on Jul 23, 2019

What do these changes do?

Removes the duplicate class Resources in python/ray/experimental/sgd/pytorch/utils.py and replaces it with the class Resources from python/ray/tune/trial.py.
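
For context, a before/after paraphrase of the constructor calls involved (import paths inferred from the file paths in the diffs below, so they are illustrative):

```python
# Before: SGD carried its own copy of Resources in
# python/ray/experimental/sgd/pytorch/utils.py.
from ray.experimental.sgd.pytorch import utils
resources = utils.Resources(num_cpus=1, num_gpus=0, resources={})

# After: the Tune class from python/ray/tune/trial.py is reused instead.
from ray.tune.trial import Resources
resources = Resources(cpu=1, gpu=0)
```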

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins: Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15592/

```diff
@@ -65,8 +66,7 @@ def __init__(self,
         self.optimizer_timer = utils.TimerStat(window_size=1)

         if resources_per_replica is None:
-            resources_per_replica = utils.Resources(
-                num_cpus=1, num_gpus=0, resources={})
+            resources_per_replica = Resources(cpu=1, gpu=0)

         if backend == "auto":
             backend = "nccl" if resources_per_replica.num_gpus > 0 else "gloo"
```

A Contributor comment (marked as resolved):

everywhere using `num_gpus` -> `gpu`

A Contributor comment (marked as resolved):

`int(use_gpu)`
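
Taken together, the two resolved comments sketch the direction of the change (an illustrative sketch, not the merged code; the helper names are hypothetical):

```python
from ray.tune.trial import Resources


def resolve_backend(backend, resources_per_replica):
    # First comment: the Tune Resources field is `gpu`, not `num_gpus`.
    if backend == "auto":
        backend = "nccl" if resources_per_replica.gpu > 0 else "gloo"
    return backend


def per_replica_resources(use_gpu):
    # Second comment: with a boolean use_gpu flag, the per-replica
    # GPU count is simply int(use_gpu).
    return Resources(cpu=1, gpu=int(use_gpu))
```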

```diff
@@ -3,6 +3,6 @@
 from __future__ import print_function

 from ray.experimental.sgd.pytorch.pytorch_trainer import PyTorchTrainer
-from ray.experimental.sgd.pytorch.utils import Resources
+from ray.tune.trial import Resources
```
@richardliaw (Contributor) commented on Jul 23, 2019:

let's move `Resources` into a new file, `ray.tune.resources`

@richardliaw (Contributor) commented:

OK, actually I got it; this change is fine so far. Let's make the SGD API simpler in this PR; the work you've done so far is progress towards that goal.

Can you implement the following:

```python
class PyTorchTrainer(object):
    ...
    def __init__(self,
                 model_creator,
                 data_creator,
                 optimizer_creator=utils.sgd_mse_optimizer,
                 config=None,
                 num_replicas=1,
                 use_gpu=False,  # Make this easier
                 batch_size=16,
                 backend="auto"):

    @classmethod
    def default_resource_request(cls, resource):
        # currently wrong syntax
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=cf["num_workers"],
            extra_gpu=int(use_gpu) * cf["num_workers"])
```
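
The snippet flags its own syntax as wrong; a corrected sketch (assuming the classmethod receives the trial config dict, and that the num_workers and use_gpu keys from the suggestion live in it) could look like:

```python
from ray.tune.trial import Resources  # import path used elsewhere in this PR


class PyTorchTrainer(object):
    @classmethod
    def default_resource_request(cls, config):
        # The head process reserves nothing for itself; each of the
        # num_workers replicas needs 1 CPU and, when use_gpu is set, 1 GPU.
        return Resources(
            cpu=0,
            gpu=0,
            extra_cpu=config["num_workers"],
            extra_gpu=int(config["use_gpu"]) * config["num_workers"])
```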

@jichan3751 (Contributor, author) commented on Jul 26, 2019:

In order to apply use_gpu and default_resource_request, I think I need to know what extra_gpu and extra_cpus mean in class Resources.
Also, does removing resources_per_replica mean that the user cannot specify multiple GPUs for each replica?

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15678/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15676/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15679/

```diff
@@ -300,7 +162,7 @@ def __init__(self,
         if resources:
             raise ValueError(
                 "Resources for {} have been automatically set to {} "
-                "by its `default_resource_request()` method. Please "
+                "by its `default_resource_request()` method. Please "
```

A Contributor commented:

(revert?)

@richardliaw (Contributor) commented:

> In order to apply use_gpu and default_resource_request, I think I need to know what extra_gpu and extra_cpus mean in class Resources.

Ah, ok: extra_cpu and extra_gpu are the resources that will be needed to run the training but will not be used by the head process. For example, say I have 4 processes holding model replicas for distributed training, plus 1 process that does coordination. The coordination process may be the main process, and it launches the 4 replica processes. Here, I would want 1 CPU and 0 GPUs for the coordinator, and 1 CPU and 1 GPU for each of its 4 child processes.

This would correspond to Resources(cpu=1, gpu=0, extra_cpu=4, extra_gpu=4).

> Also, does removing resources_per_replica mean that the user cannot specify multiple GPUs for each replica?

Yes, I think that's ok for now.
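
For concreteness, a minimal self-contained sketch of how that example request adds up (using a plain namedtuple as a stand-in for the real ray.tune class, which has more fields):

```python
from collections import namedtuple

# Illustrative stand-in for ray.tune's Resources class.
Resources = namedtuple("Resources", ["cpu", "gpu", "extra_cpu", "extra_gpu"])

# 1 coordinator (1 CPU, 0 GPUs) plus 4 replicas (1 CPU, 1 GPU each).
request = Resources(cpu=1, gpu=0, extra_cpu=4, extra_gpu=4)

# Total reservation the scheduler would need to satisfy:
total_cpus = request.cpu + request.extra_cpu  # 5
total_gpus = request.gpu + request.extra_gpu  # 4
```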

@jichan3751 (Contributor, author) commented:

Can we merge this PR with only the change to Resources in experimental.sgd and the move of Resources to the new file ray.tune.resources, and work on default_resource_request and int(use_gpu) with the PyTorch trainable class in a separate PR?

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15800/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15801/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15803/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15806/

@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15814/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15816/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15815/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15818/

@AmplabJenkins: Test FAILed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15817/

@richardliaw changed the title from "[sgd] Replaced class Resources in sgd to the one in ray.tune.trial" to "[sgd] Replaced class Resources in sgd with use_gpu" on Jul 31, 2019
@AmplabJenkins: Test PASSed. Build results: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15842/

@richardliaw merged commit bd6dfc9 into ray-project:master on Aug 1, 2019.
edoakes pushed a commit to edoakes/ray that referenced this pull request on Aug 9, 2019.