
[tune] horovod trainable #10304

Merged: 44 commits into ray-project:master on Sep 3, 2020

Conversation

richardliaw (Contributor) commented on Aug 25, 2020:

Why are these changes needed?

This PR allows users to utilize Horovod with Ray Tune.

Caveats:

  1. It depends on the Gloo communicator: Horovod must be installed with HOROVOD_WITH_GLOO=1.
  2. It assumes that workers are placed symmetrically across nodes (i.e., the same workers_per_node on every node).
  3. It handles identity strings in an unsafe way, but there is a prominent warning about this in the docs.
  4. NIC isolation/selection is currently unsupported. I'm not quite familiar with what to do here, but would be happy to push a fix if given tips.
  5. Function checkpointing is currently unsupported (but would not be hard to add).
def train(config):
    import horovod.torch as hvd  # any Horovod framework binding works here
    hvd.init()
    # ... training loop; tensors are averaged across workers via hvd.allreduce(...)

from ray import tune
from ray.tune.integration.horovod import DistributedTrainableCreator

trainable_cls = DistributedTrainableCreator(
    train, num_nodes=1, num_workers_per_node=2, use_gpu=True)

tune.run(trainable_cls)
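Conceptually, Horovod's allreduce averages a tensor across all workers each step. As a rough illustration (the function name here is hypothetical; this is a pure-Python sketch, not Horovod's implementation), the default averaging reduction behaves like this:

```python
def allreduce_mean(values):
    """Average one value per worker, as Horovod's default allreduce does."""
    return sum(values) / len(values)

# Four workers each contribute a local gradient value.
local_grads = [1.0, 2.0, 3.0, 4.0]
print(allreduce_mean(local_grads))  # → 2.5
```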

TODO:

  • Add tests
  • Add dependencies for tests
  • Add documentation
  • Add another example

cc @tgaddair

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@richardliaw richardliaw changed the title [tune][wip] horovod trainable [tune] horovod trainable Aug 26, 2020
@richardliaw richardliaw marked this pull request as ready for review August 26, 2020 00:56
@tgaddair (Contributor) left a comment:
Nice! Awesome to see so much progress so quickly.

Reviewed files:
  python/ray/tune/examples/horovod_simple.py
  python/ray/tune/integration/horovod.py
return self.workers


class Coordinator:
tgaddair (Contributor) commented:
Some of this we can probably move into Horovod so we don't have to expose these internals (which may change) to Ray.

I think it would be good to have a simple horovod.ray.run API to use. Would that be sufficient for Ray Tune?

richardliaw (Contributor, Author) replied on Aug 28, 2020:

I think so. However, I would need something a bit lower level than spark.run. Specifically, I would need to be able to pass in an arbitrary class and obtain a list of Ray actors with Horovod started on them:

trainable = wrap_function(self.__class__.function)
assert type(trainable) == type  # Ray Tune specific construct.

# encapsulate logic in horovod repo
actors = hvd.ray.start_actors(trainable, num_workers=100, elastic=False, use_gpu=True)

ray.get([a.method_foo.remote() for a in actors])

Note that this actually gives you a lot of flexibility. For example,

class CustomExecutor:
    def execute(self, fn, args):
        return fn(args)

actors = hvd.ray.start_actors(CustomExecutor)

def ray_hvd_run(*args, **kwargs):
    return ray.get([a.execute.remote(*args, **kwargs) for a in actors])

def train_func(args):
    hvd.init()
    ...

# something similar to spark run
ray_hvd_run(train_func, args="foobar")
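For intuition, the executor indirection sketched above can be mimicked locally without Ray. In this sketch, CustomExecutor and ray_hvd_run mirror the names above, but the actor pool is a plain Python list standing in for the hypothetical hvd.ray.start_actors:

```python
class CustomExecutor:
    """Stand-in for a Ray actor: runs an arbitrary function on a 'worker'."""
    def execute(self, fn, *args, **kwargs):
        return fn(*args, **kwargs)

# Local analogue of hvd.ray.start_actors(CustomExecutor, num_workers=4)
executors = [CustomExecutor() for _ in range(4)]

def ray_hvd_run(fn, *args, **kwargs):
    # Local analogue of ray.get([a.execute.remote(fn, ...) for a in actors])
    return [e.execute(fn, *args, **kwargs) for e in executors]

print(ray_hvd_run(lambda x: x * 2, 21))  # → [42, 42, 42, 42]
```

The key design point is that the executor only needs a generic `execute(fn, ...)` entry point; any framework-specific logic lives in the function it is handed.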

richardliaw (Contributor, Author) added on Aug 28, 2020:
This is basically what I do in _HorovodTrainable.setup(). If this sounds good, I could easily factor it into its own object/move it into Horovod.

tgaddair (Contributor) replied:

That sounds good. The two things I would add would be:

  1. I think it would be useful to wrap everything in some kind of "Job" object to manage the lifecycle of all the components.
  2. We could then build a higher level API on top of this for users who don't need the lower-level control.

So something like:

# module horovod.ray

def create_job(num_hosts, num_slots, executor_cls=_default_executor_cls):
    ...

def run(train_fn, args, kwargs, num_hosts, num_slots):
    job = create_job(num_hosts, num_slots)
    try:
        job.start()
        return job.execute(lambda w: w.execute(train_fn, *args, **kwargs))
    finally:
        job.stop()

Something like that. What do you think? Would that give you enough flexibility for this use case?
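A Ray-free toy version of this lifecycle pattern can make the start/execute/stop contract concrete. Everything here is a hypothetical sketch mirroring the create_job/run pseudocode above, not the eventual horovod.ray API:

```python
class Job:
    """Toy job managing a fixed worker pool; no real processes are launched."""
    def __init__(self, num_hosts, num_slots):
        # One integer "rank" per slot stands in for a real worker actor.
        self.workers = list(range(num_hosts * num_slots))
        self.started = False

    def start(self):
        self.started = True  # a real version would launch actors and rendezvous

    def execute(self, fn):
        assert self.started, "Job.execute called before Job.start"
        return [fn(w) for w in self.workers]

    def stop(self):
        self.started = False  # a real version would tear down actors

def run(train_fn, num_hosts=1, num_slots=2):
    job = Job(num_hosts, num_slots)
    try:
        job.start()
        return job.execute(train_fn)
    finally:
        job.stop()  # cleanup happens even if training raises

print(run(lambda rank: rank * 10, num_hosts=1, num_slots=2))  # → [0, 10]
```

The try/finally mirrors the pseudocode above: wrapping the lifecycle in a Job object guarantees teardown regardless of how execution ends.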

richardliaw (Contributor, Author) replied:

Yeah, that definitely sounds good; let me push a refactor.

@krfricke (Contributor) left a comment:

Looks good so far, but I will have to look more closely into the Horovod job setup.

python/ray/tune/integration/horovod.py
node_id = f"node:{ray.services.get_node_ip_address()}"
remote_cls = ray.remote(BaseHorovodWorker)
remote_cls = remote_cls.options(
    num_cpus=0, num_gpus=0, resources={node_id: 0.01})
krfricke (Contributor) asked:

Will this fail if we add more than 100 workers per node?

richardliaw (Contributor, Author) replied:

Yes, though in practice the number of GPUs (and hence workers) per node will most likely be limited to 16.
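The 100-worker limit comes from Ray's fractional custom resources: each node exposes 1.0 of its node:&lt;ip&gt; resource, and each worker reserves 0.01 of it, so at most 1.0 / 0.01 = 100 workers fit on one node. A quick sanity check of the arithmetic (variable names are illustrative only):

```python
node_resource_total = 1.0  # Ray assigns 1.0 of the node:<ip> custom resource per node
per_worker_share = 0.01    # each worker reserves this fraction via resources={node_id: 0.01}
max_workers_per_node = round(node_resource_total / per_worker_share)
print(max_workers_per_node)  # → 100
```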



def test_colocator_gpu(tmpdir, ray_start_4_cpus_4_gpus):
SetColocator = NodeColocator.options(num_cpus=4, num_gpus=4)
krfricke (Contributor) asked:

Is that fixture imported here? It seems to be defined only in test_horovod.py.

We should also probably move this file to the tests directory.

richardliaw (Contributor, Author) replied:

Yeah, this should be included upstream in horovod.

@richardliaw richardliaw added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 3, 2020
@richardliaw richardliaw merged commit 43a7a64 into ray-project:master Sep 3, 2020
@richardliaw richardliaw deleted the horovod-trainable branch September 3, 2020 23:53
4 participants