[tune] horovod trainable #10304
Conversation
Nice! Awesome to see so much progress so quickly.
```python
return self.workers
...
class Coordinator:
```
Some of this we can probably move into Horovod so we don't have to expose these internals (which may change) to Ray.
I think it would be good to have a simple `horovod.ray.run` API to use. Would that be sufficient for Ray Tune?
I think so. However, I would need something a bit lower level than `spark.run`. Specifically, I would need to be able to pass in an arbitrary class and obtain a list of Ray actors with Horovod started on them:
```python
trainable = wrap_function(self.__class__.function)
assert type(trainable) == type  # Ray Tune specific construct.
# encapsulate logic in horovod repo
actors = hvd.ray.start_actors(
    trainable, num_workers=100, elastic=False, use_gpu=True)
ray.get([a.method_foo.remote() for a in actors])
```
Note that this actually gives you a lot of flexibility. For example,
```python
class CustomExecutor:
    def execute(self, fn, args):
        return fn(args)

actors = hvd.ray.start_actors(CustomExecutor)

def ray_hvd_run(*args, **kwargs):
    return ray.get([a.execute.remote(*args, **kwargs) for a in actors])

def train_func(args):
    hvd.init()
    ...

# something similar to spark run
ray_hvd_run(train_func, args="foobar")
```
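The dispatch pattern above can be sketched without Ray or Horovod as a plain-Python mock. Everything here (`make_executors`, `run_on_all`) is illustrative scaffolding standing in for `hvd.ray.start_actors` and `ray.get`, not an actual API:

```python
# Plain-Python mock of the actor-dispatch pattern from the comment above.
# Real Ray actors would run execute() remotely; here it runs inline.
class CustomExecutor:
    def execute(self, fn, *args, **kwargs):
        return fn(*args, **kwargs)

def make_executors(num_workers):
    # Stand-in (hypothetical) for hvd.ray.start_actors(CustomExecutor, ...).
    return [CustomExecutor() for _ in range(num_workers)]

def run_on_all(executors, fn, *args, **kwargs):
    # Stand-in for ray.get([a.execute.remote(...) for a in actors]).
    return [e.execute(fn, *args, **kwargs) for e in executors]

def train_func(args):
    # A real worker would call hvd.init() here before training.
    return f"trained on {args}"

executors = make_executors(4)
results = run_on_all(executors, train_func, "foobar")
```

The point of the indirection is that the caller controls both the executor class and the function sent to each worker, which is the flexibility the comment describes.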
This is basically what I do in `_HorovodTrainable.setup()`. If this sounds good, I could easily factor it into its own object and move it into Horovod.
That sounds good. The two things I would add would be:
- I think it would be useful to wrap everything in some kind of "Job" object to manage the lifecycle of all the components.
- We could then build a higher level API on top of this for users who don't need the lower-level control.
So something like:
```python
# module horovod.ray
def create_job(num_hosts, num_slots, executor_cls=_default_executor_cls):
    ...

def run(train_fn, args, kwargs, num_hosts, num_slots):
    job = create_job(num_hosts, num_slots)
    try:
        job.start()
        return job.execute(lambda w: w.execute(train_fn, *args, **kwargs))
    finally:
        job.stop()
```
Something like that. What do you think? Would that give you enough flexibility for this use case?
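The proposed start/execute/stop contract can be mocked in plain Python to make the lifecycle concrete. All names here (`Job`, `Worker`, `create_job`, `run`) mirror the sketch in the comment above but are illustrative, not Horovod's actual API:

```python
# Plain-Python mock of the proposed horovod.ray Job lifecycle (illustrative).
class Worker:
    def execute(self, fn, *args, **kwargs):
        return fn(*args, **kwargs)

class Job:
    def __init__(self, num_hosts, num_slots, executor_cls=Worker):
        # One worker per slot per host; a real Job would create Ray actors.
        self.workers = [executor_cls() for _ in range(num_hosts * num_slots)]
        self.started = False

    def start(self):
        self.started = True  # a real Job would launch actors and rendezvous

    def execute(self, fn):
        assert self.started, "Job must be started before execute()"
        return [fn(w) for w in self.workers]

    def stop(self):
        self.started = False  # a real Job would tear down actors

def create_job(num_hosts, num_slots, executor_cls=Worker):
    return Job(num_hosts, num_slots, executor_cls)

def run(train_fn, args, kwargs, num_hosts, num_slots):
    job = create_job(num_hosts, num_slots)
    try:
        job.start()
        return job.execute(lambda w: w.execute(train_fn, *args, **kwargs))
    finally:
        job.stop()

# 2 hosts x 2 slots = 4 workers, each doubling the input.
out = run(lambda x: x * 2, args=(21,), kwargs={}, num_hosts=2, num_slots=2)
```

The try/finally ensures workers are always torn down, which is the lifecycle-management benefit of wrapping everything in a Job object.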
Yeah, that definitely sounds good; let me push a refactor.
Looks good so far, but I will have to look more into the Horovod Job abstraction.
```python
node_id = f"node:{ray.services.get_node_ip_address()}"
remote_cls = ray.remote(BaseHorovodWorker)
remote_cls = remote_cls.options(
    num_cpus=0, num_gpus=0, resources={node_id: 0.01})
```
Will this fail if we add more than 100 workers per node?
Yeah; though most likely GPUs on a node will be limited to 16.
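The 100-worker cap follows from simple fractional-resource arithmetic: the per-node custom resource has capacity 1.0, and each worker reserves 0.01 of it, so the scheduler can place at most 100 such workers on one node. A quick check (assuming standard fractional-resource accounting, where requests are deducted from a fixed capacity):

```python
# Why resources={node_id: 0.01} caps placement at 100 workers per node:
# the custom node resource has capacity 1.0 and each worker reserves 0.01.
node_capacity = 1.0
per_worker = 0.01
max_workers = int(node_capacity / per_worker)
```

A 101st worker would need 0.01 from an already-exhausted resource and would stay pending, which is the failure mode the question is probing.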
```python
def test_colocator_gpu(tmpdir, ray_start_4_cpus_4_gpus):
    SetColocator = NodeColocator.options(num_cpus=4, num_gpus=4)
```
Is that fixture imported here? It seems to be defined only in `test_horovod.py`. We should also probably move this file to `tests`.
Yeah, this should be included upstream in horovod.
Why are these changes needed?
This PR allows users to utilize Horovod with Ray Tune.
Caveats:
- `HOROVOD_WITH_GLOO` (Horovod must be built with Gloo support).
TODO:
cc @tgaddair
Related issue number
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.