[MXNET-37] tutorial for distributed training #9152
Conversation
Style nits:
- Use Capitals for Major Words in Your Titles Like This
- Try to keep the headings down to ### or #### at most. You shouldn't really jump from ## to ####
- I thought the latest pattern was to use `gluon.data.vision.MNIST` and `gluon.data.DataLoader`

Not show-stoppers, but I'd verify the gluon pattern in case you need to "upgrade" this tutorial first.
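For reference, a minimal sketch of the gluon data pattern that comment refers to; the transform and batch size are illustrative placeholders, not values taken from the tutorial:

```python
from mxnet import gluon
from mxnet.gluon.data.vision import transforms

# Sketch of the gluon.data.vision.MNIST + gluon.data.DataLoader pattern.
# batch_size and the ToTensor transform are illustrative only.
train_dataset = gluon.data.vision.MNIST(train=True).transform_first(transforms.ToTensor())
train_loader = gluon.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

for data, label in train_loader:
    # data has shape (batch_size, 1, 28, 28); label has shape (batch_size,)
    break
```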
docs/faq/distributed_training.md (Outdated)

> In this document, we describe how it works, how to launch a distributed training job and
> some environment variables which provide more control.
>
> ## Type of parallelism
Types of parallelism
docs/faq/distributed_training.md (Outdated)

> The scheduler then lets all processes know about every other node in the cluster, so that they can communicate with each other.
>
> #### KV Store
> MXNet provides a key-value store, which is a critical component used for multi-device and distributed training.
I think both multi-machine and multi-device are distributed training. The wording here is a bit confusing. What do you think?
+1.
One possible suggestion: we can use multi-device (which by default would mean multiple devices on a single machine) and multi-machine training. Both fall under the umbrella of distributed training. I'm assuming there is no separate concept of multi-device-on-multi-machine training (i.e., host1-gpu0, host1-gpu1, host2-gpu0, host2-gpu1)?
What about:

Worker Communication via the KVStore API

MXNet provides a key-value store, which is a critical component used in distributed training. The communication of parameters across processors on a single machine, as well as across multiple machines, is relayed through one or more servers with a key-value store for the parameters. MXNet's KVStore API is the interface for workers to communicate with the one or more servers holding the key-value store. {now start with...} Workers push gradients...
I'm left wondering though - what's the worker-to-device mapping (is an 8-gpu system one worker or 8)? Do the params get rolled up on a node with multiple gpus before going to the server? Or does each device count as a worker and talks to a server directly? Also, since there seems to be support for multiple servers, it's not clear if the workers talk to all of them or if there's something else going on...
@eric-haibin-lin @pracheer Ya, multi-device seems appropriate
@aaronmarkham I've modified the paragraph as per your suggestion and added a line about what happens when there are multiple devices on a single machine. I've also moved up the section which describes the distribution of keys on servers
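To make the push/pull flow concrete, here is a minimal sketch of a worker talking to the servers through the KVStore API. The key `3`, the shapes, and the values are arbitrary, and `dist_sync` assumes the process was started through the distributed launcher so that scheduler, server, and worker roles exist:

```python
import mxnet as mx

# Create a distributed, synchronous key-value store. This only works when the
# process is launched as part of a distributed job (scheduler + servers + workers).
kv = mx.kv.create('dist_sync')

shape = (2, 3)
kv.init(3, mx.nd.ones(shape))        # initialize key 3 on the servers

grad = mx.nd.ones(shape) * kv.rank   # stand-in for a locally computed gradient
kv.push(3, grad)                     # push the gradient to the servers

weight = mx.nd.zeros(shape)
kv.pull(3, out=weight)               # pull the aggregated result back
print(kv.rank, kv.num_workers, weight.asnumpy())
```

As I understand it, a worker can also push a list of per-device arrays for the same key, in which case they are aggregated on the machine before being sent to the servers.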
In general, this reads really well.
Just a few minor nitpicks.
docs/faq/distributed_training.md (Outdated)

> Currently, MXNet supports Model parallelism in a single machine only. Refer [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/how_to/model_parallel_lstm.html) for more on this.
>
> ## How does distributed training work?
> The architecture of distributed training in MXNet is as follows:
nitpick: "architecture" word here is a bit off since where you go on from there are more like concepts involved in distributed training. So do you think "concepts involved in distributed training in MXNet are as follows" is more appropriate?
'The following concepts are key to understanding distributed training in MXNet'
WDYT?
docs/faq/index.md (Outdated)

> * [How can I train with multiple CPU/GPUs with data parallelism?](http://mxnet.io/how_to/multi_devices.html)
> * [How can I train with multiple CPU/GPUs on a single machine with data parallelism?](http://mxnet.io/how_to/multi_devices.html)
>
> * [How can I train using multiple machines with data parallelism?](http://mxnet.io/how_to/distributed_training.html)
>
> * [How can I train with multiple GPUs with model parallelism?](http://mxnet.io/how_to/model_parallel_lstm.html)
Probably a nitpick: multiple GPUs on a single machine, multiple machines, or does it not matter?
I think multiple machines is okay
docs/faq/distributed_training.md (Outdated)

> When the array size is bigger than this threshold, `MXNET_KVSTORE_REDUCTION_NTHREADS` threads are used for reduction.
> This parameter is also used as a load balancer in kvstore.
> It controls when to partition a single weight to all the servers.
> If the size of a single weight is less than this bound, then it is sent to a single randomly picked server; otherwise, it is partitioned to all the servers.
You mean a single weight matrix?
ya, changed
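For illustration only, a sketch of how one might experiment with this threshold from Python; the value is arbitrary, and in practice the variable is usually set in the environment of the launched worker and server processes rather than in code:

```python
import os

# Arbitrary example value: arrays (e.g. a single weight matrix) with more
# elements than this bound are partitioned across all servers; smaller ones
# go to a single randomly picked server. Set it before importing mxnet so
# the backend sees it when the kvstore is created.
os.environ['MXNET_KVSTORE_BIGARRAY_BOUND'] = str(1000000)

import mxnet as mx  # noqa: E402
```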
docs/faq/distributed_training.md
Outdated
- `PS_VERBOSE=1` logs connection information like the IPs and ports of all nodes | ||
- `PS_VERBOSE=2` logs all data communication information | ||
|
||
When the network is unreliable, messages being sent from one node to another might get lost. |
Really minor nitpick: adding another blank line before this paragraph starts may make it more obvious that it is associated with the two variables, PS_RESEND and PS_RESEND_TIMEOUT, that follow it.
ok
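As a sketch of how these variables fit together (the values are illustrative; they normally need to be present in the environment of the worker and server processes, e.g. exported before launching):

```python
import os

# Illustrative values only.
os.environ['PS_VERBOSE'] = '1'            # log connection info (IPs, ports) of all nodes
os.environ['PS_RESEND'] = '1'             # retransmit messages that are not acknowledged in time
os.environ['PS_RESEND_TIMEOUT'] = '1000'  # how long to wait before resending, in milliseconds

import mxnet as mx  # noqa: E402
```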
docs/faq/distributed_training.md
Outdated
See [environment variables](#environment-variables) for more details. | ||
|
||
#### Gradient compression | ||
When communication cost is expensive, and the ratio of computation time to communication time is low, communication can become a bottleneck. |
minor nitpick: "When communication cost is expensive"
good catch
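For reference, a minimal sketch of enabling 2-bit gradient compression on a distributed kvstore; the threshold value is just an example, and the same options can also be passed to `gluon.Trainer` through its `compression_params` argument:

```python
import mxnet as mx

kv = mx.kv.create('dist_sync')

# Enable 2-bit gradient compression; 0.5 is an example threshold, not a recommendation.
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})
```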
docs/faq/distributed_training.md
Outdated
An easy way to set up a cluster of EC2 instances for distributed deep learning is by using the [AWS CloudFormation template](https://github.com/awslabs/deeplearning-cfn). | ||
If you can not use the above, this section will help you manually set up a cluster of instances | ||
to enable you to use `ssh` for launching a distributed training job. | ||
Let us denote one machine as the `master` of the cluster, through which we will launch and monitor the distributed training on all machines. |
really really really really nitpicky: "cluster through" (no comma)
Any updates on this?
Addressed comments and updated the PR.
@aaronmarkham @pracheer Is this good to go?
@rahul003 Could you fix the conflicts?
Fixed.
Is there any content about how to profile the time of operator execution and communication, and how well they are overlapped?
Hi, the community has voted to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E). We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA issue at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] your title, where MXNET-xxxx is the JIRA id. Thanks!
@TaoLv Sorry I missed your comment. You can profile the worker process similarly to the single-machine case.
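Something along these lines, as a sketch using the current profiler API; the output file name pattern is just an example, chosen so that each worker writes its own profile:

```python
import mxnet as mx

kv = mx.kv.create('dist_sync')

# Each worker writes its profile to a different file, based on its rank.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_worker_%d.json' % kv.rank)
mx.profiler.set_state('run')

# ... run training as usual ...

mx.profiler.set_state('stop')
```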
Note the use of the rank above, which ensures that each worker saves its profile to a different path. In the profile, you can look for the KVStore push/pull operators to see the time taken for communication. I'll add a proper section on profiling to the tutorial once #9933 is merged; that makes it easy to profile the server processes too.
@rahul003 Thanks for your reply. I will try that. It seems some links in this tutorial are broken. Another minor suggestion: I think it would be better if you could add some pictures to show the architecture or scalability of the parameter server.
@rahul003 Another question: is distributed training also supported by other frontend languages? This tutorial seems to describe only how to run it with Python.
@rahul003 ping
Honestly, I've not used other language bindings, and that can be something we add in the future. I think we can merge this in.
@piiswrong Can we merge this?
* initial draft
* WIP tutorial
* WIP tutorial, with mnist script changes Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* WIP tutorial, with mnist script changes Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* use logger Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* remove from old page Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* first draft of tutorial and removing pythonpath inserts for get_data, by moving them to test_utils Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* fix typos Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* rename functions Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* small change in section heading Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* fix reimport Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* Update distributed_training.md
* Update distributed_training.md Punctuation and minor changes
* fix gluon iterators and address some review comments Signed-off-by: Rahul <rahulhuilgol@gmail.com>
* Update multi_devices.md
* Update distributed_training.md indentation change
* Update distributed_training.md cmake instruction
* retain only doc changes
* comments addressed
* fix link of gradient compression page
* clarifying launch.py usage
* update env var info
* update broken links
* update comment on splitting data
Description
This PR adds a tutorial for distributed training.
It covers
Direct link:
docs/faq/distributed_training.md
I split the old PR to separate the doc change from the code change.
The doc change is in this PR.
The example referred to in this tutorial requires this PR to go in.
Checklist
Essentials
* N/A, doc changes only
* Passes `make lint`

Changes
Comments
This is unnecessary and bad user experience IMHO.