
[MXNET-37] tutorial for distributed training #9152

Merged: 29 commits merged into apache:master on Apr 10, 2018

Conversation

@rahul003 (Member) commented Dec 20, 2017

Description

This PR adds a tutorial for distributed training.
It covers

  • how distributed training works
  • setting up a cluster
  • how to run distributed training using launch.py and different options it supports
  • how to run distributed training without launch.py
  • related environment variables

Direct link:
docs/faq/distributed_training.md

I split the old PR to separate the doc change from the code change; the doc change is in this PR.
The example referred to in this tutorial requires the code-change PR to go in.
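For context, here is a minimal sketch (not taken from the tutorial itself) of what a training script does differently in distributed mode: it creates a distributed kvstore and uses the store's rank to identify the worker. The 'dist_sync' mode and the rank/num_workers attributes are part of MXNet's KVStore API.

import mxnet as mx

# Create a distributed kvstore. launch.py (or a manual launch) starts this
# script on every worker and sets the environment variables the kvstore
# needs to locate the scheduler and servers.
kv = mx.kvstore.create('dist_sync')
print('worker %d of %d' % (kv.rank, kv.num_workers))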

Checklist

Essentials

N/A Doc changes only

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, the API doc string has been updated.
      • For new C++ functions in header files, their functionality and arguments are documented.
      • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change.

Changes

  • Moved the old section from multi_devices.md to a new page, distributed_training.md, because it is long

Comments

  • The example this refers to requires the code-change PR to be merged. The reason is that the example currently uses a Python path insert to fetch a file from the tests/ folder. This causes a few problems:
    1. It requires the user to have the whole mxnet repository, with the directory structure intact.
    2. It means the user can't run the example standalone.
    3. For distributed training, it means we would have to synchronize the whole mxnet directory across all machines.
      This is unnecessary and a bad user experience, IMHO.

@aaronmarkham (Contributor) left a comment:

Style nits:

  • Use Capitals for Major Words in Your Titles Like This
  • Try to keep the headings down to ### or #### at most. You shouldn't really jump from ## to ####
  • I thought the latest pattern was to use gluon.data.vision.MNIST and gluon.data.DataLoader

Not showstoppers, but I'd verify the gluon pattern in case you need to "upgrade" this tutorial first.
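For reference, here is a minimal sketch of the Gluon data-loading pattern mentioned above (the batch size is just an illustrative value):

from mxnet import gluon

# Load MNIST through Gluon's dataset/DataLoader API instead of the older
# mx.io iterators.
train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True),
                                   batch_size=64, shuffle=True)
for data, label in train_data:
    break  # each batch is a (data, label) pair of NDArrays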

In this document, we describe how it works, how to launch a distributed training job and
some environment variables which provide more control.

## Type of parallelism
Member commented:

Types of parallelism

The scheduler then lets all processes know about every other node in the cluster, so that they can communicate with each other.

#### KV Store
MXNet provides a key-value store, which is a critical component used for multi-device and distributed training.
Member commented:

I think both multi-machine and multi-device are distributed training. The wording here is a bit confusing. What do you think?

Contributor commented:

+1.

One possible suggestion is we can use multi-device (which by default would mean multiple devices on a single machine) and multi-machine training. Both fall under the umbrella of distributed training. I'm assuming there is no separate concept of multi-device, multi-machine training (i.e., host1-gpu0, host1-gpu1, host2-gpu0, host2-gpu1)?

@aaronmarkham (Contributor) commented Jan 5, 2018:

What about?

Worker Communication via the KVStore API

MXNet provides a key-value store, which is a critical component used in distributed training. The communication of parameters across processors on a single machine, as well across multiple machines, is relayed through one or more servers with a key-value store for the parameters. MXNet's KVStore API is the interface for workers to communicate with the one or more servers holding the key-value store. {now start with...} Workers push gradients...

I'm left wondering though - what's the worker-to-device mapping (is an 8-gpu system one worker or 8)? Do the params get rolled up on a node with multiple gpus before going to the server? Or does each device count as a worker and talks to a server directly? Also, since there seems to be support for multiple servers, it's not clear if the workers talk to all of them or if there's something else going on...

@rahul003 (Member, Author) replied:

@eric-haibin-lin @pracheer Ya, multi-device seems appropriate

@aaronmarkham I've modified the paragraph as per your suggestion and added a line about what happens when there are multiple devices on a single machine. I've also moved up the section which describes the distribution of keys on servers
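To make the push/pull wording concrete, here is a short hedged sketch of the KVStore API being discussed (a 'local' store is used so the snippet runs standalone; in distributed training the store would be created with 'dist_sync' or 'dist_async'):

import mxnet as mx

kv = mx.kvstore.create('local')    # 'dist_sync' or 'dist_async' in distributed training
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))      # initialize the value stored for key 3
kv.push(3, mx.nd.ones(shape) * 2)  # workers push gradients for key 3
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                # workers pull the aggregated result
print(out.asnumpy())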

@pracheer (Contributor) left a comment:

In general, this reads really well.
Just a few minor nitpicks.

Currently, MXNet supports Model parallelism in a single machine only. Refer [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/how_to/model_parallel_lstm.html) for more on this.

## How does distributed training work?
The architecture of distributed training in MXNet is as follows:
Contributor commented:

nitpick: "architecture" word here is a bit off since where you go on from there are more like concepts involved in distributed training. So do you think "concepts involved in distributed training in MXNet are as follows" is more appropriate?

@rahul003 (Member, Author) replied:

'The following concepts are key to understanding distributed training in MXNet'

WDYT?


* [How can I train with multiple CPU/GPUs with data parallelism?](http://mxnet.io/how_to/multi_devices.html)
* [How can I train with multiple CPU/GPUs on a single machine with data parallelism?](http://mxnet.io/how_to/multi_devices.html)

* [How can I train using multiple machines with data parallelism?](http://mxnet.io/how_to/distributed_training.html)

* [How can I train with multiple GPUs with model parallelism?](http://mxnet.io/how_to/model_parallel_lstm.html)
Contributor commented:

probably a nitpick: multiple GPUs on a single machine, or multiple machines, or does it not matter?

@rahul003 (Member, Author) replied:

I think multiple machines is okay

When the array size is bigger than this threshold, `MXNET_KVSTORE_REDUCTION_NTHREADS` threads are used for reduction.
This parameter is also used as a load balancer in kvstore.
It controls when to partition a single weight to all the servers.
If the size of a single weight is less than this bound, then it is sent to a single randomly picked server; otherwise, it is partitioned to all the servers.
Contributor commented:

you mean single weight matrix?

@rahul003 (Member, Author) replied:

ya, changed

- `PS_VERBOSE=1` logs connection information like the IPs and ports of all nodes
- `PS_VERBOSE=2` logs all data communication information

When the network is unreliable, messages being sent from one node to another might get lost.
Contributor commented:

really minor nitpick: adding another blank line before this paragraph starts may make it more obvious that the paragraph is associated with the two variables PS_RESEND and PS_RESEND_TIMEOUT that follow it.

@rahul003 (Member, Author) replied:

ok
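As an aside (not part of the tutorial text), these ps-lite settings are plain environment variables, so one hedged way to experiment with them is to set them in each node's process environment before the kvstore is created, for example:

import os

os.environ['PS_VERBOSE'] = '1'            # log connection info (IPs and ports of all nodes)
os.environ['PS_RESEND'] = '1'             # resend messages that may have been lost
os.environ['PS_RESEND_TIMEOUT'] = '1000'  # wait this many milliseconds for an ACK before resending

# Importing mxnet after setting the variables ensures they are in place
# before any native initialization in this process.
import mxnet as mx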

See [environment variables](#environment-variables) for more details.

#### Gradient compression
When communication cost is expensive, and the ratio of computation time to communication time is low, communication can become a bottleneck.
Contributor commented:

minor nitpick: "When communication cost is expensive"

@rahul003 (Member, Author) replied:

good catch
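For reference, the gradient compression switch this section of the tutorial describes is exposed on the kvstore; a minimal sketch (the threshold value is illustrative, not a recommendation):

import mxnet as mx

kv = mx.kvstore.create('dist_sync')
# Enable 2-bit gradient compression; gradient values smaller than the threshold
# are accumulated locally until they grow large enough to be sent.
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})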

An easy way to set up a cluster of EC2 instances for distributed deep learning is by using the [AWS CloudFormation template](https://github.com/awslabs/deeplearning-cfn).
If you can not use the above, this section will help you manually set up a cluster of instances
to enable you to use `ssh` for launching a distributed training job.
Let us denote one machine as the `master` of the cluster, through which we will launch and monitor the distributed training on all machines.
Contributor commented:

really really really really nitpicky: cluster , through (no comma)
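Since the setup described here is about launching distributed training over ssh from the master machine, a quick connectivity sanity check (assuming passwordless ssh keys are set up and a hosts file with one hostname or IP per line, as launch.py's ssh launcher expects) could look like this:

import subprocess

# Try a no-op command on every host; BatchMode makes ssh fail instead of
# prompting for a password, so missing keys show up immediately.
with open('hosts') as f:
    hosts = [line.strip() for line in f if line.strip()]

for h in hosts:
    ret = subprocess.call(['ssh', '-o', 'BatchMode=yes', h, 'true'])
    print('%s: %s' % (h, 'ok' if ret == 0 else 'ssh failed'))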

@piiswrong (Contributor):

any updates on this?

@rahul003 (Member, Author):

Addressed comments and updated the PR

@eric-haibin-lin (Member):

@aaronmarkham @pracheer is this good to go?

@rahul003 rahul003 requested a review from szha as a code owner January 26, 2018 18:20
@eric-haibin-lin (Member):

@rahul003 could you fix the conflicts?

@rahul003 (Member, Author):

Fixed

@TaoLv (Member) commented Jan 31, 2018:

Is there any content about how to profile the time of operator execution and communication, and how well they are overlapped?

@CodingCat (Contributor):

Hi, the community has passed a vote to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E).

We have updated the guidelines for contributors in https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as "[MXNET-xxxx] your title", where MXNET-xxxx is the JIRA id.

Thanks!

@rahul003 (Member, Author) commented Mar 7, 2018:

@TaoLv Sorry I missed your comment. You can profile the worker process similarly to the single-machine case.

import mxnet as mx

kv = mx.kvstore.create('dist_sync')  # the distributed kvstore used for training
# Use the worker's rank in the filename so each worker writes its own profile.
mx.profiler.profiler_set_config(mode='all', filename=str(kv.rank) + 'profile_output.json')
mx.profiler.profiler_set_state('run')
# Code to be profiled goes here...
mx.profiler.profiler_set_state('stop')

Note the use of rank above, which ensures that each worker saves its profile to a different path.

In the profile output, you can look for the KVStore push/pull operators to see the time taken for communication.

I'll add a proper section for profiling to the tutorial once #9933 is merged. That makes it easy to profile the server processes too.

@rahul003 changed the title from "tutorial for distributed training" to "[MXNET-37] tutorial for distributed training" on Mar 7, 2018
@TaoLv (Member) commented Mar 8, 2018:

@rahul003 Thanks for your reply. I will try that. It seems some links in this tutorial are broken:
Training with multiple GPUs using model parallelism
Using data from S3 for training

Another minor suggestion: I think it would be better if you could add some pictures to show the architecture or scalability of the parameter server.

@TaoLv (Member) commented Mar 22, 2018:

@rahul003 Another question: is distributed training also supported by other frontend languages? It seems this tutorial only describes how to run it with Python.

@piiswrong (Contributor):

@rahul003 ping

@rahul003 (Member, Author) commented Apr 3, 2018:

Honestly, I've not used other language bindings, and that can be something we can add in the future. I think we can merge this in.

@anirudh2290 (Member):

@piiswrong can we merge this?

@piiswrong piiswrong merged commit 24362d0 into apache:master Apr 10, 2018
rahul003 added a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* initial draft

* WIP tutorial

* WIP tutorial, with mnist script changes

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* WIP tutorial, with mnist script changes

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* use logger

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* remove from old page

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* first draft of tutorial
and removing pythonpath inserts for get_data, by moving them to test_utils

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* fix typos

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* rename functions

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* small change in section heading

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* fix reimport

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* Update distributed_training.md

* Update distributed_training.md

Punctuation and minor changes

* fix gluon iterators
and address some review comments

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* Update multi_devices.md

* Update distributed_training.md

indentation change

* Update distributed_training.md

cmake instruction

* retain only doc changes

* comments addressed

* fix link of gradient compression page

* clarifying launch.py usage

* update env var info

* update broken links

* update comment on splitting data
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018