
[MXNET-37] tutorial for distributed training #9152

Merged: 29 commits merged into apache:master on Apr 10, 2018

Conversation

@rahul003 (Member) commented Dec 20, 2017

Description

This PR adds a tutorial for distributed training.
It covers

  • how distributed training works
  • setting up a cluster
  • how to run distributed training using launch.py and different options it supports
  • how to run distributed training without launch.py
  • related environment variables

Direct link:
docs/faq/distributed_training.md

I split the old PR to separate the doc change from the code change; the doc change is in this PR.
The example referred to in this tutorial requires the code-change PR to go in.
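For context, here is a minimal sketch (not taken from the tutorial itself) of what a training script does differently in distributed mode: it creates a distributed kvstore and uses the store's rank to identify the worker. The 'dist_sync' mode and the rank/num_workers attributes are part of MXNet's KVStore API.

import mxnet as mx

# Create a distributed kvstore. launch.py (or a manual launch) starts this
# script on every worker and sets the environment variables the kvstore
# needs to locate the scheduler and servers.
kv = mx.kvstore.create('dist_sync')
print('worker %d of %d' % (kv.rank, kv.num_workers))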

Checklist

Essentials

N/A Doc changes only

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, the API doc string has been updated.
      • For new C++ functions in header files, their functionality and arguments are documented.
      • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change.

Changes

  • Moved the old section from multi_devices.md to a new page, distributed_training.md, because it is long

Comments

  • The example this refers to requires the code-change PR to be merged. The reason is that the example currently uses a Python path insert to fetch a file from the tests/ folder. This causes a few problems:
    1. It requires the user to have the whole mxnet repository, with the directory structure intact.
    2. It means the user can't run the example standalone.
    3. For distributed training, it means we would have to synchronize the whole mxnet directory across all machines.
      This is unnecessary and a bad user experience, IMHO.

@aaronmarkham (Contributor) left a comment:

Style nits:

  • Use Capitals for Major Words in Your Titles Like This
  • Try to keep the headings down to ### or #### at most. You shouldn't really jump from ## to ####
  • I thought the latest pattern was to use gluon.data.vision.MNIST and gluon.data.DataLoader

Not showstoppers, but I'd verify the gluon pattern in case you need to "upgrade" this tutorial first.
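For reference, here is a minimal sketch of the Gluon data-loading pattern mentioned above (the batch size is just an illustrative value):

from mxnet import gluon

# Load MNIST through Gluon's dataset/DataLoader API instead of the older
# mx.io iterators.
train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True),
                                   batch_size=64, shuffle=True)
for data, label in train_data:
    break  # each batch is a (data, label) pair of NDArrays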

In this document, we describe how it works, how to launch a distributed training job and
some environment variables which provide more control.

## Type of parallelism
Member commented:

Types of parallelism

The scheduler then lets all processes know about every other node in the cluster, so that they can communicate with each other.

#### KV Store
MXNet provides a key-value store, which is a critical component used for multi-device and distributed training.
Member commented:

I think both multi-machine and multi-device are distributed training. The wording here is a bit confusing. What do you think?

Contributor commented:

+1.

One possible suggestion is we can use multi-device (which by default would mean multiple devices on a single machine) and multi-machine training. Both fall under the umbrella of distributed training. I'm assuming there is no separate concept of multi-device, multi-machine training (i.e., host1-gpu0, host1-gpu1, host2-gpu0, host2-gpu1)?

@aaronmarkham (Contributor) commented Jan 5, 2018:

What about?

Worker Communication via the KVStore API

MXNet provides a key-value store, which is a critical component used in distributed training. The communication of parameters across processors on a single machine, as well across multiple machines, is relayed through one or more servers with a key-value store for the parameters. MXNet's KVStore API is the interface for workers to communicate with the one or more servers holding the key-value store. {now start with...} Workers push gradients...

I'm left wondering though - what's the worker-to-device mapping (is an 8-gpu system one worker or 8)? Do the params get rolled up on a node with multiple gpus before going to the server? Or does each device count as a worker and talks to a server directly? Also, since there seems to be support for multiple servers, it's not clear if the workers talk to all of them or if there's something else going on...

@rahul003 (Member, Author) replied:

@eric-haibin-lin @pracheer Ya, multi-device seems appropriate

@aaronmarkham I've modified the paragraph as per your suggestion and added a line about what happens when there are multiple devices on a single machine. I've also moved up the section which describes the distribution of keys on servers
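To make the push/pull wording concrete, here is a short hedged sketch of the KVStore API being discussed (a 'local' store is used so the snippet runs standalone; in distributed training the store would be created with 'dist_sync' or 'dist_async'):

import mxnet as mx

kv = mx.kvstore.create('local')    # 'dist_sync' or 'dist_async' in distributed training
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))      # initialize the value stored for key 3
kv.push(3, mx.nd.ones(shape) * 2)  # workers push gradients for key 3
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                # workers pull the aggregated result
print(out.asnumpy())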

@pracheer (Contributor) left a comment:

In general, this reads really well.
Just a few minor nitpicks.

Currently, MXNet supports Model parallelism in a single machine only. Refer [Training with multiple GPUs using model parallelism](https://mxnet.incubator.apache.org/versions/master/how_to/model_parallel_lstm.html) for more on this.

## How does distributed training work?
The architecture of distributed training in MXNet is as follows:
Contributor commented:

nitpick: "architecture" word here is a bit off since where you go on from there are more like concepts involved in distributed training. So do you think "concepts involved in distributed training in MXNet are as follows" is more appropriate?

@rahul003 (Member, Author) replied:

'The following concepts are key to understanding distributed training in MXNet'

WDYT?


* [How can I train with multiple CPU/GPUs with data parallelism?](http://mxnet.io/how_to/multi_devices.html)
* [How can I train with multiple CPU/GPUs on a single machine with data parallelism?](http://mxnet.io/how_to/multi_devices.html)

* [How can I train using multiple machines with data parallelism?](http://mxnet.io/how_to/distributed_training.html)

* [How can I train with multiple GPUs with model parallelism?](http://mxnet.io/how_to/model_parallel_lstm.html)
Contributor commented:

probably a nitpick: multiple GPUs on a single machine, or multiple machines, or does it not matter?

@rahul003 (Member, Author) replied:

I think multiple machines is okay

When the array size is bigger than this threshold, `MXNET_KVSTORE_REDUCTION_NTHREADS` threads are used for reduction.
This parameter is also used as a load balancer in kvstore.
It controls when to partition a single weight to all the servers.
If the size of a single weight is less than this bound, then it is sent to a single randomly picked server; otherwise, it is partitioned to all the servers.
Contributor commented:

you mean single weight matrix?

@rahul003 (Member, Author) replied:

ya, changed

- `PS_VERBOSE=1` logs connection information like the IPs and ports of all nodes
- `PS_VERBOSE=2` logs all data communication information

When the network is unreliable, messages being sent from one node to another might get lost.
Contributor commented:

really minor nitpick: adding another blank line before this paragraph starts may make it more obvious that the paragraph is associated with the two variables PS_RESEND and PS_RESEND_TIMEOUT that follow it.

@rahul003 (Member, Author) replied:

ok
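As an aside (not part of the tutorial text), these ps-lite settings are plain environment variables, so one hedged way to experiment with them is to set them in each node's process environment before the kvstore is created, for example:

import os

os.environ['PS_VERBOSE'] = '1'            # log connection info (IPs and ports of all nodes)
os.environ['PS_RESEND'] = '1'             # resend messages that may have been lost
os.environ['PS_RESEND_TIMEOUT'] = '1000'  # wait this many milliseconds for an ACK before resending

# Importing mxnet after setting the variables ensures they are in place
# before any native initialization in this process.
import mxnet as mx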

See [environment variables](#environment-variables) for more details.

#### Gradient compression
When communication cost is expensive, and the ratio of computation time to communication time is low, communication can become a bottleneck.
Contributor commented:

minor nitpick: "When communication cost is expensive"

@rahul003 (Member, Author) replied:

good catch
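For reference, the gradient compression switch this section of the tutorial describes is exposed on the kvstore; a minimal sketch (the threshold value is illustrative, not a recommendation):

import mxnet as mx

kv = mx.kvstore.create('dist_sync')
# Enable 2-bit gradient compression; gradient values smaller than the threshold
# are accumulated locally until they grow large enough to be sent.
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})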

An easy way to set up a cluster of EC2 instances for distributed deep learning is by using the [AWS CloudFormation template](https://github.com/awslabs/deeplearning-cfn).
If you can not use the above, this section will help you manually set up a cluster of instances
to enable you to use `ssh` for launching a distributed training job.
Let us denote one machine as the `master` of the cluster, through which we will launch and monitor the distributed training on all machines.
Contributor commented:

really really really really nitpicky: cluster , through (no comma)
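Since the setup described here is about launching distributed training over ssh from the master machine, a quick connectivity sanity check (assuming passwordless ssh keys are set up and a hosts file with one hostname or IP per line, as launch.py's ssh launcher expects) could look like this:

import subprocess

# Try a no-op command on every host; BatchMode makes ssh fail instead of
# prompting for a password, so missing keys show up immediately.
with open('hosts') as f:
    hosts = [line.strip() for line in f if line.strip()]

for h in hosts:
    ret = subprocess.call(['ssh', '-o', 'BatchMode=yes', h, 'true'])
    print('%s: %s' % (h, 'ok' if ret == 0 else 'ssh failed'))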

@piiswrong (Contributor):

any updates on this?

@rahul003 (Member, Author):

Addressed comments and updated the PR

@eric-haibin-lin (Member):

@aaronmarkham @pracheer is this good to go?

@rahul003 rahul003 requested a review from szha as a code owner January 26, 2018 18:20
@eric-haibin-lin (Member):

@rahul003 could you fix the conflicts?

@rahul003 (Member, Author):

Fixed

@TaoLv (Member) commented Jan 31, 2018:

Is there any content about how to profile the time of operator execution and communication, and how well they are overlapped?

@CodingCat (Contributor):

Hi, the community has passed a vote to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E).

We have updated the guidelines for contributors in https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as "[MXNET-xxxx] your title", where MXNET-xxxx is the JIRA id.

Thanks!

@rahul003 (Member, Author) commented Mar 7, 2018:

@TaoLv Sorry I missed your comment. You can profile the worker process similarly to the single-machine case.

import mxnet as mx

kv = mx.kvstore.create('dist_sync')  # the distributed kvstore used for training
# Use the worker's rank in the filename so each worker writes its own profile.
mx.profiler.profiler_set_config(mode='all', filename=str(kv.rank) + 'profile_output.json')
mx.profiler.profiler_set_state('run')
# Code to be profiled goes here...
mx.profiler.profiler_set_state('stop')

Note the use of rank above, which ensures that each worker saves its profile to a different path.

In the profile output, you can look for the KVStore push/pull operators to see the time taken for communication.

I'll add a proper section for profiling to the tutorial once #9933 is merged. That makes it easy to profile the server processes too.

@rahul003 changed the title from "tutorial for distributed training" to "[MXNET-37] tutorial for distributed training" on Mar 7, 2018
@TaoLv (Member) commented Mar 8, 2018:

@rahul003 Thanks for your reply. I will try that. It seems some links in this tutorial are broken:
Training with multiple GPUs using model parallelism
Using data from S3 for training

Another minor suggestion: I think it would be better if you could add some pictures to show the architecture or scalability of the parameter server.

@TaoLv (Member) commented Mar 22, 2018:

@rahul003 Another question: is distributed training also supported by other frontend languages? It seems this tutorial only describes how to run it with Python.

@piiswrong (Contributor):

@rahul003 ping

@rahul003 (Member, Author) commented Apr 3, 2018:

Honestly, I've not used other language bindings, and that can be something we can add in the future. I think we can merge this in.

@anirudh2290 (Member):

@piiswrong can we merge this?

@piiswrong piiswrong merged commit 24362d0 into apache:master Apr 10, 2018
rahul003 added a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* initial draft

* WIP tutorial

* WIP tutorial, with mnist script changes

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* WIP tutorial, with mnist script changes

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* use logger

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* remove from old page

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* first draft of tutorial
and removing pythonpath inserts for get_data, by moving them to test_utils

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* fix typos

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* rename functions

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* small change in section heading

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* fix reimport

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* Update distributed_training.md

* Update distributed_training.md

Punctuation and minor changes

* fix gluon iterators
and address some review comments

Signed-off-by: Rahul <rahulhuilgol@gmail.com>

* Update multi_devices.md

* Update distributed_training.md

indentation change

* Update distributed_training.md

cmake instruction

* retain only doc changes

* comments addressed

* fix link of gradient compression page

* clarifying launch.py usage

* update env var info

* update broken links

* update comment on splitting data
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018