Improving docs & utilities for distributed training example. (#3341)
leopd authored and mli committed Sep 20, 2016
1 parent 6267856 commit 992a6e2
Showing 4 changed files with 12 additions and 4 deletions.
3 changes: 3 additions & 0 deletions docs/how_to/build.md
@@ -142,6 +142,9 @@ various distributed filesystem such as HDFS/Amazon S3/...
#### Building with Intel MKL Support
First, run `source /path/to/intel/bin/compilervars.sh` to set the environment variables automatically. Then edit [make/config.mk](../../make/config.mk) and set `USE_BLAS = mkl`. `USE_INTEL_PATH = NONE` usually does not need to be changed.

#### Building for distributed training
To be able to run distributed training jobs, the `USE_DIST_KVSTORE=1` flag must be set. This enables a distributed
key-value store needed to share parameters between multiple machines working on training the same neural network.
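
A minimal build sketch for this new section (assuming the standard make-based build; the flag can also be set in `make/config.mk`, and `-j4` is only an example):

```bash
# Sketch: build MXNet with the distributed key-value store enabled.
# Equivalent to setting USE_DIST_KVSTORE = 1 in make/config.mk.
make -j4 USE_DIST_KVSTORE=1
```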

## Python Package Installation

4 changes: 2 additions & 2 deletions docs/how_to/multi_devices.md
@@ -30,7 +30,7 @@ can use a large batch size for multiple GPUs.

> To use GPUs, we need to compile MXNet with GPU support. For
> example, set `USE_CUDA=1` in `config.mk` before `make`. (see
> [build](../get_started/build.html) for more options).
> [MXNet installation guide](build.html) for more options).
If a machine has one or more GPU cards installed, then each card is
labeled by a number starting from 0. To use a particular GPU, one can often
@@ -131,7 +131,7 @@ information about these two data consistency models.
### How to Launch a Job

> To use distributed training, we need to compile with `USE_DIST_KVSTORE=1`
> (see [build](../get_started/build.html) for more options).
> (see [MXNet installation guide](build.html) for more options).
Launching a distributed job is a little bit different from running on a single
machine. MXNet provides
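For orientation, a minimal sketch of how such a distributed job is typically launched with `tools/launch.py`; the flag names (`-n`, `--launcher`, `-H`) and the training script are illustrative assumptions rather than part of this diff:

```bash
# Hypothetical example: launch 2 workers over SSH, using the distributed
# synchronous key-value store. Flags and script path are assumptions.
python tools/launch.py -n 2 --launcher ssh -H hosts \
    python example/image-classification/train_mnist.py --kv-store dist_sync
```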
2 changes: 1 addition & 1 deletion example/image-classification/README.md
@@ -93,7 +93,7 @@ We can train a model using multiple machines.
```

See more launch options, e.g. by `Yarn`, and how to write a distributed training
program on this [tutorial](http://mxnet.readthedocs.org/en/latest/distributed_training.html)
program on this [tutorial](http://mxnet.readthedocs.io/en/latest/how_to/multi_devices.html)

### Predict

7 changes: 6 additions & 1 deletion tools/launch.py
@@ -19,7 +19,12 @@ def dmlc_opts(opts):
            '--host-file', opts.hostfile,
            '--sync-dst-dir', opts.sync_dst_dir]
    args += opts.command;
    from dmlc_tracker import opts
    try:
        from dmlc_tracker import opts
    except ImportError:
        print("Can't load dmlc_tracker package. Perhaps you need to run")
        print("    git submodule update --init --recursive")
        raise
    dmlc_opts = opts.get_opts(args)
    return dmlc_opts
