Improving docs & utilities for distributed training example. (#3341)
leopd authored and mli committed Sep 20, 2016
1 parent 6267856 commit 992a6e2
Showing 4 changed files with 12 additions and 4 deletions.
3 changes: 3 additions & 0 deletions docs/how_to/build.md
@@ -142,6 +142,9 @@ various distributed filesystem such as HDFS/Amazon S3/...
#### Building with Intel MKL Support
First, run `source /path/to/intel/bin/compilervars.sh` to set the environment variables automatically. Then edit [make/config.mk](../../make/config.mk) and set `USE_BLAS = mkl`. `USE_INTEL_PATH = NONE` usually does not need to be changed.

#### Building for distributed training
To be able to run distributed training jobs, the `USE_DIST_KVSTORE=1` flag must be set. This enables a distributed
key-value store needed to share parameters between multiple machines working on training the same neural network.
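
A minimal build sketch for this new section (assuming the standard make-based build; the flag can also be set in `make/config.mk`, and `-j4` is only an example):

```bash
# Sketch: build MXNet with the distributed key-value store enabled.
# Equivalent to setting USE_DIST_KVSTORE = 1 in make/config.mk.
make -j4 USE_DIST_KVSTORE=1
```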

## Python Package Installation

4 changes: 2 additions & 2 deletions docs/how_to/multi_devices.md
@@ -30,7 +30,7 @@ can use a large batch size for multiple GPUs.

> To use GPUs, we need to compile MXNet with GPU support. For
> example, set `USE_CUDA=1` in `config.mk` before `make`. (see
> [build](../get_started/build.html) for more options).
> [MXNet installation guide](build.html) for more options).
If a machine has one or more GPU cards installed, then each card is
labeled by a number starting from 0. To use a particular GPU, one can often
@@ -131,7 +131,7 @@ information about these two data consistency models.
### How to Launch a Job

> To use distributed training, we need to compile with `USE_DIST_KVSTORE=1`
> (see [build](../get_started/build.html) for more options).
> (see [MXNet installation guide](build.html) for more options).
Launching a distributed job is a little bit different from running on a single
machine. MXNet provides
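For orientation, a minimal sketch of how such a distributed job is typically launched with `tools/launch.py`; the flag names (`-n`, `--launcher`, `-H`) and the training script are illustrative assumptions rather than part of this diff:

```bash
# Hypothetical example: launch 2 workers over SSH, using the distributed
# synchronous key-value store. Flags and script path are assumptions.
python tools/launch.py -n 2 --launcher ssh -H hosts \
    python example/image-classification/train_mnist.py --kv-store dist_sync
```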
2 changes: 1 addition & 1 deletion example/image-classification/README.md
@@ -93,7 +93,7 @@ We can train a model using multiple machines.
```

See more launch options, e.g. by `Yarn`, and how to write a distributed training
program on this [tutorial](http://mxnet.readthedocs.org/en/latest/distributed_training.html)
program on this [tutorial](http://mxnet.readthedocs.io/en/latest/how_to/multi_devices.html)

### Predict

7 changes: 6 additions & 1 deletion tools/launch.py
@@ -19,7 +19,12 @@ def dmlc_opts(opts):
            '--host-file', opts.hostfile,
            '--sync-dst-dir', opts.sync_dst_dir]
    args += opts.command;
    from dmlc_tracker import opts
    try:
        from dmlc_tracker import opts
    except ImportError:
        print("Can't load dmlc_tracker package. Perhaps you need to run")
        print("    git submodule update --init --recursive")
        raise
    dmlc_opts = opts.get_opts(args)
    return dmlc_opts
