This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve multi-GPU performance (#3241)
* update kvstore

* update model.py

* bandwidth tool

* update readme

* tiny

* fix lint

* fix batch size of dist_device_sync

* fix

* fix perf problem of kvstore when only using a single device

* roll back to previous strategy for how to choose update_on_kvstore

* add an optional MXNET_ENABLE_GPU_P2P to control whether or not to use p2p
mli authored Sep 13, 2016
1 parent f14ce88 commit 9dfb354
Showing 12 changed files with 780 additions and 347 deletions.
20 changes: 18 additions & 2 deletions docs/how_to/env_var.md
@@ -3,6 +3,8 @@ Environment Variables
MXNet has several settings that can be changed via environment variables.
Usually you do not need to change these settings, but they are listed here for reference.

## Set the number of threads

* MXNET_GPU_WORKER_NTHREADS (default=2)
- Maximum number of threads that do the computation job on each GPU.
* MXNET_GPU_COPY_NTHREADS (default=1)
@@ -11,6 +13,9 @@ Usually you do not need to change these settings, but they are listed here for reference.
- Maximum number of threads that do the CPU computation job.
* MXNET_CPU_PRIORITY_NTHREADS (default=4)
- Number of threads given to prioritized CPU jobs.
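These thread counts are generally read once at library startup, so they need to be in the environment before MXNet is loaded. A minimal sketch (the specific values are illustrative, not recommendations):

```python
import os

# Set thread counts before the first `import mxnet`; the engine reads
# these variables when it starts up. Values here are illustrative only.
os.environ["MXNET_GPU_WORKER_NTHREADS"] = "2"
os.environ["MXNET_GPU_COPY_NTHREADS"] = "1"
os.environ["MXNET_CPU_PRIORITY_NTHREADS"] = "4"
```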

## Memory options

* MXNET_EXEC_ENABLE_INPLACE (default=true)
- Whether to enable inplace optimization in symbolic execution.
* MXNET_EXEC_MATCH_RANGE (default=10)
@@ -20,17 +25,29 @@ Usually you do not need to change these settings, but they are listed here for reference.
- Maximum number of temp workspaces we can allocate to each device.
- Setting this to a small number can save GPU memory.
- It will also likely decrease the level of parallelism, which is usually OK.

## Engine type

* MXNET_ENGINE_TYPE (default=ThreadedEnginePerDevice)
- The type of underlying execution engine of MXNet.
- List of choices
- NaiveEngine: a very simple engine that uses the master thread to do computation.
- ThreadedEngine: a threaded engine that uses a global thread pool to schedule jobs.
- ThreadedEnginePerDevice: a threaded engine that allocates one thread per GPU.
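A common debugging pattern is to fall back to the serial engine, so each operation runs on the master thread and errors surface at the operation that caused them; as above, the variable must be set before MXNet is imported:

```python
import os

# Select the serial NaiveEngine for debugging; asynchronous scheduling
# is disabled, so failures point at the offending operation.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
```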

## Control the data communication

* MXNET_KVSTORE_REDUCTION_NTHREADS (default=4)
- Number of CPU threads used for summing of big arrays.
* MXNET_KVSTORE_BIGARRAY_BOUND (default=1e6)
- The minimum size of "big array".
- When the array size is bigger than this threshold, MXNET_KVSTORE_REDUCTION_NTHREADS threads will be used for reduction.
* MXNET_ENABLE_GPU_P2P (default=1)
- If true, MXNet will try to use GPU peer-to-peer communication, if available, when the kvstore's type is `device`.
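The two reduction settings above combine into one threshold rule. The sketch below is a hypothetical reimplementation of that documented rule, not MXNet's internal code:

```python
# Hypothetical sketch of the documented threshold rule: arrays larger
# than MXNET_KVSTORE_BIGARRAY_BOUND are summed with
# MXNET_KVSTORE_REDUCTION_NTHREADS threads; smaller arrays use one.
def reduction_threads(array_size, bigarray_bound=1_000_000, nthreads=4):
    return nthreads if array_size > bigarray_bound else 1

print(reduction_threads(4_000_000))  # → 4
print(reduction_threads(1_000))      # → 1
```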

## Others

* MXNET_CUDNN_AUTOTUNE_DEFAULT (default=0)
- The default value of cudnn_tune for convolution layers.
- Auto-tuning is turned off by default. Set to 1 to turn it on by default for benchmarking.
@@ -45,4 +62,3 @@ Settings for More GPU Parallelism
- Set ```MXNET_GPU_WORKER_NTHREADS``` to a larger number (e.g. 2)
- You may want to set ```MXNET_EXEC_NUM_TEMP``` to reduce memory usage.
- This may not speed things up, especially for image applications, because the GPU is usually fully utilized even with serialized jobs.

22 changes: 9 additions & 13 deletions python/mxnet/model.py
@@ -47,7 +47,7 @@ def _create_kvstore(kvstore, num_device, arg_params):
arg_params : dict of str to NDArray
Model parameter, dict of name to NDArray of net's weights.
"""

update_on_kvstore = True
if kvstore is None:
kv = None
elif isinstance(kvstore, kvs.KVStore):
@@ -58,21 +58,17 @@ def _create_kvstore(kvstore, num_device, arg_params):
# no need to use kv for single device and single machine
kv = None
else:
if kvstore == 'local':
# automatically select a proper local
max_size = max(np.prod(param.shape) for param in arg_params.values())
if max_size < 1024 * 1024 * 16:
kvstore = 'local_update_cpu'
else:
kvstore = 'local_allreduce_cpu'
logging.info('Auto-select kvstore type = %s', kvstore)
kv = kvs.create(kvstore)
if kvstore == 'local':
# automatically select a proper local
max_size = max(np.prod(param.shape) for param in
arg_params.values())
if max_size > 1024 * 1024 * 16:
update_on_kvstore = False
else:
raise TypeError('kvstore must be KVStore, str or None')

# detect whether or not update weight on kvstore
update_on_kvstore = True
if not kv or 'local_allreduce' in kv.type:
if kv is None:
update_on_kvstore = False

return (kv, update_on_kvstore)
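The selection rule that this diff introduces can be summarized in a standalone sketch. The helper below is hypothetical and only mirrors the documented behavior: with a `'local'` kvstore, weight updates move off the kvstore once the largest parameter exceeds 16M elements, and with no kvstore at all, updates happen on the devices.

```python
import numpy as np

# Hypothetical mirror of the rule in _create_kvstore above,
# for illustration only.
def choose_update_on_kvstore(kvstore_type, param_shapes):
    if kvstore_type is None:
        return False
    if kvstore_type == 'local':
        max_size = max(np.prod(shape) for shape in param_shapes)
        if max_size > 1024 * 1024 * 16:
            return False
    return True

print(choose_update_on_kvstore('local', [(256, 256)]))    # → True
print(choose_update_on_kvstore('local', [(4096, 4100)]))  # → False
```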
@@ -765,7 +761,7 @@ def fit(self, X, y=None, eval_data=None, eval_metric='acc',
# init optimizer
if isinstance(self.optimizer, str):
batch_size = data.batch_size
if kvstore and kvstore.type == 'dist_sync':
if kvstore and 'dist' in kvstore.type and '_async' not in kvstore.type:
batch_size *= kvstore.num_workers
optimizer = opt.create(self.optimizer,
rescale_grad=(1.0/batch_size),
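The gradient rescaling this hunk changes can be sketched as a small helper (hypothetical, for illustration): under synchronous distributed kvstores (`'dist'` types that are not `'_async'`), the effective batch size spans all workers, so gradients are scaled down by the global batch size.

```python
# Hypothetical sketch of the rescale_grad computation above.
def grad_rescale(batch_size, kvstore_type=None, num_workers=1):
    if kvstore_type and 'dist' in kvstore_type and '_async' not in kvstore_type:
        batch_size *= num_workers
    return 1.0 / batch_size

print(grad_rescale(32))                  # → 0.03125
print(grad_rescale(32, 'dist_sync', 4))  # → 0.0078125
```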