This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve multi-GPU performance (#3241)
* update kvstore

* update model.py

* bandwidth tool

* update readme

* tiny

* fix lint

* fix batch size of dist_device_sync

* fix

* fix perf problem of kvstore when only using a single device

* roll back to previous strategy for how to choose update_on_kvstore

* add an optional MXNET_ENABLE_GPU_P2P to control whether or not to use p2p
mli authored Sep 13, 2016
1 parent f14ce88 commit 9dfb354
Showing 12 changed files with 780 additions and 347 deletions.
20 changes: 18 additions & 2 deletions docs/how_to/env_var.md
@@ -3,6 +3,8 @@ Environment Variables
MXNet has several settings that can be changed via environment variables.
Usually you do not need to change these settings, but they are listed here for reference.

## Set the number of threads

* MXNET_GPU_WORKER_NTHREADS (default=2)
- Maximum number of threads that do the computation job on each GPU.
* MXNET_GPU_COPY_NTHREADS (default=1)
@@ -11,6 +13,9 @@ Usually you do not need to change these settings, but they are listed here for reference.
- Maximum number of threads that do the CPU computation job.
* MXNET_CPU_PRIORITY_NTHREADS (default=4)
- Number of threads given to prioritized CPU jobs.
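These thread counts are generally read once at library startup, so they need to be in the environment before MXNet is loaded. A minimal sketch (the specific values are illustrative, not recommendations):

```python
import os

# Set thread counts before the first `import mxnet`; the engine reads
# these variables when it starts up. Values here are illustrative only.
os.environ["MXNET_GPU_WORKER_NTHREADS"] = "2"
os.environ["MXNET_GPU_COPY_NTHREADS"] = "1"
os.environ["MXNET_CPU_PRIORITY_NTHREADS"] = "4"
```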

## Memory options

* MXNET_EXEC_ENABLE_INPLACE (default=true)
- Whether to enable inplace optimization in symbolic execution.
* MXNET_EXEC_MATCH_RANGE (default=10)
@@ -20,17 +25,29 @@ Usually you do not need to change these settings, but they are listed here for reference.
- Maximum number of temp workspaces we can allocate to each device.
- Setting this to a small number can save GPU memory.
- It will also likely decrease the level of parallelism, which is usually OK.

## Engine type

* MXNET_ENGINE_TYPE (default=ThreadedEnginePerDevice)
- The type of underlying execution engine of MXNet.
- List of choices
- NaiveEngine: a very simple engine that uses the master thread to do computation.
- ThreadedEngine: a threaded engine that uses a global thread pool to schedule jobs.
- ThreadedEnginePerDevice: a threaded engine that allocates one thread per GPU.
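A common debugging pattern is to fall back to the serial engine, so each operation runs on the master thread and errors surface at the operation that caused them; as above, the variable must be set before MXNet is imported:

```python
import os

# Select the serial NaiveEngine for debugging; asynchronous scheduling
# is disabled, so failures point at the offending operation.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
```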

## Control the data communication

* MXNET_KVSTORE_REDUCTION_NTHREADS (default=4)
- Number of CPU threads used for summing of big arrays.
* MXNET_KVSTORE_BIGARRAY_BOUND (default=1e6)
- The minimum size of "big array".
- When the array size is bigger than this threshold, MXNET_KVSTORE_REDUCTION_NTHREADS threads will be used for reduction.
* MXNET_ENABLE_GPU_P2P (default=1)
- If true, MXNet will try to use GPU peer-to-peer communication, if available, when the kvstore's type is `device`.
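The two reduction settings above combine into one threshold rule. The sketch below is a hypothetical reimplementation of that documented rule, not MXNet's internal code:

```python
# Hypothetical sketch of the documented threshold rule: arrays larger
# than MXNET_KVSTORE_BIGARRAY_BOUND are summed with
# MXNET_KVSTORE_REDUCTION_NTHREADS threads; smaller arrays use one.
def reduction_threads(array_size, bigarray_bound=1_000_000, nthreads=4):
    return nthreads if array_size > bigarray_bound else 1

print(reduction_threads(4_000_000))  # → 4
print(reduction_threads(1_000))      # → 1
```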

## Others

* MXNET_CUDNN_AUTOTUNE_DEFAULT (default=0)
- The default value of cudnn_tune for convolution layers.
- Auto-tuning is turned off by default. Set to 1 to turn it on by default for benchmarking.
@@ -45,4 +62,3 @@ Settings for More GPU Parallelism
- Set ```MXNET_GPU_WORKER_NTHREADS``` to a larger number (e.g. 2)
- You may want to set ```MXNET_EXEC_NUM_TEMP``` to reduce memory usage.
- This may not speed things up, especially for image applications, because the GPU is usually fully utilized even with serialized jobs.

22 changes: 9 additions & 13 deletions python/mxnet/model.py
@@ -47,7 +47,7 @@ def _create_kvstore(kvstore, num_device, arg_params):
arg_params : dict of str to NDArray
Model parameter, dict of name to NDArray of net's weights.
"""

update_on_kvstore = True
if kvstore is None:
kv = None
elif isinstance(kvstore, kvs.KVStore):
@@ -58,21 +58,17 @@ def _create_kvstore(kvstore, num_device, arg_params):
# no need to use kv for single device and single machine
kv = None
else:
if kvstore == 'local':
# automatically select a proper local
max_size = max(np.prod(param.shape) for param in arg_params.values())
if max_size < 1024 * 1024 * 16:
kvstore = 'local_update_cpu'
else:
kvstore = 'local_allreduce_cpu'
logging.info('Auto-select kvstore type = %s', kvstore)
kv = kvs.create(kvstore)
if kvstore == 'local':
# automatically select a proper local
max_size = max(np.prod(param.shape) for param in
arg_params.values())
if max_size > 1024 * 1024 * 16:
update_on_kvstore = False
else:
raise TypeError('kvstore must be KVStore, str or None')

# detect whether or not update weight on kvstore
update_on_kvstore = True
if not kv or 'local_allreduce' in kv.type:
if kv is None:
update_on_kvstore = False

return (kv, update_on_kvstore)
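The selection rule that this diff introduces can be summarized in a standalone sketch. The helper below is hypothetical and only mirrors the documented behavior: with a `'local'` kvstore, weight updates move off the kvstore once the largest parameter exceeds 16M elements, and with no kvstore at all, updates happen on the devices.

```python
import numpy as np

# Hypothetical mirror of the rule in _create_kvstore above,
# for illustration only.
def choose_update_on_kvstore(kvstore_type, param_shapes):
    if kvstore_type is None:
        return False
    if kvstore_type == 'local':
        max_size = max(np.prod(shape) for shape in param_shapes)
        if max_size > 1024 * 1024 * 16:
            return False
    return True

print(choose_update_on_kvstore('local', [(256, 256)]))    # → True
print(choose_update_on_kvstore('local', [(4096, 4100)]))  # → False
```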
@@ -765,7 +761,7 @@ def fit(self, X, y=None, eval_data=None, eval_metric='acc',
# init optimizer
if isinstance(self.optimizer, str):
batch_size = data.batch_size
if kvstore and kvstore.type == 'dist_sync':
if kvstore and 'dist' in kvstore.type and '_async' not in kvstore.type:
batch_size *= kvstore.num_workers
optimizer = opt.create(self.optimizer,
rescale_grad=(1.0/batch_size),
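The gradient rescaling this hunk changes can be sketched as a small helper (hypothetical, for illustration): under synchronous distributed kvstores (`'dist'` types that are not `'_async'`), the effective batch size spans all workers, so gradients are scaled down by the global batch size.

```python
# Hypothetical sketch of the rescale_grad computation above.
def grad_rescale(batch_size, kvstore_type=None, num_workers=1):
    if kvstore_type and 'dist' in kvstore_type and '_async' not in kvstore_type:
        batch_size *= num_workers
    return 1.0 / batch_size

print(grad_rescale(32))                  # → 0.03125
print(grad_rescale(32, 'dist_sync', 4))  # → 0.0078125
```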