This repository was archived by the owner on Nov 17, 2023. It is now read-only.
[MXNET-331] Single machine All Reduce Topology-aware Communication (Updated) (#11591)
* add multiroot all-reduce communication pattern
* fix bug with UpdateWeight
* fix PCI-E links appearing in weight matrix bug
* optimization to skip CopyFromTo in ReduceInner gains a bit of throughput
* remove unnecessary if statement
* Add tests
* add more tests, 6 tests left to add
* get rid of some dead code
* Add comments
* Add randomized tests for backtrack and kernighan-lin
* Fix Postprocess
* Add switch for first valid tree when num_gpus > 8, and for maximum weight when num_gpus <= 8
* Kernighan-Lin seems to find better trees
* get rid of printfs
* change defaults
* inherit from CommDevice instead of Comm
* Fix lint errors
* Add Python test using MXNET_KVSTORE_USETREE, fix CMake compilation problem, add header guard
* fix lint errors
* better header guard that works for tests
* get rid of unused variable warning
* retrigger jenkins
* resolve 2 comments
* address comment using Class to do test, get rid of extraneous test, use PCI-E as fallback for GPUs that are not linked by NVLink
* address comments
* fix a few bugs
* get rid of printfs
* get rid of print
* Comment out test for now
* fix 2 more bugs
* fix segfault
* change PrintVector, PrintTopo, PrintMatrix to LOG(INFO) instead of stdout
* Fix code alignment
* get rid of todo
* Make changes to env variable names to indicate they are TREE-related
* Add note saying when ARRAY_BOUND env var takes effect
   - When the array size is bigger than this threshold, MXNET_KVSTORE_REDUCTION_NTHREADS threads are used for reduction.
   - This parameter is also used as a load balancer in kvstore. It controls when to partition a single weight to all the servers. If the size of a single weight is less than MXNET_KVSTORE_BIGARRAY_BOUND, it is sent to a single randomly picked server; otherwise it is partitioned to all the servers.
+
+* MXNET_KVSTORE_USETREE
+  - Values: 0(false) or 1(true) ```(default=0)```
+  - If true, MXNet tries to use tree reduction for Push and Pull communication.
+  - Otherwise, MXNet uses the default Push and Pull implementation.
+  - [Tree reduction technology](http://www.sysml.cc/doc/178.pdf) has been shown to be faster than the standard ```--kv-store device``` Push/Pull and ```--kv-store nccl``` Push/Pull for small batch sizes.
+
+* MXNET_KVSTORE_LOGTREE
+  - Values: 0(false) or 1(true) ```(default=0)```
+  - If true and MXNET_KVSTORE_USETREE is set to 1, MXNet will log the reduction trees that have been generated.
+
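A minimal sketch of switching these two variables on from Python. The helper name `enable_tree_kvstore` is hypothetical and not part of MXNet. It assumes the backend reads the variables when the KVStore is created, so they should be set before that point.

```python
import os


def enable_tree_kvstore(env=None, log_trees=False):
    """Set the tree-reduction variables in `env` (defaults to os.environ).

    Hypothetical helper for illustration; MXNet itself only looks at the
    environment variables, not at this function.
    """
    env = os.environ if env is None else env
    env["MXNET_KVSTORE_USETREE"] = "1"
    # Logging the generated trees only has an effect when USETREE is 1.
    env["MXNET_KVSTORE_LOGTREE"] = "1" if log_trees else "0"
    return env


# Intended usage (requires an MXNet build with GPU support):
#   enable_tree_kvstore(log_trees=True)
#   import mxnet as mx
#   kv = mx.kv.create("device")
```

Passing a plain dict instead of `os.environ` makes it easy to inspect the settings without touching the real process environment.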
+* MXNET_KVSTORE_TREE_ARRAY_BOUND
+  - Values: Int ```(default=10000000)```
+  - The minimum size of a "big array".
+  - When the array size is bigger than this threshold and MXNET_KVSTORE_USETREE is set to 1, multiple trees are used to load balance the big gradient being communicated in order to better saturate link bandwidth.
+  - Note: This environment variable only takes effect if Tree KVStore is being used (MXNET_KVSTORE_USETREE=1).
+
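The rule above can be sketched as a small predicate. This is an illustration of the documented condition only, not MXNet's implementation; the function name `uses_multiple_trees` is hypothetical, and comparing the raw string against `"1"` is a simplifying assumption about how the flag is parsed.

```python
def uses_multiple_trees(array_size, env):
    """Return True if an array of `array_size` elements would be load
    balanced over multiple trees, per the documented rule."""
    # The bound is ignored unless tree reduction is switched on.
    use_tree = env.get("MXNET_KVSTORE_USETREE", "0") == "1"
    # Documented default for the "big array" threshold.
    bound = int(env.get("MXNET_KVSTORE_TREE_ARRAY_BOUND", "10000000"))
    return use_tree and array_size > bound
```

For example, with `MXNET_KVSTORE_USETREE=1` and the default bound, a 20,000,000-element gradient qualifies as a big array, while the same gradient with `MXNET_KVSTORE_USETREE=0` does not.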
+* MXNET_KVSTORE_TREE_BACKTRACK
+  - Values: 0(false) or 1(true) ```(default=0)```
+  - If true and MXNET_KVSTORE_USETREE is set to 1, MXNet tries to use backtracking to generate the trees required for tree reduction.
+  - If false and MXNET_KVSTORE_USETREE is set to 1, MXNet tries to use the Kernighan-Lin heuristic to generate the trees required for tree reduction.
+
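As a rough illustration of how the two flags combine, the selection described above could be written as follows. The function `tree_algorithm` and its return labels are hypothetical and only mirror the documented behaviour; they are not MXNet identifiers.

```python
def tree_algorithm(env):
    """Name the tree-generation method implied by the environment,
    or None when tree reduction is disabled."""
    if env.get("MXNET_KVSTORE_USETREE", "0") != "1":
        return None  # default Push/Pull implementation, no trees built
    if env.get("MXNET_KVSTORE_TREE_BACKTRACK", "0") == "1":
        return "backtracking"
    # Documented default when BACKTRACK is 0 or unset.
    return "kernighan-lin"
```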
+* MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY
+  - Values: Float ```(default=0.7)```
+  - The multiplicative penalty applied to the weight of a link once it has been used.
+
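To make the arithmetic concrete, here is a small sketch of a multiplicative penalty. It assumes the penalty compounds on every use of the link, which the text does not state explicitly, and the function name `penalized_weight` is hypothetical.

```python
def penalized_weight(weight, times_used, penalty=0.7):
    """Weight of a link after it has been used `times_used` times,
    assuming the penalty is applied once per use."""
    return weight * penalty ** times_used
```

Under that assumption, a link that starts with weight 1.0 drops to 0.7 after one use and to roughly 0.49 after two, so repeatedly used links become less attractive when further trees are built.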
 * MXNET_ENABLE_GPU_P2P
   - Values: 0(false) or 1(true) ```(default=1)```
   - If true, MXNet tries to use GPU peer-to-peer communication, if available on your device,