
[MXNET-331] Single machine All Reduce Topology-aware Communication (Updated) #1

Closed
wants to merge 86 commits

Conversation


@ctcyang ctcyang commented Jul 6, 2018

Description

Single machine All Reduce Topology-aware Communication

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • The proposed communication method shows a speed-up over both existing methods (parameter server and NCCL) at small batch sizes for ResNet-50, VGG-16, Inception-v3 and AlexNet.
  • The communication method queries the single-machine multi-GPU link topology and determines a suitable communication pattern to use.
  • The feature is activated by setting the environment variable MXNET_KVSTORE_USETREE=1. It is turned off by default.
  • In the future, an auto-tuner will be added to choose automatically among the single-machine communication protocols (parameter server, NCCL, and the method proposed here).
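The tree-based all-reduce the changes above describe can be sketched in plain Python. This is a simplified, hypothetical model (the function name and the use of Python lists in place of NDArrays are illustrative, not the PR's actual implementation): gradients are reduced up a binary tree of GPUs and the result is broadcast back down, whereas the real code builds the tree from the queried PCIe/NVLink topology.

```python
# Hypothetical sketch of tree-based all-reduce across GPUs.
# Each inner list stands in for one GPU's gradient buffer; the
# "tree" here is simply the GPU index order, not a real topology.

def tree_allreduce(grads):
    """Sum per-GPU gradients up a binary tree, then broadcast the
    total so every GPU holds an identical copy (simulated)."""
    n = len(grads)
    step = 1
    # Reduce phase: at each tree level, GPU i receives from GPU i + step.
    while step < n:
        for i in range(0, n - step, 2 * step):
            src = grads[i + step]
            grads[i] = [a + b for a, b in zip(grads[i], src)]
        step *= 2
    # Broadcast phase: the root (GPU 0) sends the result back down.
    total = grads[0]
    return [list(total) for _ in range(n)]

result = tree_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
# every "GPU" now holds the elementwise sum [16.0, 20.0]
```

To try the actual feature, set MXNET_KVSTORE_USETREE=1 in the environment before creating a `device` kvstore, as noted in the bullet above.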

Comments

Carl Yang and others added 30 commits June 4, 2018 03:51
* Fix flaky test - change float to double to increase the precision of numbers generated in Scala
zheng-da and others added 29 commits July 2, 2018 00:02
* batchnorm fall back.

* add test

* fix.
* Update test_kvstore_gpu.py

* Update test_kvstore_gpu.py
Increase API Level to 27 and update NDK to 17b
…1436)

* handle the case that inputs and outputs of a graph share NDArrays

* add test.

* test multiple times.

* don't change the state's array list.

* retrigger
* Compile with LAPACK shared library

* Add missing bracket

* Reorder LAPACK paths
…zation (#10429)

* give warning for variables with same name in graph visualization

* fix line too long

* print repeated node names

* update warning and unit test

* add assert for repeated node

* add graphviz for arm

* update docker install

* skip unittest if graphviz could not be imported

* optimize imports
…se PCI-E as fallback for GPUs that are not linked by NVLink
* Test input a graph.

* Update foreach to execute the subgraph.

* print inputs/outputs in foreach.

* Remove print.

* add test code for foreach.

* exec foreach outside the engine.

* Implements forward of foreach.

* Add support for variable numbers of inputs and outputs.

* Add a python wrapper for foreach.

* Fix the order of inputs.

* add test with lstm.

* hide C version of foreach.

* fix a bug temporarily.

* Test free variables.

* change for the new interface of InputGraph attribute.

* Add attribute to the subgraph.

* Handle free variables.

* Get all input symbols of a subgraph.

* Fix shape, dtype and storage inference.

* reorganize the output of foreach.

* Add a gluon RNN unroll with symbol foreach.

* remove unnecessary print.

* have imperative and symbolic foreach.

* Fix an error after moving foreach.

* Fix imperative foreach

* Fix a minor problem.

* Use CachedOp to execute subgraph.

* update TODO.

* make foreach op use FStatefulComputeEx.

TODO we need to change stateful executor to handle subgraph.

* Add backward.

* Fix bugs.

* enable backward test in lstm.

* Fix a bug in foreach backward for free variables.

* change for the new CachedOp.

* Detect the backward computation.

* Fix bugs in foreach.

* fix tests.

* update tests.

* check state shape.

* enable nested foreach.

* remove print.

* fix a bug in test.

* handle infer storage type for backward.

* address comments.

* address comments.

* move some common functions out.

* address comments.

* fix lint.

* Fix lint.

* add doc.

* undo modification in imperative.h

* add doc and remove example code.

* fix lint.

* fix lint.

* Fix lint.

* make nd.foreach and sym.foreach consistent.

* fix compile error.

* address comments.

* update.

* check for loop only works for dense arrays.

* move control flow op out of nn/

* fix include.

* add a test in gluon.

* work for GPU.

* small fix.

* remove subgraph_name

* create loop state for reuse in the future.

* move code.

* Revert "remove subgraph_name"

This reverts commit 977f5624ad0b0dedb9dcb8629f975afc56bb1e1a.

* cut graph.

* rename new var nodes.

* Fix tests.

* Fix bugs caused by ctypes (#29)

* Add save/load json in testcases for foreach (#30)

* support subgraph in stateful executor.

* Fix compilation.

* fix a bug when a subgraph has variable nodes.

* Fix a bug of getting symbols.

* copy var nodes.

* Fix getting op states.

* fix lint error.

* address comments.

* fix lint error.

* simplify the execution of subgraph in the main thread.

* fix lint error.

* avoid waiting for computation in each iteration.

* reuse cached op for inference.

* share memory across mini-batches.

* reuse memory.

reuse memory between iterations in inference.
reuse memory between mini-batches in training.

* add tests for multiple batches.

* remove entry.

* add benchmark for foreach.

* benchmark large batch size.

* Fix the benchmark for GPU.

* address comments.

* update shape/dtype/storage inference.

* update contrib API docs.

* support nested foreach.

* use a single CachedOp for all iterations.

* use large dim.

* update benchmark.

* update benchmark.

* update benchmark.

* update benchmark.

* return symbol arrays correctly in MXSymbolCutSubgraph.

* return symbol arrays in MXSymbolGetInputSymbols.

* fix lint error.

* use cachedop to infer storage in backward.

* fix scala API.

* update comments.

* fix scala.

* fix test.

* fix attribute name.

* move benchmark.

* fix the mapping of operator inputs/outputs and subgraph inputs/outputs.

* add tests for dtype/shape inference.

* reorganize tests.

* fix a bug of cutting NodeEntry.

When two node entries refer to the same output of a node, we should
create only one var node for these two node entries.
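The de-duplication rule in the commit above can be illustrated with a small cache keyed by the node entry. The function and variable names here are hypothetical, not taken from the MXNet sources: the point is only that two node entries referring to the same output of a node must map to one shared variable node.

```python
# Sketch: when cutting a subgraph, map each (node_id, output_index)
# entry to a single variable node, so repeated references to the same
# output share one var node instead of creating duplicates.
var_cache = {}

def var_for_entry(node_id, output_index):
    key = (node_id, output_index)
    if key not in var_cache:
        var_cache[key] = "var_%d_%d" % (node_id, output_index)
    return var_cache[key]

a = var_for_entry(3, 0)
b = var_for_entry(3, 0)   # same entry -> the same var node object
```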

* fix lint error.

* handle the case that outputs are inputs.

* handle the case that inputs aren't used.

* handle the case without output data.

* fix a bug in foreach backward.

* fix a bug when there isn't output data.

* Fix lint error.

* test diff Gluon RNN cells.

* test all symbol RNN cells.

* adjust the test precision.

* Fix a bug in getting a list of variable names.

We can't get a list of variable names from a hashtable. The order can't
be guaranteed. Python2 and Python3 output different orders.
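A minimal illustration of the ordering issue the commit above fixes (the dictionary contents are made up): iteration order over a hash-based container is not a portable contract between Python 2 and Python 3, so a deterministic order has to be imposed explicitly, e.g. by sorting the names.

```python
# Iteration order over a hashtable differed between Python versions,
# so variable names pulled from one must be sorted before use.
names = {"data": 0, "weight": 1, "bias": 2}
stable = sorted(names)  # deterministic on every interpreter
# stable == ["bias", "data", "weight"]
```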

* fix lint error.

* Test 1D array.

* fix a bug when subgraph inputs and outputs share NDArray.

* fix.

* fix

* add comments.
* use stable_sort to make it consistent between cpu/gpu

* fix testing logic

* fix

* remove comments
…time (AOT) (#11534)

- Do not check in generated files anymore
- Add to gitignore
…#11503)

* updating installation info to have latest packages and more clarity

* fix table

* using image for the table

* removed ubuntu 12 mention
* Remove 'vi ci-test.sh' file

* Add the scripts to reformat the style with cljfmt

- Add lein-cljfmt-check to each project
- Add lein-cljfmt-fix to each project

* Add lein-cljfmt to the plugins vector

* Add steps to keep style consistent to README.md

* Run lein-cljfmt-fix on the main src/test codes

- Run $MXROOT/contrib/clojure-package/lein-cljfmt-fix

* Run lein cljfmt fix on the example projects

- Run $MXNET_ROOT/contrib/clojure-package/examples/lein-cljfmt-fix

* Use only one script in the base directory.

- Thanks to @marcoabreu for the suggestion/review

* Minor update to kick off the new build
* Create Interface for Symbol and NDArray APIs, enable JavaDoc jar building for Scala Package.
I have checked that one doesn't need to compile MXNet from source to enable profiling. Installing the 1.2.0 version of MXNet is enough to run the code listed in that notebook. I checked it for both CPU and GPU contexts.
* Added Learning Rate Finder tutorial.

* Updated based on feedback.

* Reverting save_parameters changes.

* Adjusted based on feedback.

* Corrected outdated code comment.
* Manually check node existence in CachedOp

* Fix lint

* Trigger CI

* Improve readability, replace `numeric_limits::max` with `kEidNotExist`

* Add testcase

* Trigger CI

* Remove commented lines in unittests

* Trigger CI
@ctcyang ctcyang closed this Jul 6, 2018