This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Limit the test_nccl to run on 8 GPUs only until NCCL2.1 issue is fixed. #9367

Closed
wants to merge 11 commits

Conversation

leleamol
Contributor

Description

This change limits the test_nccl.py script to run on 8 GPUs only. There is a bug in NCCL 2.1 that causes a segmentation fault when the test is run on 16 GPUs.

Checklist

Essentials

  • [x] Passed code style checking (make lint)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@leleamol
Contributor Author

@ptrendx I have updated these tests so that we can run them on P2 instances until we have a fix in NCCL.

@marcoabreu
Contributor

marcoabreu commented Jan 12, 2018

FYI, we only have 2 GPUs installed in CI. The instance type is G3.8xlarge.

if num_gpus > 8:
    num_gpus = 8

gpus = range(1, 1 + num_gpus)
Contributor

Please print a warning to the log so we're actively reminded as soon as we switch to new instances.

Contributor Author

Thanks @marcoabreu for the review! I have added the print statement. The output of the test is given below.
Regarding the machine with 2 GPUs: the test caps at 8 GPUs only if the machine has more than 8 GPUs, so it should pass on a machine with 2 GPUs.

Test output:

gpu]$ python test_nccl.py
The machine has 16 gpus. We will run the test on 8 gpus only due to a bug in NCCL 2.1.
Please remove this limitation when issue #9004 is fixed.
Passed
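
For context, a minimal sketch of the cap-plus-warning logic being discussed (hypothetical; the variable names follow the snippet under review, and detecting GPUs via mx.test_utils.list_gpus() is an assumption, not necessarily what test_nccl.py does):

import mxnet as mx

# Assumed GPU detection; test_nccl.py may obtain num_gpus differently.
num_gpus = len(mx.test_utils.list_gpus())

if num_gpus > 8:
    print("The machine has %d gpus. We will run the test on 8 gpus only "
          "due to a bug in NCCL 2.1." % num_gpus)
    print("Please remove this limitation when issue #9004 is fixed.")
    num_gpus = 8

gpus = range(1, 1 + num_gpus)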

Contributor

Great, thanks a lot for the detailed explanation and incorporating the change!

@ptrendx
Member

ptrendx commented Jan 13, 2018

Hi, sorry for replying late.

So there is a little bit of a misunderstanding here, it seems. The fix to the bug in NCCL 2.1 you are referring to will not make this test work on a p2 instance with 16 GPUs (even though the actual NCCL integration works well in that scenario). It should only make the test fail gracefully instead of segfaulting.

There is a limit to the number of P2P peers any given GPU may have (it is a hardware limit for all PCI-E hardware, not just GPUs): it is 8. Under normal circumstances NCCL works fine on p2 instances with 16 GPUs under the same PCI-E tree because it uses a ring for communication, so each GPU establishes P2P access only with its neighbors in the ring.

So why is the test broken then? Because (being a test) it does a sweep over multiple configurations (in this case ring sizes), going from 1 to N (= number of GPUs in a node). During this sweep it produces rings as follows:
0
0 - 1 (gpu 0 opens P2P connection to GPU 1)
0 - 1 - 2 (gpu 0 opens P2P connection to GPU 1 and 2 - it is a ring)
0 - 1 - 2 - 3 (gpu 0 opens P2P connection to GPU 1 and 3)
...
As you can see, during the execution of the for loop over test cases, GPU 0 opens P2P connections to GPUs 1,2,3,... and at some point reaches its limit of 8 peers (during ring 1 - .... - 9) and then fails during the 1 - ... - 10 ring test case. This is expected behavior (unfortunately).
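
To make the failure mode concrete, here is a small illustration (not code from the PR) that counts how many distinct P2P peers GPU 0 accumulates as the naive sweep grows the ring on a 16-GPU node:

# Illustration only: in a ring 0 - 1 - ... - (n-1), GPU 0's neighbors are
# GPU 1 and GPU n-1, so every new ring size adds one more P2P peer for GPU 0.
P2P_PEER_LIMIT = 8  # per-GPU PCI-E P2P limit described above

peers_of_gpu0 = set()
for n_gpus in range(2, 17):        # ring sizes swept by the test
    ring = list(range(n_gpus))     # 0 - 1 - ... - (n_gpus - 1)
    peers_of_gpu0.update({ring[1], ring[-1]})
    status = "ok" if len(peers_of_gpu0) <= P2P_PEER_LIMIT else "exceeds limit"
    print("ring size %2d: GPU 0 has %2d peers (%s)"
          % (n_gpus, len(peers_of_gpu0), status))

The printed counts cross the 8-peer limit exactly at the 10-GPU ring, matching the failure point described above.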

To fully fix this test (which you can enable right now, btw, you don't need to wait for the next version of NCCL to do that) you would need to either:

  • split the test into 2 very similar tests: one covering the range from 1 to 8, and the second from 9 to 16 (which is very simple: just copy the test_nccl.py file, and in the second copy remove the limiter on num_gpus and do gpus=range(9, 1 + num_gpus) instead of gpus=range(1, 1 + num_gpus))
  • make the test choose the GPU assignment to the ring in a smarter way, like this (with 16 gpus):
    0
    0 - 1
    15 - 0 - 1
    15 - 0 - 1 - 2
    14 - 15 - 0 - 1 - 2
    ...
    which would ensure that each gpu opens only up to 4 p2p connections.

Does this clarify the confusion? @leleamol
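
As an illustration of the second option above, here is a hypothetical sketch (not code from the PR or from NCCL) of one way to generate the "centered" ring ordering:

def centered_ring(ring_size, total_gpus=16):
    # Grow the ring alternately to the right of GPU 0 (1, 2, ...) and to the
    # left (15, 14, ...), so over the sweep each GPU opens roughly half as
    # many P2P connections as with the naive 0..n-1 assignment.
    right = ring_size // 2 + 1    # GPUs 0, 1, 2, ...
    left = ring_size - right      # GPUs 15, 14, ...
    return [(total_gpus - i) % total_gpus for i in range(left, 0, -1)] + list(range(right))

# Reproduces the ordering in the example above:
# [0], [0, 1], [15, 0, 1], [15, 0, 1, 2], [14, 15, 0, 1, 2], ...
for n in range(1, 6):
    print(centered_ring(n))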

@leleamol
Contributor Author

leleamol commented Jan 13, 2018

@ptrendx Thanks for the explanation. Before sending out this code for review, I had tried to invoke the test_pushpull() function in 2 batches, viz. range(1..8) for the 0th GPU and range(9..16) for the 15th GPU. However, the test core dumps when it tries to process the second batch.
Having 2 scripts for a higher number of GPUs seems cumbersome if we want to put them in automation.
We can continue with limiting the test to 8 GPUs only. I will modify the print statement to indicate that it is a hardware limitation.
Does that sound ok?

Here is how I temporarily modified my script to invoke the "for" loop in 2 batches. It core dumps in the second loop.

import mxnet as mx
import numpy as np

# shapes, keys and num_gpus are assumed to be defined earlier in the script,
# as in the original test_nccl.py.
m_gpus = range(1, 1 + num_gpus)

#@unittest.skip("Test requires NCCL library installed and enabled during build")
def test_nccl_pushpull(gpus):
    # First batch: rings anchored on GPU 0, sizes taken from the gpus argument.
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(0))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("First Passed")
    # Second batch: ring sizes 9..15, initialized on GPU 15. This is the part
    # that core dumps.
    gpus = range(9, 16)
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(15))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("Passed")

if __name__ == '__main__':
    test_nccl_pushpull(m_gpus)

szha and others added 6 commits January 13, 2018 13:54
1) fixed viz routines to be compatible with current symbol json.
2) fixed two flaky tests.
…che#8732)

* comm device for rsp push and pull

* update

* update test

* optimization for same row_ids

* add stream->wait

* remove using space

* fix race of rsc and extend ElementwiseSum to rsp cases

* add log fatal in ElementwiseSum

* direct copy rows if full rsp and put all outputs on ctx of src

* trigger

* fix

* simplify copy

* move check same rowids to utils and add test for same rowids case

* remove direct copy row by row

* fix checkSameRowid

* gpu unique impl draft

* unique

* update

* fix windows build

* trigger windows build

* support single rowid with multiple vals

* address comments

* check same row_ids and copy in fronted

* revise names and disable test for local kvstore
…he#9404)

* use a more stable formula to calculate sigmoid_bce and logistic_loss

* fix docstring
Member

@eric-haibin-lin left a comment

Could you rebase and trigger the CI again?

@leleamol
Contributor Author

Closing this pull request as it includes changes that I haven't made, as a result of the rebase process.

@leleamol leleamol closed this Jan 15, 2018
@leleamol leleamol deleted the test-nccl-update branch January 15, 2018 19:42