This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Limit the test_nccl to run on 8 GPUs only until NCCL2.1 issue is fixed. #9367

Closed
wants to merge 11 commits

Conversation

leleamol
Contributor

Description

This change limits the test_nccl.py script to run on 8 GPUs only. There is a bug in NCCL 2.1 that causes a segmentation fault when the test is run on 16 GPUs.

Checklist

Essentials

  • [x] Passed code style checking (make lint)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@leleamol
Contributor Author

@ptrendx I have updated these tests so that we can run them on P2 instances until we have a fix in NCCL.

@marcoabreu
Contributor

marcoabreu commented Jan 12, 2018

FYI, we only have 2 GPUs installed in CI. The instance type is G3.8xlarge.

if num_gpus > 8:
    num_gpus = 8

gpus = range(1, 1 + num_gpus)
Contributor

Please print a warning to the log so we're actively reminded as soon as we switch to new instances.

Contributor Author

Thanks @marcoabreu for the review! I have added the print statement. The output of the test is given below.
Regarding the machine with 2 GPUs: the test caps at 8 GPUs only if the machine has more than 8 GPUs, so it should pass on a machine with 2 GPUs.

Test output:

gpu]$ python test_nccl.py
The machine has 16 gpus. We will run the test on 8 gpus only due to a bug in NCCL 2.1.
Please remove this limitation when issue #9004 is fixed.
Passed
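
For context, a minimal sketch of the cap-plus-warning logic being discussed (hypothetical; the variable names follow the snippet under review, and detecting GPUs via mx.test_utils.list_gpus() is an assumption, not necessarily what test_nccl.py does):

import mxnet as mx

# Assumed GPU detection; test_nccl.py may obtain num_gpus differently.
num_gpus = len(mx.test_utils.list_gpus())

if num_gpus > 8:
    print("The machine has %d gpus. We will run the test on 8 gpus only "
          "due to a bug in NCCL 2.1." % num_gpus)
    print("Please remove this limitation when issue #9004 is fixed.")
    num_gpus = 8

gpus = range(1, 1 + num_gpus)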

Contributor

Great, thanks a lot for the detailed explanation and incorporating the change!

@ptrendx
Member

ptrendx commented Jan 13, 2018

Hi, sorry for replying late.

So there is a little bit of a misunderstanding here, it seems. The fix to the bug in NCCL 2.1 you are referring to will not make this test work on a p2 instance with 16 GPUs (even though the actual NCCL integration works well in that scenario). It should only make the test fail gracefully instead of segfaulting.

There is a limit to the number of P2P peers any given GPU may have (it is a hardware limit for all PCI-E hardware, not just GPUs): it is 8. Under normal circumstances NCCL works fine on p2 instances with 16 GPUs under the same PCI-E tree because it uses a ring for communication, so each GPU establishes P2P access only with its neighbors in the ring.

So why is the test broken then? Because (being a test) it does a sweep over multiple configurations (in this case ring sizes), going from 1 to N (= number of GPUs in a node). During this sweep it produces rings as follows:
0
0 - 1 (gpu 0 opens P2P connection to GPU 1)
0 - 1 - 2 (gpu 0 opens P2P connection to GPU 1 and 2 - it is a ring)
0 - 1 - 2 - 3 (gpu 0 opens P2P connection to GPU 1 and 3)
...
As you can see, during the execution of the for loop over test cases, GPU 0 opens P2P connections to GPUs 1,2,3,... and at some point reaches its limit of 8 peers (during ring 1 - .... - 9) and then fails during the 1 - ... - 10 ring test case. This is expected behavior (unfortunately).
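
To make the failure mode concrete, here is a small illustration (not code from the PR) that counts how many distinct P2P peers GPU 0 accumulates as the naive sweep grows the ring on a 16-GPU node:

# Illustration only: in a ring 0 - 1 - ... - (n-1), GPU 0's neighbors are
# GPU 1 and GPU n-1, so every new ring size adds one more P2P peer for GPU 0.
P2P_PEER_LIMIT = 8  # per-GPU PCI-E P2P limit described above

peers_of_gpu0 = set()
for n_gpus in range(2, 17):        # ring sizes swept by the test
    ring = list(range(n_gpus))     # 0 - 1 - ... - (n_gpus - 1)
    peers_of_gpu0.update({ring[1], ring[-1]})
    status = "ok" if len(peers_of_gpu0) <= P2P_PEER_LIMIT else "exceeds limit"
    print("ring size %2d: GPU 0 has %2d peers (%s)"
          % (n_gpus, len(peers_of_gpu0), status))

The printed counts cross the 8-peer limit exactly at the 10-GPU ring, matching the failure point described above.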

To fully fix this test (which you can enable right now, btw, you don't need to wait for the next version of NCCL to do that) you would need to either:

  • split the test into 2 very similar tests: one covering the range from 1 to 8, and the second from 9 to 16 (which is very simple: just copy the test_nccl.py file, and in the second copy remove the limiter on num_gpus and do gpus=range(9, 1 + num_gpus) instead of gpus=range(1, 1 + num_gpus))
  • make the test choose the GPU assignment to the ring in a smarter way, like this (with 16 gpus):
    0
    0 - 1
    15 - 0 - 1
    15 - 0 - 1 - 2
    14 - 15 - 0 - 1 - 2
    ...
    which would ensure that each gpu opens only up to 4 p2p connections.

Does this clarify the confusion? @leleamol
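
As an illustration of the second option above, here is a hypothetical sketch (not code from the PR or from NCCL) of one way to generate the "centered" ring ordering:

def centered_ring(ring_size, total_gpus=16):
    # Grow the ring alternately to the right of GPU 0 (1, 2, ...) and to the
    # left (15, 14, ...), so over the sweep each GPU opens roughly half as
    # many P2P connections as with the naive 0..n-1 assignment.
    right = ring_size // 2 + 1    # GPUs 0, 1, 2, ...
    left = ring_size - right      # GPUs 15, 14, ...
    return [(total_gpus - i) % total_gpus for i in range(left, 0, -1)] + list(range(right))

# Reproduces the ordering in the example above:
# [0], [0, 1], [15, 0, 1], [15, 0, 1, 2], [14, 15, 0, 1, 2], ...
for n in range(1, 6):
    print(centered_ring(n))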

@leleamol
Contributor Author

leleamol commented Jan 13, 2018

@ptrendx Thanks for the explanation. Before sending out this code for review, I had tried to invoke the test_pushpull() function in 2 batches, viz. range(1..8) for the 0th GPU and range(9..16) for the 15th GPU. However, the test core dumps when it tries to process the second batch.
Having 2 scripts for a higher number of GPUs seems cumbersome if we want to put them in automation.
We can continue with limiting the test to 8 GPUs only. I will modify the print statement to indicate that it is a hardware limitation.
Does that sound ok?

Here is how I temporarily modified my script to invoke the "for" loop in 2 batches. It core dumps in the second loop.

import mxnet as mx
import numpy as np

# shapes, keys and num_gpus are assumed to be defined earlier in the script,
# as in the original test_nccl.py.
m_gpus = range(1, 1 + num_gpus)

#@unittest.skip("Test requires NCCL library installed and enabled during build")
def test_nccl_pushpull(gpus):
    # First batch: rings anchored on GPU 0, sizes taken from the gpus argument.
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(0))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("First Passed")
    # Second batch: ring sizes 9..15, initialized on GPU 15. This is the part
    # that core dumps.
    gpus = range(9, 16)
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(15))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("Passed")

if __name__ == '__main__':
    test_nccl_pushpull(m_gpus)

szha and others added 6 commits January 13, 2018 13:54
1) fixed viz routines to be compatible with current symbol json.
2) fixed two flaky tests.
…che#8732)

* comm device for rsp push and pull

* update

* update test

* optimization for same row_ids

* add stream->wait

* remove using space

* fix race of rsc and extend ElementwiseSum to rsp cases

* add log fatal in ElementwiseSum

* direct copy rows if full rsp and put all outputs on ctx of src

* trigger

* fix

* simplify copy

* move check same rowids to utils and add test for same rowids case

* remove direct copy row by row

* fix checkSameRowid

* gpu unique impl draft

* unique

* update

* fix windows build

* trigger windows build

* support single rowid with multiple vals

* address comments

* check same row_ids and copy in fronted

* revise names and disable test for local kvstore
…he#9404)

* use a more stable formula to calculate sigmoid_bce and logistic_loss

* fix docstring
Member

@eric-haibin-lin left a comment

Could you rebase and trigger the CI again?

@leleamol
Contributor Author

Closing this pull request as it includes changes that I haven't made, as a result of the rebase process.

@leleamol leleamol closed this Jan 15, 2018
@leleamol leleamol deleted the test-nccl-update branch January 15, 2018 19:42