Limit test_nccl to run on 8 GPUs only until the NCCL 2.1 issue is fixed. #9367
Conversation
@ptrendx I have updated this test so that we can run it on P2 instances until we have a fix in NCCL.
FYI, we only have 2 GPUs installed in CI. The instance type is g3.8xlarge.
tests/python/gpu/test_nccl.py
Outdated
if num_gpus > 8:
    num_gpus = 8
gpus = range(1, 1 + num_gpus)
Please print a warning into the log so we're actively reminded as soon as we switch to new instances.
Thanks @marcoabreu for the review! I have added the print statement. The output of the test is given below.
Regarding the machine with 2 GPUs: the test caps at 8 GPUs only if the machine has more than 8 GPUs, so it should still pass on a machine with 2 GPUs.
Test output:
gpu]$ python test_nccl.py
The machine has 16 gpus. We will run the test on 8 gpus only due to a bug in NCCL 2.1.
Please remove this limitation when issue #9004 is fixed.
Passed
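For reference, a minimal sketch of what the capped guard plus the warning could look like (hypothetical, not the exact diff; it assumes num_gpus has already been detected earlier in test_nccl.py):

# Hypothetical sketch: cap the sweep and warn when more than 8 GPUs are present.
if num_gpus > 8:
    print("The machine has %d gpus. We will run the test on 8 gpus only "
          "due to a bug in NCCL 2.1." % num_gpus)
    print("Please remove this limitation when issue #9004 is fixed.")
    num_gpus = 8
gpus = range(1, 1 + num_gpus)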
Great, thanks a lot for the detailed explanation and incorporating the change!
Hi, sorry for replying late. So there is a little bit of misunderstanding here, it seems. The fix to the bug in NCCL 2.1 you are referring to will not make this test work on a p2 instance with 16 GPUs (even though the actual NCCL integration works well in that scenario). It should only make the test fail gracefully instead of segfaulting.

There is a limit to the number of P2P peers any given GPU may have (it is a hardware limit for all PCI-E hardware, not just GPUs): it is 8. Under normal circumstances NCCL works fine on p2 instances with 16 GPUs under the same PCI-E tree, because it uses a ring for communication, so each GPU establishes P2P access only with its neighbors in the ring.

So why is the test broken then? Because (being a test) it does a sweep over multiple configurations (in this case ring sizes), going from 1 to N (= number of GPUs in a node). During this sweep it produces rings as follows:

To fully fix this test (which you can enable right now, by the way; you don't need to wait for the next version of NCCL to do that) you would need to either:
Does this clarify the confusion? @leleamol
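To make the sweep argument above concrete, here is a hypothetical peer-counting sketch (not part of the test); it assumes a ring of size n uses GPUs 0..n-1 and that each GPU sets up P2P access only with its two ring neighbors:

# Hypothetical illustration: count the distinct P2P peers each GPU accumulates
# when sweeping ring sizes 1..num_gpus (a size-1 "ring" needs no peers).
def peers_after_sweep(num_gpus):
    peers = {g: set() for g in range(num_gpus)}
    for n in range(2, num_gpus + 1):
        for g in range(n):
            peers[g].add((g + 1) % n)  # next neighbor in the ring
            peers[g].add((g - 1) % n)  # previous neighbor in the ring
    return {g: len(p) for g, p in peers.items()}

print(peers_after_sweep(16)[0])  # 15 distinct peers for GPU 0, above the 8-peer P2P limit
print(peers_after_sweep(8)[0])   # 7 distinct peers for GPU 0, within the limit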
@ptrendx Thanks for the explanation. Before sending out this code for review, I had tried to invoke the test_pushpull() function in 2 batches, viz. range(1..8) for the 0th GPU and range(9..16) for the 15th GPU. However, the test core dumps when it tries to process the second batch. Here is how I temporarily modified my script to invoke the "for" loop in 2 batches; it core dumps in the second loop.

m_gpus = range(1, 1 + num_gpus)
#@unittest.skip("Test requires NCCL library installed and enabled during build")
def test_nccl_pushpull(gpus):
    # shapes, keys and num_gpus are assumed to be defined earlier in test_nccl.py
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(0))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("First Passed")

    # second batch: ring sizes 9..15; this is where the core dump happens
    gpus = range(9, 16)
    for shape, key in zip(shapes, keys):
        for n_gpus in gpus:
            kv_nccl = mx.kv.create('nccl')
            a = mx.nd.ones(shape, mx.gpu(15))
            cur_key = str(key * max(gpus) + n_gpus)
            kv_nccl.init(cur_key, a)
            arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
            res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
            kv_nccl.push(cur_key, arr_list)
            kv_nccl.pull(cur_key, res)
            for x in range(n_gpus):
                assert(np.sum(np.abs((res[x] - n_gpus).asnumpy())) == 0)
    print("Passed")

if __name__ == '__main__':
    test_nccl_pushpull(m_gpus)
1) fixed viz routines to be compatible with current symbol json. 2) fixed two flaky tests.
…che#8732) * comm device for rsp push and pull * update * update test * optimization for same row_ids * add stream->wait * remove using space * fix race of rsc and extend ElementwiseSum to rsp cases * add log fatal in ElementwiseSum * direct copy rows if full rsp and put all outputs on ctx of src * trigger * fix * simplify copy * move check same rowids to utils and add test for same rowids case * remove direct copy row by row * fix checkSameRowid * gpu unique impl draft * unique * update * fix windows build * trigger windows build * support single rowid with multiple vals * address comments * check same row_ids and copy in fronted * revise names and disable test for local kvstore
…he#9404) * use a more stable formula to calculate sigmoid_bce and logistic_loss * fix docstring
Could you rebase and trigger the CI again?
Closing this pull request as it includes changes that I haven't made, as a result of the rebase process.
Description
This change limits the test_nccl.py script to run on 8 GPUs only. There is a bug in NCCL 2.1 that causes a segmentation fault when the test is run on 16 GPUs.
Checklist
Essentials
make lint
Changes
Comments