Description
Hi,
I have two nodes, each with 1 A100 + 1 CX6 NIC. I just want to run tests/test_internode.py between those two nodes:
python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.3.40 --master_port=12345 tests/test_internode.py
python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.3.40 --master_port=12345 tests/test_internode.py
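For reference, my understanding of what these two commands set up (a minimal sketch, not DeepEP's actual test code) is a 2-rank NCCL process group with one rank per node:

```python
import torch.distributed as dist

# torch.distributed.run derives RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT from the flags above (--nnodes=2, --nproc_per_node=1),
# so each node contributes exactly one rank.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")  # world size is 2
```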
But I hit the issue below. It seems DeepEP can only run on setups with 8 GPUs per node; can you please confirm? And if I want to run DeepEP on my setup, is changing the macro NUM_MAX_NVL_PEERS to 1 enough?
self.runtime = deep_ep_cpp.Buffer(self.rank, self.group_size, num_nvl_bytes, num_rdma_bytes, low_latency_mode)
RuntimeError: Failed: Assertion error /home1/DeepEP/csrc/deep_ep.cpp:32 'num_ranks > NUM_MAX_NVL_PEERS or low_latency_mode'
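If I read csrc/deep_ep.cpp:32 correctly, the failing check is equivalent to the following (my paraphrase in Python; the NUM_MAX_NVL_PEERS value of 8 and the low_latency_mode default are my assumptions from reading the source):

```python
NUM_MAX_NVL_PEERS = 8      # compile-time default, if I read the source right

num_ranks = 2              # my setup: 2 nodes x 1 GPU each
low_latency_mode = False   # test_internode.py's normal mode, I believe

# With 2 ranks, 2 > 8 is False and low_latency_mode is False,
# so the Buffer constructor raises the RuntimeError above.
assert num_ranks > NUM_MAX_NVL_PEERS or low_latency_mode
```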
Looking forward to your feedback, and thanks a lot.
Qingsong