[Bug]: The "sorted(gpu_ids)" operation in ray_gpu_executor.py produces an incorrect order of GPU IDs when using the NVIDIA HGX A100 (16-GPU) platform for model inference. #5590
Unable to obtain environmental information at the moment.
🐛 Describe the bug
In vllm/executor/ray_gpu_executor.py, line 142, if a node has more than 10 GPUs (for example, an NVIDIA HGX A100 with 16 GPUs), sorted(gpu_ids) compares the IDs as strings and returns 0,1,10,11,12,13,14,15,2,3,4,5,6,7,8,9 instead of 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15. This results in an NCCL error, because the order of GPUs in the Ray executor (lexicographic order) is inconsistent with the order of GPUs in NCCL (numerical order).
The fix is to sort the IDs numerically:
node_gpus[node_id] = sorted(gpu_ids, key=lambda x: int(x))
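A minimal sketch that reproduces the ordering difference outside vLLM. It assumes the GPU IDs arrive as strings, as the int(x) conversion in the fix implies; the gpu_ids list here is made up for illustration and is not taken from the executor code.

```python
# Hypothetical GPU IDs for a 16-GPU node, represented as strings (assumption).
gpu_ids = [str(i) for i in range(16)]

# Default sort on strings compares character by character,
# so "10" lands before "2".
lexicographic = sorted(gpu_ids)

# Sorting with an integer key restores numerical order.
# key=int is equivalent to key=lambda x: int(x).
numeric = sorted(gpu_ids, key=int)

print(",".join(lexicographic))
print(",".join(numeric))
```

With 10 or fewer GPUs every ID is a single digit, so both orderings match, which is why the problem only appears on larger nodes.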