[Bug]: The "sorted(gpu_ids)" operation in ray_gpu_executor.py causes an incorrect order of GPU IDs when using the NVIDIA HGX A100 (16-GPU) platform for model inference. #5590

@JiantaoXu

Your current environment

Unable to obtain environment information at the moment.

🐛 Describe the bug

In vllm/executor/ray_gpu_executor.py, line 142, if a node has more than 10 GPUs (such as an NVIDIA HGX A100 with 16 GPUs), sorted(gpu_ids) returns 0,1,10,11,12,13,14,15,2,3,4,5,6,7,8,9 instead of 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15. The lexicographic result indicates that the IDs are being compared as strings rather than as integers. This causes an NCCL error, because the GPU order in the Ray executor (lexicographic) is inconsistent with the GPU order in NCCL (numerical).
The sort should use a numeric key instead:
node_gpus[node_id] = sorted(gpu_ids, key=lambda x: int(x))
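
To make the difference concrete, here is a minimal standalone sketch that does not touch vLLM or Ray. It assumes the GPU IDs are strings, which is what the lexicographic ordering above implies; the list `gpu_ids` below is made up for illustration and is not taken from the executor.

```python
# Hypothetical GPU IDs for a 16-GPU node, represented as strings
# (assumption: the executor receives them in string form).
gpu_ids = [str(i) for i in range(16)]

# Default sort compares strings character by character, so "10" sorts before "2".
lexicographic = sorted(gpu_ids)

# Sorting with a numeric key restores the device order that NCCL expects.
# key=int is equivalent to key=lambda x: int(x).
numeric = sorted(gpu_ids, key=int)

print("lexicographic:", ",".join(lexicographic))
print("numeric:      ", ",".join(numeric))
```

With 10 or fewer GPUs every ID is a single character, so both sorts agree and the problem stays hidden. It only shows up once an ID has two digits.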

Labels: bug