
[Bug][Rocm] Garbage Response from vLLM When Using Tensor Parallelism on AMD CPX/NPS4 Partitioned GPUs #20125

@Bihan


Your current environment

The output of python collect_env.py
The output file is attached.

vllm_collect_env_output.txt

🐛 Describe the bug

Steps to reproduce:
We followed the doc: Steps to Run a vLLM Workload on AMD partition.

  • Set the CPX/NPS4 partition
    sudo amd-smi set --memory-partition NPS4

  • Launch container
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:latest /bin/bash

  • Set env

    export HF_TOKEN=<token>
    export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

  • Serve the model

    vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 8

  • Query the model

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  --data '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Deep Learning?"}
    ]
  }'
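
For reference, the same query can also be sent with the OpenAI Python client (a minimal sketch, assuming the default server address used above and a server started without --api-key, in which case any placeholder key is accepted):

# Sketch: query the vLLM OpenAI-compatible server with the OpenAI Python client.
# Assumes `pip install openai`; base_url and api_key values here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
)
print(resp.choices[0].message.content)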

Actual behaviour:
Garbled, nonsensical output, as shown below:

{"id":"chatcmpl-5488b13e1910409d884196f041b34b0b","object":"chat.completion","created":1750923599,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Deep célibai8://%) the:// Bon the:// Bachelor the://

дост:// False capital://{ progress:// Barb n程�程-disable","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":128001}],"usage":{"prompt_tokens":46,"total_tokens":4602,"completion_tokens":4556,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

Expected behaviour:
A meaningful, coherent response.

Additional Information:

  1. vLLM works as expected with a single CPX partition.
    How to reproduce:
  • Launch a container with only 1 CPX partition (/dev/dri/renderD128)

docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device=/dev/dri/renderD128 rocm/vllm:latest /bin/bash

  • vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048

  • Query the model as above.

  2. This likely isn't a ROCm or hardware issue, but possibly a vLLM backend issue. We ran some sanity checks and validation to rule out issues with the hardware, ROCm, PyTorch, and RCCL:
  • Vanilla PyTorch matrix multiplication and all-reduce operations give the correct result.
    Here every process creates a 2x2 matrix on its own GPU, performs a matrix multiplication, and then performs an all-reduce operation, which combines the results across all GPUs. We verified the calculation and it is correct, so there should not be any accuracy issues in the distributed setup (see the sketch after the rccl-tests output below).

  • Launch a process on each partition using mpirun

    mpirun --allow-run-as-root \
      -np 1 -x ROCR_VISIBLE_DEVICES=0 python mpi_test.py : \
      -np 1 -x ROCR_VISIBLE_DEVICES=1 python mpi_test.py

    Each MPI rank sees only the partition assigned to it via ROCR_VISIBLE_DEVICES. This further confirms that partition isolation is functioning.

  • Finally, we ran rccl-tests across 8 partitions, and this works too.

./build/all_reduce_perf -b 2M -e 8M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 2097152 maxBytes 8388608 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#

rccl-tests: Version develop:b0a3841
# Using devices
#  Rank  0 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  0 [0000:1b:00] AMD Instinct MI300X
#  Rank  1 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  1 [0000:1b:00] AMD Instinct MI300X
#  Rank  2 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  2 [0000:1b:00] AMD Instinct MI300X
#  Rank  3 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  3 [0000:1b:00] AMD Instinct MI300X
#  Rank  4 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  4 [0000:1b:00] AMD Instinct MI300X
#  Rank  5 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  5 [0000:1b:00] AMD Instinct MI300X
#  Rank  6 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  6 [0000:1b:00] AMD Instinct MI300X
#  Rank  7 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  7 [0000:1b:00] AMD Instinct MI300X
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     2097152        524288     float     sum      -1    75.02   27.95   48.92      0    72.64   28.87   50.52      0
     4194304       1048576     float     sum      -1    93.79   44.72   78.26      0    89.37   46.93   82.13      0
     8388608       2097152     float     sum      -1    165.7   50.62   88.59      0    159.6   52.57   92.00      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 73.4045
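
For reference, a minimal sketch of the kind of script used for the matmul/all-reduce check above (the actual mpi_test.py is not attached to this issue, so the details below, including the rendezvous address and port, are assumptions):

# mpi_test.py (illustrative sketch; the real script is not attached to this issue)
# Each MPI rank sees only its own partition via ROCR_VISIBLE_DEVICES, so it uses cuda:0.
import os
import torch
import torch.distributed as dist

# Rank and world size come from the OpenMPI launcher environment.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))

# The NCCL backend maps to RCCL on ROCm builds of PyTorch.
# The rendezvous address and port are placeholders.
dist.init_process_group(backend="nccl",
                        init_method="tcp://127.0.0.1:29500",
                        rank=rank, world_size=world_size)

device = torch.device("cuda:0")            # only one partition is visible per rank
a = torch.full((2, 2), float(rank + 1), device=device)
b = torch.eye(2, device=device)

c = a @ b                                  # local matmul on this partition
dist.all_reduce(c, op=dist.ReduceOp.SUM)   # sum the products across all partitions

# Each element should equal 1 + 2 + ... + world_size.
expected = world_size * (world_size + 1) // 2
print(f"rank {rank}: got\n{c}\nexpected value per element: {expected}")

dist.destroy_process_group()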

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), rocm (Related to AMD ROCm)
