
[Bug][Rocm] Garbage Response from vLLM When Using Tensor Parallelism on AMD CPX/NPS4 Partitioned GPUs #20125

@Bihan


Your current environment

The output of python collect_env.py
The output file is attached.

vllm_collect_env_output.txt

🐛 Describe the bug

Steps to reproduce:
We followed the doc: Steps to Run a vLLM Workload on AMD partition.

  • Set the CPX/NPS4 partition
    sudo amd-smi set --memory-partition NPS4

  • Launch container
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:latest /bin/bash

  • Set env

    export HF_TOKEN=<token>
    export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

  • Serve the model

    vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 8

  • Query the model

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  --data '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Deep Learning?"}
    ]
  }'
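
For reference, the same query can also be sent with the OpenAI Python client (a minimal sketch, assuming the default server address used above and a server started without --api-key, in which case any placeholder key is accepted):

# Sketch: query the vLLM OpenAI-compatible server with the OpenAI Python client.
# Assumes `pip install openai`; base_url and api_key values here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
)
print(resp.choices[0].message.content)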

Actual behaviour:
Garbled, nonsensical output, as shown below:

{"id":"chatcmpl-5488b13e1910409d884196f041b34b0b","object":"chat.completion","created":1750923599,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Deep célibai8://%) the:// Bon the:// Bachelor the://

дост:// False capital://{ progress:// Barb n程�程-disable","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":128001}],"usage":{"prompt_tokens":46,"total_tokens":4602,"completion_tokens":4556,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

Expected behaviour:
A meaningful, coherent response.

Additional Information:

  1. vLLM works as expected with a single CPX partition.
    How to reproduce:
  • Launch a container with only 1 CPX partition (/dev/dri/renderD128)

docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device=/dev/dri/renderD128 rocm/vllm:latest /bin/bash

  • vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048

  • Query the model as above.

  2. This likely isn't a ROCm or hardware issue, but possibly a vLLM backend issue. We ran some sanity checks and validation to rule out issues with the hardware, ROCm, PyTorch, and RCCL:
  • Vanilla PyTorch matrix multiplication and all-reduce operations give the correct result.
    Here every process creates a 2x2 matrix on its own GPU, performs a matrix multiplication, and then performs an all-reduce operation, which combines the results across all GPUs. We verified the calculation and it is correct, so there should not be any accuracy issues in the distributed setup (see the sketch after the rccl-tests output below).

  • Launch a process on each partition using mpirun

    mpirun --allow-run-as-root \
      -np 1 -x ROCR_VISIBLE_DEVICES=0 python mpi_test.py : \
      -np 1 -x ROCR_VISIBLE_DEVICES=1 python mpi_test.py

    Each MPI rank sees only the partition assigned to it via ROCR_VISIBLE_DEVICES. This further confirms that partition isolation is functioning.

  • Finally, we ran rccl-tests across 8 partitions, and this works too.

./build/all_reduce_perf -b 2M -e 8M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 2097152 maxBytes 8388608 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#

rccl-tests: Version develop:b0a3841
# Using devices
#  Rank  0 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  0 [0000:1b:00] AMD Instinct MI300X
#  Rank  1 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  1 [0000:1b:00] AMD Instinct MI300X
#  Rank  2 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  2 [0000:1b:00] AMD Instinct MI300X
#  Rank  3 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  3 [0000:1b:00] AMD Instinct MI300X
#  Rank  4 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  4 [0000:1b:00] AMD Instinct MI300X
#  Rank  5 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  5 [0000:1b:00] AMD Instinct MI300X
#  Rank  6 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  6 [0000:1b:00] AMD Instinct MI300X
#  Rank  7 Group  0 Pid   1158 on ENC1-CLS01-SVR07 device  7 [0000:1b:00] AMD Instinct MI300X
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     2097152        524288     float     sum      -1    75.02   27.95   48.92      0    72.64   28.87   50.52      0
     4194304       1048576     float     sum      -1    93.79   44.72   78.26      0    89.37   46.93   82.13      0
     8388608       2097152     float     sum      -1    165.7   50.62   88.59      0    159.6   52.57   92.00      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 73.4045
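
For reference, a minimal sketch of the kind of script used for the matmul/all-reduce check above (the actual mpi_test.py is not attached to this issue, so the details below, including the rendezvous address and port, are assumptions):

# mpi_test.py (illustrative sketch; the real script is not attached to this issue)
# Each MPI rank sees only its own partition via ROCR_VISIBLE_DEVICES, so it uses cuda:0.
import os
import torch
import torch.distributed as dist

# Rank and world size come from the OpenMPI launcher environment.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))

# The NCCL backend maps to RCCL on ROCm builds of PyTorch.
# The rendezvous address and port are placeholders.
dist.init_process_group(backend="nccl",
                        init_method="tcp://127.0.0.1:29500",
                        rank=rank, world_size=world_size)

device = torch.device("cuda:0")            # only one partition is visible per rank
a = torch.full((2, 2), float(rank + 1), device=device)
b = torch.eye(2, device=device)

c = a @ b                                  # local matmul on this partition
dist.all_reduce(c, op=dist.ReduceOp.SUM)   # sum the products across all partitions

# Each element should equal 1 + 2 + ... + world_size.
expected = world_size * (world_size + 1) // 2
print(f"rank {rank}: got\n{c}\nexpected value per element: {expected}")

dist.destroy_process_group()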

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), rocm (Related to AMD ROCm)
