Description
Your current environment
🐛 Describe the bug
Steps to reproduce:
We followed the doc "Steps to Run a vLLM Workload on AMD partition".
- Set the CPX/NPS4 partition mode:
sudo amd-smi set --memory-partition NPS4
- Launch the container:
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:latest /bin/bash
- Set the environment variables:
export HF_TOKEN=<token>
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- Start the vLLM server with tensor parallelism across all 8 partitions:
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 8
- Query the model:
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
--data '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Deep Learning?"}
]
}'
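For scripted reproduction, the same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the URL, model name, and messages are copied from the curl command above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, question):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, question, timeout=600):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, question)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running `vllm serve` instance on localhost:8000.
    print(ask("http://localhost:8000",
              "meta-llama/Llama-3.1-8B-Instruct",
              "What is Deep Learning?"))
```

Running the script against both the 8-partition and the single-partition server makes it easy to compare the two replies side by side.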
Actual behaviour:
Garbled, nonsensical output, as shown below:
{"id":"chatcmpl-5488b13e1910409d884196f041b34b0b","object":"chat.completion","created":1750923599,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Deep célibai8://%) the:// Bon the:// Bachelor the://
…
дост:// False capital://{ progress:// Barb n程�程-disable","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":128001}],"usage":{"prompt_tokens":46,"total_tokens":4602,"completion_tokens":4556,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
Expected behaviour:
Meaningful Response
Additional Information:
- vLLM works as expected with a single CPX partition.
How to reproduce:
- Launch the container with only one CPX partition (/dev/dri/renderD128):
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device=/dev/dri/renderD128 rocm/vllm:latest /bin/bash
- Start the vLLM server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 2048
- Query the model as above.
- This is likely not a ROCm or hardware issue, but possibly a vLLM backend issue. We ran the following sanity checks to rule out problems in the hardware, ROCm, PyTorch, and RCCL:
- A vanilla PyTorch matrix multiplication followed by an all-reduce operation gives the correct result.
Here every process creates a 2x2 matrix on its own GPU, performs a matrix multiplication, and then performs an all-reduce operation, which combines the results across all GPUs. We verified the calculation and it is correct, so the distributed setup itself should not introduce accuracy issues.
- Launch one process on each partition using mpirun:
mpirun --allow-run-as-root \
-np 1 -x ROCR_VISIBLE_DEVICES=0 python mpi_test.py : \
-np 1 -x ROCR_VISIBLE_DEVICES=1 python mpi_test.py
Each MPI rank sees only the partition assigned to it via ROCR_VISIBLE_DEVICES. This further confirms that partition isolation is working.
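The `mpi_test.py` script itself is not included in this report. As a rough sketch of what such a check might look like, assuming Open MPI (which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`) and a ROCm build of PyTorch (where the `nccl` backend is backed by RCCL): the matrix values, helper names, and the `MASTER_ADDR`/`MASTER_PORT` defaults below are illustrative assumptions, not the original script.

```python
import os


def local_product_value(rank):
    """Every entry of ((rank + 1) * ones(2, 2)) @ ones(2, 2)."""
    return 2.0 * (rank + 1)


def expected_all_reduce_value(world_size):
    """Every entry of the 2x2 result after a SUM all-reduce over all ranks."""
    return sum(local_product_value(r) for r in range(world_size))


def main():
    # Imported lazily so the helpers above can be checked without a GPU stack.
    import torch
    import torch.distributed as dist

    # Open MPI exports these for every rank it launches.
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    # Assumed rendezvous settings for a single-node run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ROCR_VISIBLE_DEVICES exposes a single partition, so it is device 0.
    device = torch.device("cuda", 0)
    torch.cuda.set_device(device)

    a = torch.full((2, 2), float(rank + 1), device=device)
    b = torch.ones(2, 2, device=device)
    result = a @ b                                  # local matmul
    dist.all_reduce(result, op=dist.ReduceOp.SUM)   # sum across partitions

    expected = torch.full(
        (2, 2), expected_all_reduce_value(world_size), device=device
    )
    assert torch.allclose(result, expected), f"rank {rank}: {result.tolist()}"
    print(f"rank {rank}: all-reduce OK {result.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because each rank contributes a value that depends on its rank, a wrong or missing contribution from any partition changes the reduced sum and trips the assertion.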
- Finally, we ran rccl-tests with 8 partitions, and this works too.
./build/all_reduce_perf -b 2M -e 8M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 2097152 maxBytes 8388608 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:b0a3841
# Using devices
# Rank 0 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 0 [0000:1b:00] AMD Instinct MI300X
# Rank 1 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 1 [0000:1b:00] AMD Instinct MI300X
# Rank 2 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 2 [0000:1b:00] AMD Instinct MI300X
# Rank 3 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 3 [0000:1b:00] AMD Instinct MI300X
# Rank 4 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 4 [0000:1b:00] AMD Instinct MI300X
# Rank 5 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 5 [0000:1b:00] AMD Instinct MI300X
# Rank 6 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 6 [0000:1b:00] AMD Instinct MI300X
# Rank 7 Group 0 Pid 1158 on ENC1-CLS01-SVR07 device 7 [0000:1b:00] AMD Instinct MI300X
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
2097152 524288 float sum -1 75.02 27.95 48.92 0 72.64 28.87 50.52 0
4194304 1048576 float sum -1 93.79 44.72 78.26 0 89.37 46.93 82.13 0
8388608 2097152 float sum -1 165.7 50.62 88.59 0 159.6 52.57 92.00 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 73.4045
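The log above can also be checked mechanically. The following is a small sketch, not part of rccl-tests, that parses the three data rows, confirms that both `#wrong` columns are zero, and recomputes the average bus bandwidth from the printed values. The function names are hypothetical, and the parser assumes the 13-column layout shown above with validation enabled, so that `#wrong` is an integer.

```python
SAMPLE = """\
2097152 524288 float sum -1 75.02 27.95 48.92 0 72.64 28.87 50.52 0
4194304 1048576 float sum -1 93.79 44.72 78.26 0 89.37 46.93 82.13 0
8388608 2097152 float sum -1 165.7 50.62 88.59 0 159.6 52.57 92.00 0
"""


def parse_rccl_rows(text):
    """Extract size, busbw, and #wrong from all_reduce_perf data rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 13 columns and start with the message size in bytes;
        # comment lines starting with '#' are skipped.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        rows.append({
            "size": int(fields[0]),
            # (out-of-place, in-place) bus bandwidth in GB/s
            "busbw": (float(fields[7]), float(fields[11])),
            # (out-of-place, in-place) count of wrong elements
            "wrong": (int(fields[8]), int(fields[12])),
        })
    return rows


def all_correct(rows):
    """True if no row reports a wrong element in either mode."""
    return all(w == 0 for row in rows for w in row["wrong"])


def average_busbw(rows):
    """Mean bus bandwidth over both modes and all sizes."""
    values = [bw for row in rows for bw in row["busbw"]]
    return sum(values) / len(values)


if __name__ == "__main__":
    rows = parse_rccl_rows(SAMPLE)
    print("all correct:", all_correct(rows))
    print("avg busbw (GB/s):", round(average_busbw(rows), 2))
```

The recomputed average is based on the rounded values printed in the table, so it only approximately matches the `Avg bus bandwidth` line that rccl-tests reports.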