Your current environment
The output of python collect_env.py
ROCM Version : 6.3.42133-1b9c17779
vLLM Version : 0.9.1.dev325+g9d880f594 (git sha: 9d880f594)
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
The memory taken by piecewise CUDA graph capture is much higher on ROCm (MI300) than on CUDA (H100); see the table below. The issue appears to be specific to piecewise capture: with full-graph capture on ROCm, the captured size is in the normal range.
Note: the issue is not related to RCCL/all_reduce etc., because the sizes below were captured with TP=1.
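For comparing the two modes, a rough A/B sketch is below. The `compilation_config={"full_cuda_graph": ...}` knob is an assumption about the V1 config surface in this vLLM version (the flag name may differ in your build); everything else mirrors the repro steps that follow.

```python
# Hypothetical A/B script: run once with the default piecewise capture and once
# with full-graph capture (pass "full" as the first CLI arg), then compare the
# "Graph capturing finished ... took X GiB" line printed during engine init.
# compilation_config={"full_cuda_graph": ...} is an assumed knob; adjust to
# whatever your vLLM build actually exposes.
import os
import sys

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

full_graph = len(sys.argv) > 1 and sys.argv[1] == "full"
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,                               # TP=1, matching the numbers below
    compilation_config={"full_cuda_graph": full_graph},   # assumed knob name
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```

Run it once without arguments (piecewise) and once with `full`, and compare the "took X GiB" figure logged during engine init.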
Instructions to reproduce the issue:
The engine init logs contain the captured graph size, e.g.:
VLLM_USE_V1=1 python examples/offline_inference/basic/generate.py
INFO 06-12 19:49:27 [gpu_model_runner.py:2051] Graph capturing finished in 38 secs, took 6.32 GiB
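If it helps, here is a small stdlib-only helper that runs the repro command and pulls the size out of that log line. The regex only matches the line format shown above, and the script path is the one from this repro.

```python
# Extract the captured-graph size from the engine init logs, so the ROCm and
# CUDA numbers in the table below can be collected the same way on both machines.
import os
import re
import subprocess

env = dict(os.environ, VLLM_USE_V1="1")
proc = subprocess.run(
    ["python", "examples/offline_inference/basic/generate.py"],
    env=env,
    capture_output=True,
    text=True,
)
log = proc.stdout + proc.stderr
match = re.search(r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB", log)
print("graph capture size:", match.group(1) + " GiB" if match else "not found")
```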
| Model (V1 engine) | ROCm (MI300) | CUDA (H100) |
|---|---|---|
| Llama-2-7b-hf | 2.97 GiB | 0.61 GiB |
| Llama-2-70b-hf | 6.32 GiB | 1.35 GiB |
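To help isolate whether this is vLLM-specific or a general HIP-graph behaviour, here is a standalone sketch (no vLLM) that captures the same toy layer stack either as a single graph or as one graph per layer, and reports the device-memory delta via torch.cuda.mem_get_info(). torch.cuda maps to HIP graphs on ROCm builds of PyTorch, so the same script runs on MI300 and H100. The model is an illustrative MLP stack, not the Llama architecture from the table.

```python
# Compare device memory consumed by one full graph capture vs. per-layer
# piecewise captures of the same layer stack, measured as the drop in free
# device memory (mimicking how vLLM reports the captured graph size).
import torch

def free_gib():
    free, _total = torch.cuda.mem_get_info()
    return free / 2**30

def capture(layers, piecewise: bool):
    x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    # Warm up on a side stream before capture, as recommended for CUDA graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        y = x
        for layer in layers:
            y = layer(y)
    torch.cuda.current_stream().wait_stream(s)
    torch.cuda.synchronize()

    before = free_gib()
    graphs = []
    if piecewise:
        # One graph per layer, roughly mimicking piecewise capture.
        y = x
        for layer in layers:
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                y = layer(y)
            graphs.append(g)
    else:
        # One graph over the whole stack (full-graph capture).
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            y = x
            for layer in layers:
                y = layer(y)
        graphs.append(g)
    torch.cuda.synchronize()
    # Keep the graphs alive so their memory pools are not released.
    return before - free_gib(), graphs

layers = [torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda") for _ in range(32)]
full_delta, _g_full = capture(layers, piecewise=False)
piece_delta, _g_piece = capture(layers, piecewise=True)
print(f"full-graph capture: {full_delta:.2f} GiB")
print(f"piecewise capture:  {piece_delta:.2f} GiB")
```

For the cleanest numbers, run each mode in its own process so the two measurements do not share caching-allocator state.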