
[Bug]: V1 piecewise cudagraph capture size on ROCm is much higher than on cuda #19579

Open
@divakar-amd

Description


Your current environment

The output of python collect_env.py
ROCM Version                 : 6.3.42133-1b9c17779
vLLM Version                 : 0.9.1.dev325+g9d880f594 (git sha: 9d880f594)
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

The memory taken by piecewise cudagraph capture is much higher on ROCm (MI300) than on CUDA (H100); see the table below. The issue appears to be specific to piecewise capture: with a full-graph capture on ROCm, the captured size is normal.

Note: the issue is not related to RCCL/all_reduce etc., since the sizes below were captured with TP=1.

Instructions to reproduce the issue:

The engine init logs contain the captured graph size, e.g.:
VLLM_USE_V1=1 python examples/offline_inference/basic/generate.py

INFO 06-12 19:49:27 [gpu_model_runner.py:2051] Graph capturing finished in 38 secs, took 6.32 GiB
Model (V1 engine)    ROCm        CUDA
Llama-2-7b-hf        2.97 GiB    0.61 GiB
Llama-2-70b-hf       6.32 GiB    1.35 GiB
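For anyone comparing captured sizes across backends or vLLM versions, here is a small hypothetical helper (not part of vLLM) that extracts the size in GiB from the init log line shown above:

```python
import re

# Matches the vLLM V1 engine-init log line, e.g.
#   "... Graph capturing finished in 38 secs, took 6.32 GiB"
GRAPH_SIZE_RE = re.compile(
    r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB"
)

def graph_size_gib(log_line: str):
    """Return the captured cudagraph size in GiB, or None if no match."""
    m = GRAPH_SIZE_RE.search(log_line)
    return float(m.group(1)) if m else None

line = ("INFO 06-12 19:49:27 [gpu_model_runner.py:2051] "
        "Graph capturing finished in 38 secs, took 6.32 GiB")
print(graph_size_gib(line))  # 6.32
```

Running the reproduction command on each platform and feeding the log through this helper reproduces the table above.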

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
