Your current environment
The output of python collect_env.py
ROCM Version : 6.3.42133-1b9c17779
vLLM Version : 0.9.1.dev325+g9d880f594 (git sha: 9d880f594)
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
The memory taken by piecewise CUDA graph capture is much higher on ROCm (MI300) than on CUDA (H100); see the table below. The issue appears to be specific to piecewise capture: with full-graph capture on ROCm, the captured size is in the normal range.
Note: the issue is not related to RCCL/all_reduce etc., because the sizes below were captured with TP=1.
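For comparing the two modes, a rough A/B sketch is below. The `compilation_config={"full_cuda_graph": ...}` knob is an assumption about the V1 config surface in this vLLM version (the flag name may differ in your build); everything else mirrors the repro steps that follow.

```python
# Hypothetical A/B script: run once with the default piecewise capture and once
# with full-graph capture (pass "full" as the first CLI arg), then compare the
# "Graph capturing finished ... took X GiB" line printed during engine init.
# compilation_config={"full_cuda_graph": ...} is an assumed knob; adjust to
# whatever your vLLM build actually exposes.
import os
import sys

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

full_graph = len(sys.argv) > 1 and sys.argv[1] == "full"
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,                               # TP=1, matching the numbers below
    compilation_config={"full_cuda_graph": full_graph},   # assumed knob name
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```

Run it once without arguments (piecewise) and once with `full`, and compare the "took X GiB" figure logged during engine init.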
Instructions to reproduce the issue:
The engine init logs contain the captured graph size, e.g.:
VLLM_USE_V1=1 python examples/offline_inference/basic/generate.py
INFO 06-12 19:49:27 [gpu_model_runner.py:2051] Graph capturing finished in 38 secs, took 6.32 GiB
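If it helps, here is a small stdlib-only helper that runs the repro command and pulls the size out of that log line. The regex only matches the line format shown above, and the script path is the one from this repro.

```python
# Extract the captured-graph size from the engine init logs, so the ROCm and
# CUDA numbers in the table below can be collected the same way on both machines.
import os
import re
import subprocess

env = dict(os.environ, VLLM_USE_V1="1")
proc = subprocess.run(
    ["python", "examples/offline_inference/basic/generate.py"],
    env=env,
    capture_output=True,
    text=True,
)
log = proc.stdout + proc.stderr
match = re.search(r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB", log)
print("graph capture size:", match.group(1) + " GiB" if match else "not found")
```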
| Model (V1 engine) | ROCm (MI300) | CUDA (H100) |
|---|---|---|
| Llama-2-7b-hf | 2.97 GiB | 0.61 GiB |
| Llama-2-70b-hf | 6.32 GiB | 1.35 GiB |
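To help isolate whether this is vLLM-specific or a general HIP-graph behaviour, here is a standalone sketch (no vLLM) that captures the same toy layer stack either as a single graph or as one graph per layer, and reports the device-memory delta via torch.cuda.mem_get_info(). torch.cuda maps to HIP graphs on ROCm builds of PyTorch, so the same script runs on MI300 and H100. The model is an illustrative MLP stack, not the Llama architecture from the table.

```python
# Compare device memory consumed by one full graph capture vs. per-layer
# piecewise captures of the same layer stack, measured as the drop in free
# device memory (mimicking how vLLM reports the captured graph size).
import torch

def free_gib():
    free, _total = torch.cuda.mem_get_info()
    return free / 2**30

def capture(layers, piecewise: bool):
    x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    # Warm up on a side stream before capture, as recommended for CUDA graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        y = x
        for layer in layers:
            y = layer(y)
    torch.cuda.current_stream().wait_stream(s)
    torch.cuda.synchronize()

    before = free_gib()
    graphs = []
    if piecewise:
        # One graph per layer, roughly mimicking piecewise capture.
        y = x
        for layer in layers:
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                y = layer(y)
            graphs.append(g)
    else:
        # One graph over the whole stack (full-graph capture).
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            y = x
            for layer in layers:
                y = layer(y)
        graphs.append(g)
    torch.cuda.synchronize()
    # Keep the graphs alive so their memory pools are not released.
    return before - free_gib(), graphs

layers = [torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda") for _ in range(32)]
full_delta, _g_full = capture(layers, piecewise=False)
piece_delta, _g_piece = capture(layers, piecewise=True)
print(f"full-graph capture: {full_delta:.2f} GiB")
print(f"piecewise capture:  {piece_delta:.2f} GiB")
```

For the cleanest numbers, run each mode in its own process so the two measurements do not share caching-allocator state.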