[UX] Add Feedback During CUDAGraph Capture #19501

Merged · Jun 11, 2025 · 1 commit

5 changes: 4 additions & 1 deletion vllm/v1/worker/gpu_model_runner.py
@@ -12,6 +12,7 @@
 import torch
 import torch.distributed
 import torch.nn as nn
+from tqdm import tqdm

 import vllm.envs as envs
 from vllm.attention import AttentionType, get_attn_backend
@@ -2034,7 +2035,9 @@ def capture_model(self) -> None:
         # can reuse the memory pool allocated for the large shapes.
         with graph_capture(device=self.device):
             skip_attn = not self.vllm_config.compilation_config.full_cuda_graph
-            for num_tokens in reversed(self.cudagraph_batch_sizes):
+            for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes),
Member:

Please make the tqdm run only on the first model runner; otherwise the TP/PP ranks will clobber each other's output as each one increments its own bar.

See how we did it in V0 (snippet below), although I think we should generalize this to any parallelism rather than just TP:

# Only rank 0 should print progress bar during capture
if get_tensor_model_parallel_rank() == 0:
    compilation_cases = tqdm(
        list(compilation_cases),
        desc="Capturing CUDA graph shapes")

Member:

sorry about this, michael

desc="Capturing CUDA graphs",
total=len(self.cudagraph_batch_sizes)):
Comment on lines +2038 to +2040
Contributor:

Severity: medium

The addition of tqdm for progress during CUDA graph capture is a good user experience improvement.

To provide more fine-grained control, especially in non-interactive environments or when users prefer quieter logs, consider making the progress bar display conditional. A common approach is to tie its visibility to the logging level. For example, you could disable tqdm if the effective log level is WARNING or higher.

This aligns with how tqdm is used elsewhere in vLLM (e.g., for model weight loading, which uses load_config.use_tqdm_on_load). While adding a new config flag to CompilationConfig (e.g., use_tqdm_on_capture) would be the most consistent, a simpler approach using the existing logger is also possible.

(Note: The suggestion below would also require import logging at the top of the file.)

Suggested change
-for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes),
-                       desc="Capturing CUDA graphs",
-                       total=len(self.cudagraph_batch_sizes)):
+# Determine if tqdm should be disabled based on log level
+# (add `import logging` at the top of the file if not already present)
+effective_log_level = logger.getEffectiveLevel()
+# Disable tqdm if logging level is WARNING or higher. Adjust as needed.
+disable_tqdm = effective_log_level >= logging.WARNING
+for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes),
+                       desc="Capturing CUDA graphs",
+                       total=len(self.cudagraph_batch_sizes),
+                       disable=disable_tqdm):
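
For reference, a sketch of the more consistent alternative the comment mentions: a use_tqdm_on_capture flag mirroring load_config.use_tqdm_on_load. The field is an assumption for illustration and does not exist in vLLM's CompilationConfig:

from dataclasses import dataclass
from tqdm import tqdm

@dataclass
class CompilationConfig:
    # Hypothetical field, named by analogy with use_tqdm_on_load;
    # not part of the actual vLLM config.
    use_tqdm_on_capture: bool = True

config = CompilationConfig(use_tqdm_on_capture=False)  # e.g. quiet CI runs
for num_tokens in tqdm([512, 256, 128],  # stand-in for cudagraph_batch_sizes
                       desc="Capturing CUDA graphs",
                       disable=not config.use_tqdm_on_capture):
    pass  # per-shape capture work would run here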

                 for _ in range(self.vllm_config.compilation_config.
                                cudagraph_num_of_warmups):
                     self._dummy_run(num_tokens, skip_attn=skip_attn)