[UX] Add Feedback During CUDAGraph Capture #19501
Changes from all commits
```diff
@@ -12,6 +12,7 @@
 import torch
 import torch.distributed
 import torch.nn as nn
+from tqdm import tqdm
 
 import vllm.envs as envs
 from vllm.attention import AttentionType, get_attn_backend
```
```diff
@@ -2034,7 +2035,9 @@ def capture_model(self) -> None:
         # can reuse the memory pool allocated for the large shapes.
         with graph_capture(device=self.device):
             skip_attn = not self.vllm_config.compilation_config.full_cuda_graph
-            for num_tokens in reversed(self.cudagraph_batch_sizes):
+            for num_tokens in tqdm(reversed(self.cudagraph_batch_sizes),
+                                   desc="Capturing CUDA graphs",
+                                   total=len(self.cudagraph_batch_sizes)):
                 for _ in range(self.vllm_config.compilation_config.
                                cudagraph_num_of_warmups):
                     self._dummy_run(num_tokens, skip_attn=skip_attn)
```
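One detail in the change above: `reversed()` returns an iterator that does not support `len()`, so `tqdm` cannot size the bar on its own; the explicit `total=` argument is what lets it render a percentage and ETA. A standalone sketch of the pattern, with hypothetical batch sizes:

```python
from tqdm import tqdm

batch_sizes = [1, 2, 4, 8, 16]  # hypothetical capture sizes

# Without total=, tqdm would count iterations but show no percentage,
# since len() fails on the reversed() iterator.
for num_tokens in tqdm(reversed(batch_sizes),
                       desc="Capturing CUDA graphs",
                       total=len(batch_sizes)):
    pass  # graph capture for this batch size would happen here
```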
Comment on lines +2038 to +2040:

The addition of `tqdm` gives useful feedback during CUDA graph capture. To provide more fine-grained control, especially in non-interactive environments or when users prefer quieter logs, consider making the progress bar display conditional. A common approach is to tie its visibility to the logging level, for example disabling the bar unless the effective log level is `INFO` or lower.
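A minimal sketch of what that conditional display could look like, assuming a standard-library `logging` logger; the `progress` helper and its wiring into vLLM's own logger are hypothetical, not part of this PR:

```python
import logging

from tqdm import tqdm

logger = logging.getLogger(__name__)  # hypothetical module logger


def progress(iterable, **tqdm_kwargs):
    """Wrap an iterable in tqdm, hiding the bar when logs are quiet.

    The bar is shown only when the effective log level is INFO or
    lower. Passing disable=None to tqdm instead would hide the bar
    on non-TTY streams, another common choice for batch jobs.
    """
    show = logger.getEffectiveLevel() <= logging.INFO
    return tqdm(iterable, disable=not show, **tqdm_kwargs)
```

With a helper like this, the capture loop would call `progress(...)` in place of `tqdm(...)`, and quiet or non-interactive runs would skip the bar entirely.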
Please make the tqdm only run on the first model runner; otherwise each TP/PP rank will clobber the others' output as they increment. See how we did it in V0, although I think we should generalize this to any parallelism rather than just TP:

vllm/vllm/worker/model_runner.py, lines 1585 to 1589 at 29fa5ca
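A minimal sketch of that gating, generalized to the global rank rather than just the tensor-parallel rank; the `capture_iter` helper is hypothetical, while the V0 code linked above gates on the TP rank:

```python
import torch.distributed as dist
from tqdm import tqdm


def capture_iter(batch_sizes):
    """Iterate capture sizes largest-first, showing a progress bar
    only on the first rank so TP/PP workers do not interleave
    their bar updates on the same terminal."""
    cases = reversed(batch_sizes)
    first_rank = not dist.is_initialized() or dist.get_rank() == 0
    if first_rank:
        return tqdm(cases, desc="Capturing CUDA graphs",
                    total=len(batch_sizes))
    return cases
```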
sorry about this, michael