GPU Model Runner V2 #25266
base: main
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
Lgtm!
        # Request not found.
        return
    self.index_to_req_id.pop(req_idx, None)
    self.free_indices.append(req_idx)
Would it be easier to make this a set so that we can check uniqueness?
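A minimal sketch of what I mean (the class and method names below are assumed for illustration, not the PR's actual code): keeping the free slots in a set makes uniqueness checks and double-free asserts cheap, at the cost of losing allocation order.

```python
class RequestSlots:
    def __init__(self, max_num_reqs: int):
        self.index_to_req_id: dict[int, str] = {}
        # All slots start out free; a set makes membership checks O(1).
        self.free_indices: set[int] = set(range(max_num_reqs))

    def add_request(self, req_id: str) -> int:
        req_idx = self.free_indices.pop()  # arbitrary free slot
        self.index_to_req_id[req_idx] = req_id
        return req_idx

    def remove_request(self, req_idx: int) -> None:
        if self.index_to_req_id.pop(req_idx, None) is None:
            # Request not found.
            return
        # Cheap guard against freeing the same slot twice.
        assert req_idx not in self.free_indices, "slot freed twice"
        self.free_indices.add(req_idx)
```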
vllm/v1/worker/gpu/attn_utils.py (outdated)
    for group in kv_cache_config.kv_cache_groups:
        for layer_name in group.layer_names:
            layer_names.add(layer_name)
    assert layer_names == set(kv_cache_raw_tensors.keys())
Maybe cache this
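For example, a minimal sketch of the caching idea (the class and property below are illustrative, not how attn_utils.py is actually structured), assuming kv_cache_config does not change after initialization:

```python
from functools import cached_property


class KVCacheLayerNames:
    def __init__(self, kv_cache_config):
        self.kv_cache_config = kv_cache_config

    @cached_property
    def layer_names(self) -> set[str]:
        # Computed once on first access, then reused on every later call
        # instead of rebuilding the set per invocation.
        names: set[str] = set()
        for group in self.kv_cache_config.kv_cache_groups:
            names.update(group.layer_names)
        return names
```

The assertion against `set(kv_cache_raw_tensors.keys())` could then compare against the cached set.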
    attn_layers = get_layers_from_vllm_config(vllm_config, Attention)
    for kv_cache_group_spec in kv_cache_config.kv_cache_groups:
        layer_names = kv_cache_group_spec.layer_names
        any_layer_name = next(iter(layer_names))
This appears to assume an always-on hybrid KV-cache manager. I am 100% supportive of this, but IIRC the reason we still supported disabling the hybrid KV-cache manager was that P/D did not support it yet. cc @NickLucche @heheda12345: do you know the state of P/D + hybrid KV-cache? Is there any other reason we would want to disable the hybrid KV-cache?
Still not supported with P/D; @KuntaiDu has a series of PRs to enable the hybrid allocator + KV connectors first.
vllm/v1/worker/gpu/model_runner.py (outdated)
        [scheduler_output.num_scheduled_tokens[i] for i in req_ids],
        dtype=np.int32)

    # TODO(woosuk): Support CUDA graphs.
Nice! Happy to see this above the attention metadata prep; we should definitely treat full cudagraphs as first-class, and I think padding for cudagraphs after attention metadata prep has been a major source of headaches/bugs. Very excited about this!
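Roughly the ordering I mean, as a sketch (the bucket sizes and helper name below are assumptions for illustration, not the PR's actual API): pick the padded token count first, then build the attention metadata for that padded size.

```python
# Hypothetical capture sizes; real deployments derive these from config.
CUDAGRAPH_BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]


def pad_for_cudagraph(num_tokens: int, max_capture_size: int = 128) -> int:
    """Round num_tokens up to the nearest captured batch size."""
    if num_tokens > max_capture_size:
        # Too large to replay a captured graph; run eagerly at the true size.
        return num_tokens
    for size in CUDAGRAPH_BATCH_SIZES:
        if num_tokens <= size:
            return size
    return num_tokens


# num_tokens_after_padding = pad_for_cudagraph(num_tokens)
# attn_metadata = build_attn_metadata(..., num_tokens=num_tokens_after_padding)
```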
@WoosukKwon great work, I really like the direction.
Left a few comments on random things I noticed.
And of course there will be some work to integrate with other features, as discussed.
vllm/v1/worker/gpu/states.py (outdated)
    def add_request(
        self,
        req_id: str,
        prompt_token_ids: list[int],
Maybe we can also look at keeping things as np arrays end to end, or array.array for lists that grow (for later).
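For example (illustrative only, not the PR's code), array.array gives cheap appends plus a zero-copy numpy view when the batch is built:

```python
import array

import numpy as np

# Growing per-request token ids in a compact buffer instead of a Python list.
prompt_token_ids = array.array("i", [1, 2, 3])  # 32-bit signed ints
prompt_token_ids.append(4)                      # cheap amortized append

# Zero-copy view for numpy-based batching code. Note: the array cannot be
# resized while a view is exported, so take the view only when needed and
# drop it before appending again.
view = np.frombuffer(prompt_token_ids, dtype=np.int32)
print(view)  # [1 2 3 4]
del view
prompt_token_ids.append(5)
```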
        )
        self.np = np.zeros_like(self.buffer.np)

    def copy_np_to_gpu(self, x: np.ndarray) -> torch.Tensor:
Have this take the mapping?
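For illustration, a hypothetical variant that takes the index mapping and gathers the relevant rows before the host-to-device copy (the signature below is an assumption, not the PR's API):

```python
from typing import Optional

import numpy as np
import torch


def copy_np_to_gpu(x: np.ndarray,
                   idx_mapping: Optional[np.ndarray] = None,
                   device: str = "cuda") -> torch.Tensor:
    if idx_mapping is not None:
        # Gather only the rows for the currently scheduled requests.
        x = x[idx_mapping]
    # Ensure a contiguous buffer before handing it to torch.
    return torch.from_numpy(np.ascontiguousarray(x)).to(device,
                                                        non_blocking=True)
```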
    # TODO(woosuk): Support CUDA graphs.
    num_tokens_after_padding = num_tokens

    idx_mapping_list = [
Could we skip updating idx_mapping and recreating the sampling metadata if the req_ids list hasn't changed?
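Something like this sketch, assuming the cached attribute names below (they are not the PR's actual fields): rebuild only when the batch composition actually changes.

```python
from typing import Callable, Optional


class BatchState:
    def __init__(self) -> None:
        self._cached_req_ids: Optional[tuple[str, ...]] = None
        self.idx_mapping = None
        self.sampling_metadata = None

    def maybe_update(
        self,
        req_ids: list[str],
        build_idx_mapping: Callable[[list[str]], object],
        build_sampling_metadata: Callable[[list[str]], object],
    ) -> None:
        key = tuple(req_ids)
        if key == self._cached_req_ids:
            # Same requests in the same order: reuse the cached state.
            return
        self._cached_req_ids = key
        self.idx_mapping = build_idx_mapping(req_ids)
        self.sampling_metadata = build_sampling_metadata(req_ids)
```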
I'm not sure if this is expected behavior, given that this PR is only a draft, but I hit a crash when testing this PR with the setup below. If I remove …

    git clone --branch woosuk/model-runner-v2 https://github.com/vllm-project/vllm && cd vllm && git reset --hard 866eef50cae7f9a5f10dbbad8cdf34d07c943f1b
    VLLM_USE_PRECOMPILED=1 uv pip install --editable .[flashinfer]
    vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 4 -dcp 4 --async-scheduling --served-model-name default --max-model-len 9216

Then I send about 50 concurrent requests at it for a minute or so (with several requests sharing a prefix; unsure if that matters), and it crashes (crash logs and notes omitted).
Is it possible you're missing a …
Force-pushed from cf0666e to 09e4b2f.
Documentation preview: https://vllm--25266.org.readthedocs.build/en/25266/
@josephrocca there are two DCP bugfixes, #26296 and #27518 (still open); your case should be fixed by #26296. You can try cherry-picking these fixes.
Key Changes
- max_model_len and num_kv_groups increase …
- ModelRunnerOutput & SchedulerOutput …

TODOs