[CPU] V1 support for the CPU backend #16441
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from fa7a156 to 88b96c4.
Nice, can you also update the V1 User Guide to reflect this support?
@WoosukKwon @Isotr0py @mgoin can you take a look at this as well?
Added some initial comments, PTAL!
return "TORCH_SDPA" | |
return "TORCH_SDPA_VLLM_V1" |
vllm/v1/worker/cpu_model_runner.py
I think we will also need a ModelRunnerBase class for v1, to avoid directly inheriting from the GPU runner.
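Such a base could look something like the following hypothetical sketch; none of these class or method names are existing vLLM APIs, it only illustrates the kind of device-agnostic interface being suggested.

```python
# Hypothetical sketch of a V1 ModelRunnerBase; names are illustrative only.
from abc import ABC, abstractmethod
from typing import Any


class ModelRunnerBase(ABC):
    """Device-agnostic interface shared by GPU/CPU/TPU V1 model runners."""

    @abstractmethod
    def load_model(self) -> None:
        """Load the model weights onto the runner's device."""

    @abstractmethod
    def initialize_kv_cache(self, kv_cache_config: Any) -> None:
        """Allocate the KV cache according to the engine's config."""

    @abstractmethod
    def execute_model(self, scheduler_output: Any) -> Any:
        """Run one forward pass for the scheduled batch."""
```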
Absolutely, though maybe that should be done once the GPU runner becomes stable. At least for now, the CPU V1 runner can fully reuse the GPU runner with limited additional changes.
I think we should put the v1 SDPA attention backend at vllm/v1/attention/backends to avoid coupling.
Good idea :)
ditto.
We should add a new AttentionImpl for v1 instead of using the legacy ones, because prefix caching and chunked prefill are always enabled by default.
Yes, chunked_prefill is always enabled when building metadata for v1. Right now I think the legacy impl is enough for enabling v1 on CPU, and perhaps we can provide a new unified CPU attn impl covering multiple attn features in the future.
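To make the chunked-prefill point concrete, here is a small self-contained sketch in plain PyTorch (not vLLM code): queries from several requests are packed into one tensor and sliced by query_start_loc, then each slice attends against that request's own KV history.

```python
# Minimal sketch of chunked-prefill style batching with torch SDPA.
import torch
import torch.nn.functional as F

num_heads, head_size = 4, 32
# Two requests in the batch: 5 and 3 new (query) tokens each.
query_start_loc = torch.tensor([0, 5, 8])
# Total KV tokens per request (cached context + new tokens).
kv_lens = [12, 7]

packed_q = torch.randn(8, num_heads, head_size)  # all query tokens, packed

outputs = []
for i, kv_len in enumerate(kv_lens):
    s, e = query_start_loc[i].item(), query_start_loc[i + 1].item()
    q = packed_q[s:e].transpose(0, 1)              # (heads, q_len, head_size)
    k = torch.randn(num_heads, kv_len, head_size)  # stand-in for cached keys
    v = torch.randn(num_heads, kv_len, head_size)  # stand-in for cached values
    # A real impl would build a causal mask against the KV history here; the
    # plain call below only demonstrates the data layout.
    outputs.append(F.scaled_dot_product_attention(q, k, v).transpose(0, 1))

packed_out = torch.cat(outputs, dim=0)  # back to (total_q_tokens, heads, d)
```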
> perhaps we can provide a new unified CPU attn impl for multiple attn features in the future.
Sure! In fact, I also expect FlexAttention to be supported on the CPU backend: #16078. Perhaps we can switch from SDPA to FlexAttention once it lands.
seq_lens=prefill_seq_lens,
seq_lens_tensor=seq_lens_tensor,
max_query_len=max_query_len,
max_kv_len=max_kv_len,
why does this file require changes?
# For chunked prefill only
max_query_len: Optional[int] = None
max_kv_len: Optional[int] = None
query_start_loc: Optional[torch.Tensor] = None
why does this file require changes?
It is a naming conflict. The V1 model runner uses query_start_loc specifically for logits indexing, and it contains all tokens in a batch. But in torch_sdpa, query_start_loc contains prefill tokens only, so it is renamed to prefill_query_start_loc.
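An illustrative sketch of the naming split described above (class and field names here are simplified/hypothetical, not the exact PR code): the runner-level tensor covers every token in the batch, while the SDPA metadata keeps a prefill-only tensor under the renamed field.

```python
# Simplified sketch of the two distinct cumulative-length tensors.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class RunnerBatchState:
    # Cumulative starts for *all* requests (decodes + prefills); the V1
    # model runner uses this for logits indexing.
    query_start_loc: torch.Tensor


@dataclass
class TorchSDPAMetadataSketch:
    # Cumulative starts for the *prefill* requests only; renamed so it does
    # not clash with the runner-level field above.
    prefill_query_start_loc: Optional[torch.Tensor] = None
    max_query_len: Optional[int] = None
    max_kv_len: Optional[int] = None
```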
I would try to keep the renaming inside torch_sdpa instead of making the global change.
num_kv_heads, head_size)

@staticmethod
def swap_blocks(
I don't think swap_blocks or copy_blocks are needed for V1.
Removed
Thanks for the contribution!
vllm/v1/worker/cpu_model_runner.py
This is a bit hacky. Have you considered just duplicating GPUModelRunner while cutting out all unsupported features and the unnecessary device/cpu double tensors?
We have done something similar for TPU: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/tpu_model_runner.py.
It's uglier on the eyes, but it's easier to control until things are more stable.
Exactly, it is a bit hacky. We actually did it that way in V0, and it worked well for cutting out some hardcoded features unsupported on CPU, such as CUDAGraph and FlashAttnBackend. But eventually the V0 CPU runner diverged from the GPU runner and couldn't leverage some new features and refactors directly.
Compared with V0, the V1 model runner has better abstractions, like unified input data management, vLLM compilation, and the attention backend builder and reorder, etc. These abstractions allow the CPU model runner to be much more compatible with the GPU one. Moreover, I think the V1 model runner has become relatively stable, as almost no changes were required when rebasing this PR.
So I would prefer to fully reuse the GPU model runner for now, even with some hacks. If some day we have to decouple them, I think it will not be very difficult.
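A hedged sketch of the reuse-by-inheritance approach being discussed; it assumes vLLM is installed, and the overridden hooks below are illustrative placeholders rather than the exact set of methods touched in this PR.

```python
# Sketch only: subclass the V1 GPU runner and override CUDA-specific pieces.
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class CPUModelRunnerSketch(GPUModelRunner):
    """Reuses the V1 GPU runner, replacing only CUDA-only behavior."""

    def capture_model(self) -> None:
        # CUDA graph capture does not apply to the CPU backend, so this
        # becomes a no-op (hypothetical simplification).
        pass

    def _sync_device(self) -> None:
        # Nothing to synchronize on CPU; a GPU runner would call
        # torch.cuda.synchronize() here (hypothetical hook name).
        pass
```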
While I definitely see your point and the strengths of leveraging a hierarchical structure, I am not 100% sure there won't be recurring hiccups where we have to fix the runner as an unintended consequence of some other unrelated PR.
Still, I am not against this implementation; perhaps we could do a better job at writing platform-aware code in other parts of the codebase.
E.g. (not for this PR) abstracting away those cpu/device tensors in a base class so that we don't have to resort to this workaround here.
vllm/platforms/cpu.py
This is new?
Yes, I think cascade attention is only supported in the flash-attn backend, so I disable it here.
I also noticed support_sleep_mode has become a platform attribute, so I removed the check here.
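Purely illustrative (the class and hook names here are hypothetical, not existing vLLM APIs): one way a CPU platform could opt out of cascade attention, which currently relies on the CUDA-only FlashAttention backend.

```python
# Hypothetical sketch of a platform-level opt-out for cascade attention.
class CpuPlatformSketch:

    @classmethod
    def supports_cascade_attention(cls) -> bool:
        # Cascade attention is built on top of flash-attn, which is not
        # available on CPU, so report it as unsupported.
        return False
```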
Signed-off-by: jiang.li <jiang1.li@intel.com>
Hi @simon-mo, all required tests are now green. We enabled CPU V1 as the default, and all CPU tests compatible with the V1 engine passed: https://buildkite.com/vllm/ci/builds/21328#0197361e-3946-4915-953d-7a8baeae1137. I think this PR is ready to merge, please take a look, thanks :)
Resolves #16056.
Supports all features listed in the CPU doc except the FP8 KV cache.
Changes
- CPUWorker and CPUModelRunner, derived from Worker and GPUModelRunner to reduce code duplication.
- TorchSDPABackend with compatible interfaces for GPUModelRunner, such as reorder_batch and build.
- GPUModelRunner changes to avoid importing flash-attn explicitly and using the default nccl dist backend.
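A quick usage sketch of what this PR enables: driving the CPU backend with the V1 engine. VLLM_USE_V1 and VLLM_CPU_KVCACHE_SPACE are existing vLLM environment variables; the model name and values are just examples.

```python
# Example only: run the V1 engine on the CPU backend.
import os

os.environ["VLLM_USE_V1"] = "1"             # opt in to the V1 engine
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"  # CPU KV cache size, in GiB

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")        # example model, runs on CPU
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```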