
Conversation


@bigPYJ1151 bigPYJ1151 commented Apr 11, 2025

Resolves #16056

Supports all features listed in the CPU doc except the FP8 KV cache.

Changes

  • Add a V1 CPUWorker and CPUModelRunner, derived from Worker and GPUModelRunner to reduce code duplication (see the sketch after this list).
  • Add a V1 TorchSDPABackend with interfaces compatible with GPUModelRunner, such as reorder_batch and build.
  • Additional changes in GPUModelRunner to avoid importing flash-attn explicitly and to avoid defaulting to the nccl distributed backend.
  • Enable the CPU tests to run against the V1 engine.
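
As a rough sketch of the inheritance approach above (class and module names follow the current upstream layout and are illustrative only, not necessarily the exact code in this PR):

from vllm.v1.worker.gpu_model_runner import GPUModelRunner
from vllm.v1.worker.gpu_worker import Worker


class CPUModelRunner(GPUModelRunner):
    # Reuse the GPU runner wholesale and override only what differs on CPU,
    # e.g. skipping CUDA-graph capture and keeping all tensors on the CPU device.
    ...


class CPUWorker(Worker):
    # Reuse the GPU worker; the main differences are device initialization and
    # using a CPU-friendly distributed backend (e.g. "gloo") instead of "nccl".
    ...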


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 11, 2025

mergify bot commented Apr 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bigPYJ1151.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 11, 2025
@mergify mergify bot removed the needs-rebase label Apr 11, 2025
@bigPYJ1151 bigPYJ1151 force-pushed the cpu_v1_up branch 5 times, most recently from fa7a156 to 88b96c4 on April 11, 2025 04:14
@DarkLight1337
Member
Member

Nice, can you also update the V1 User Guide to reflect this support?

@mergify mergify bot added the documentation and ci/build labels Apr 11, 2025
@DarkLight1337
Member
Member

@WoosukKwon @Isotr0py @mgoin can you take a look at this as well?

@Isotr0py (Member) left a comment

Added some initial comments, PTAL!

Member

Suggested change
-    return "TORCH_SDPA"
+    return "TORCH_SDPA_VLLM_V1"

Member

I think we will also need a ModelRunnerBase class for v1 to avoid directly inheriting from the GPU runner.
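
Something along these lines, as a rough sketch only (not part of this PR; the names here are hypothetical):

from abc import ABC, abstractmethod


class ModelRunnerBase(ABC):
    # Device-agnostic pieces shared by the GPU/CPU/TPU runners; each device
    # runner would implement only the device-specific parts.

    @abstractmethod
    def load_model(self) -> None:
        ...

    @abstractmethod
    def execute_model(self, scheduler_output):
        ...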

Member Author

Absolutely, though maybe that should be done once the GPU runner becomes stable. At least the CPU V1 runner can fully reuse the GPU runner with limited additional changes.


Comment on lines 81 to 83
Member

I think we should put the v1 sdpa attention backend at vllm/v1/attention/backends to avoid coupling.

Member Author

Good idea :)

Comment on lines 346 to 347
Member

ditto.

Comment on lines 88 to 90
Member

We should add a new AttentionImpl for v1 instead of using the legacy ones, because prefix caching and chunked prefill are always enabled by default.

Member Author

Yes, chunked_prefill is always enabled when building metadata for v1. Right now I think the legacy impl is enough for enabling v1 on CPU, and perhaps we can provide a new unified CPU attn impl for multiple attn features in the future.
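
For illustration only (all numbers made up), chunked prefill just means a long prompt is processed across multiple scheduler steps, and each chunk's queries attend to the KV already cached from earlier chunks plus the chunk itself:

# Toy example: a 300-token prompt processed in chunks of 128 tokens per step.
prompt_len = 300
chunk_size = 128

num_computed = 0
step = 0
while num_computed < prompt_len:
    num_new = min(chunk_size, prompt_len - num_computed)
    # This chunk's queries attend to num_computed cached tokens plus num_new new ones.
    print(f"step {step}: query tokens={num_new}, kv tokens={num_computed + num_new}")
    num_computed += num_new
    step += 1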

Member

perhaps we can provide a new unified CPU attn impl for multiple attn features in the future.

Sure! In fact, I also expect FlexAttention to be supported on the CPU backend: #16078. Perhaps we can switch from SDPA to FlexAttention once it lands.

seq_lens=prefill_seq_lens,
seq_lens_tensor=seq_lens_tensor,
max_query_len=max_query_len,
max_kv_len=max_kv_len,
Collaborator

why does this file require changes?

# For chunked prefill only
max_query_len: Optional[int] = None
max_kv_len: Optional[int] = None
query_start_loc: Optional[torch.Tensor] = None
Collaborator

why does this file require changes?

Member Author

It is a naming conflict. The V1 model runner uses query_start_loc specifically for logits indexing, and it covers all tokens in the batch. But in torch_sdpa, query_start_loc contains prefill tokens only, so I renamed it to prefill_query_start_loc.
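
A toy example of the difference (all values made up):

import torch

# A batch with 2 decode requests (1 new token each) followed by 2 prefill
# requests (5 and 3 new tokens).
num_scheduled_tokens = torch.tensor([1, 1, 5, 3])

# V1 model runner: offsets over ALL scheduled tokens, used for logits indexing.
query_start_loc = torch.cat(
    [torch.zeros(1, dtype=torch.int64),
     torch.cumsum(num_scheduled_tokens, dim=0)])
# -> tensor([ 0,  1,  2,  7, 10])

# torch_sdpa metadata: offsets over the prefill tokens only, hence the rename.
prefill_query_start_loc = torch.cat(
    [torch.zeros(1, dtype=torch.int64),
     torch.cumsum(num_scheduled_tokens[2:], dim=0)])
# -> tensor([0, 5, 8])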

Collaborator

I would try to handle the renaming within torch_sdpa rather than making a global change.

num_kv_heads, head_size)

@staticmethod
def swap_blocks(
Collaborator

I don't think swap_blocks or copy_blocks are needed for V1.

Member Author

Removed

@NickLucche (Collaborator) left a comment

Thanks for the contribution!

Collaborator

This is a bit hacky. Have you considered just duplicating GPUModelRunner while cutting out all the unsupported features and the unnecessary device/cpu double tensors?
We did something similar for TPU: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/tpu_model_runner.py

It's uglier to the eye, but it's easier to control until things are more stable.

Member Author

Exactly, it is a bit hacky. We actually took that approach in V0, and it worked well to cut out some hardcoded features unsupported on CPU, such as CUDAGraph and FlashAttnBackend. But eventually the V0 CPU runner diverged from the GPU runner and couldn't leverage new features and refactors directly.

Compared with V0, the V1 model runner has better abstractions, like unified input data management, vLLM compilation, and the attention backend builder and reorder, etc. These abstractions allow the CPU model runner to stay more compatible with the GPU one. Moreover, I think the V1 model runner has become relatively stable, as almost no changes were required when rebasing this PR.

So I would prefer to fully reuse the GPU model runner for now, even with some hackiness. If some day we have to decouple them, I think it will not be very difficult.

Collaborator

While I definitely see your point and the strengths of leveraging a hierarchical structure, I am not 100% sure there won't be recurring hiccups where we have to fix the runner as an unintended consequence of some other unrelated PR.

Still, I am not against this implementation; perhaps we could do a better job of writing platform-aware code in the other parts of the codebase.

E.g. (not for this PR) abstracting away those cpu/device tensors in a base class so that we don't have to resort to this workaround here.

Collaborator

This is new?

Member Author

Yes, I think cascade attention is only supported in the flash-attn backend, so I disabled it here.

I noticed support_sleep_mode has become a platform attribute, so I removed the check here.

bigPYJ1151 added 11 commits June 3, 2025 05:22
Signed-off-by: jiang.li <jiang1.li@intel.com>
@bigPYJ1151 (Member Author)

Hi @simon-mo, all required tests are green. We enabled CPU V1 as the default, and all CPU tests compatible with the V1 engine passed. https://buildkite.com/vllm/ci/builds/21328#0197361e-3946-4915-953d-7a8baeae1137

I think this PR is ready to merge, please take a look, thanks :)

@simon-mo simon-mo merged commit 4555143 into vllm-project:main Jun 4, 2025
71 checks passed
Labels
ci/build, documentation, ready, v1
Development

Successfully merging this pull request may close these issues.

[Usage]: v1 engine on CPU
8 participants