[CPU] V1 support for the CPU backend #16441
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from fa7a156 to 88b96c4.
Nice, can you also update the V1 User Guide to reflect this support?
@WoosukKwon @Isotr0py @mgoin can you take a look at this as well?
Added some initial comments, PTAL!
return "TORCH_SDPA" | |
return "TORCH_SDPA_VLLM_V1" |
vllm/v1/worker/cpu_model_runner.py
I think we will also need a ModelRunnerBase class for v1, to avoid directly inheriting from the GPU runner.
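Such a base could look something like the following hypothetical sketch; none of these class or method names are existing vLLM APIs, it only illustrates the kind of device-agnostic interface being suggested.

```python
# Hypothetical sketch of a V1 ModelRunnerBase; names are illustrative only.
from abc import ABC, abstractmethod
from typing import Any


class ModelRunnerBase(ABC):
    """Device-agnostic interface shared by GPU/CPU/TPU V1 model runners."""

    @abstractmethod
    def load_model(self) -> None:
        """Load the model weights onto the runner's device."""

    @abstractmethod
    def initialize_kv_cache(self, kv_cache_config: Any) -> None:
        """Allocate the KV cache according to the engine's config."""

    @abstractmethod
    def execute_model(self, scheduler_output: Any) -> Any:
        """Run one forward pass for the scheduled batch."""
```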
Absolutely, though maybe that should be done once the GPU runner becomes stable. At least for now, the CPU V1 runner can fully reuse the GPU runner with limited additional changes.
I think we should put the v1 SDPA attention backend at vllm/v1/attention/backends to avoid coupling.
Good idea :)
ditto.
We should add a new AttentionImpl for v1 instead of using the legacy ones, because prefix caching and chunked prefill are always enabled by default.
Yes, chunked_prefill is always enabled when building metadata for v1. Right now I think the legacy impl is enough for enabling v1 on CPU, and perhaps we can provide a new unified CPU attn impl covering multiple attn features in the future.
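To make the chunked-prefill point concrete, here is a small self-contained sketch in plain PyTorch (not vLLM code): queries from several requests are packed into one tensor and sliced by query_start_loc, then each slice attends against that request's own KV history.

```python
# Minimal sketch of chunked-prefill style batching with torch SDPA.
import torch
import torch.nn.functional as F

num_heads, head_size = 4, 32
# Two requests in the batch: 5 and 3 new (query) tokens each.
query_start_loc = torch.tensor([0, 5, 8])
# Total KV tokens per request (cached context + new tokens).
kv_lens = [12, 7]

packed_q = torch.randn(8, num_heads, head_size)  # all query tokens, packed

outputs = []
for i, kv_len in enumerate(kv_lens):
    s, e = query_start_loc[i].item(), query_start_loc[i + 1].item()
    q = packed_q[s:e].transpose(0, 1)              # (heads, q_len, head_size)
    k = torch.randn(num_heads, kv_len, head_size)  # stand-in for cached keys
    v = torch.randn(num_heads, kv_len, head_size)  # stand-in for cached values
    # A real impl would build a causal mask against the KV history here; the
    # plain call below only demonstrates the data layout.
    outputs.append(F.scaled_dot_product_attention(q, k, v).transpose(0, 1))

packed_out = torch.cat(outputs, dim=0)  # back to (total_q_tokens, heads, d)
```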
> perhaps we can provide a new unified CPU attn impl for multiple attn features in the future.
Sure! In fact, I also expect FlexAttention to be supported on the CPU backend: #16078. Perhaps we can switch from SDPA to FlexAttention once it lands.
seq_lens=prefill_seq_lens,
seq_lens_tensor=seq_lens_tensor,
max_query_len=max_query_len,
max_kv_len=max_kv_len,
why does this file require changes?
# For chunked prefill only
max_query_len: Optional[int] = None
max_kv_len: Optional[int] = None
query_start_loc: Optional[torch.Tensor] = None
why does this file require changes?
It is a naming conflict. The V1 model runner uses query_start_loc specifically for logits indexing, and it contains all tokens in a batch. But in torch_sdpa, query_start_loc contains prefill tokens only, so it is renamed to prefill_query_start_loc.
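An illustrative sketch of the naming split described above (class and field names here are simplified/hypothetical, not the exact PR code): the runner-level tensor covers every token in the batch, while the SDPA metadata keeps a prefill-only tensor under the renamed field.

```python
# Simplified sketch of the two distinct cumulative-length tensors.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class RunnerBatchState:
    # Cumulative starts for *all* requests (decodes + prefills); the V1
    # model runner uses this for logits indexing.
    query_start_loc: torch.Tensor


@dataclass
class TorchSDPAMetadataSketch:
    # Cumulative starts for the *prefill* requests only; renamed so it does
    # not clash with the runner-level field above.
    prefill_query_start_loc: Optional[torch.Tensor] = None
    max_query_len: Optional[int] = None
    max_kv_len: Optional[int] = None
```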
I would try to keep the renaming inside torch_sdpa instead of making the global change.
num_kv_heads, head_size)

@staticmethod
def swap_blocks(
I don't think swap_blocks or copy_blocks are needed for V1.
Removed
Thanks for the contribution!
vllm/v1/worker/cpu_model_runner.py
This is a bit hacky. Have you considered just duplicating GPUModelRunner while cutting out all unsupported features and the unnecessary device/cpu double tensors?
We have done something similar for TPU: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/tpu_model_runner.py.
It's uglier on the eyes, but it's easier to control until things are more stable.
Exactly, it is a bit hacky. We actually did it that way in V0, and it worked well for cutting out some hardcoded features unsupported on CPU, such as CUDAGraph and FlashAttnBackend. But eventually the V0 CPU runner diverged from the GPU runner and couldn't leverage some new features and refactors directly.
Compared with V0, the V1 model runner has better abstractions, like unified input data management, vLLM compilation, and the attention backend builder and reorder, etc. These abstractions allow the CPU model runner to be much more compatible with the GPU one. Moreover, I think the V1 model runner has become relatively stable, as almost no changes were required when rebasing this PR.
So I would prefer to fully reuse the GPU model runner for now, even with some hacks. If some day we have to decouple them, I think it will not be very difficult.
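A hedged sketch of the reuse-by-inheritance approach being discussed; it assumes vLLM is installed, and the overridden hooks below are illustrative placeholders rather than the exact set of methods touched in this PR.

```python
# Sketch only: subclass the V1 GPU runner and override CUDA-specific pieces.
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class CPUModelRunnerSketch(GPUModelRunner):
    """Reuses the V1 GPU runner, replacing only CUDA-only behavior."""

    def capture_model(self) -> None:
        # CUDA graph capture does not apply to the CPU backend, so this
        # becomes a no-op (hypothetical simplification).
        pass

    def _sync_device(self) -> None:
        # Nothing to synchronize on CPU; a GPU runner would call
        # torch.cuda.synchronize() here (hypothetical hook name).
        pass
```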
While I definitely see your point and the strengths of leveraging a hierarchical structure, I am not 100% sure there won't be recurring hiccups where we have to fix the runner as an unintended consequence of some other unrelated PR.
Still, I am not against this implementation; perhaps we could do a better job at writing platform-aware code in other parts of the codebase.
E.g. (not for this PR) abstracting away those cpu/device tensors in a base class so that we don't have to resort to this workaround here.
vllm/platforms/cpu.py
This is new?
Yes, I think cascade attention is only supported in the flash-attn backend, so I disable it here.
I also noticed support_sleep_mode has become a platform attribute, so I removed the check here.
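Purely illustrative (the class and hook names here are hypothetical, not existing vLLM APIs): one way a CPU platform could opt out of cascade attention, which currently relies on the CUDA-only FlashAttention backend.

```python
# Hypothetical sketch of a platform-level opt-out for cascade attention.
class CpuPlatformSketch:

    @classmethod
    def supports_cascade_attention(cls) -> bool:
        # Cascade attention is built on top of flash-attn, which is not
        # available on CPU, so report it as unsupported.
        return False
```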
Signed-off-by: jiang.li <jiang1.li@intel.com>
Hi @simon-mo, all required tests are now green. We enabled CPU V1 as the default, and all CPU tests compatible with the V1 engine passed: https://buildkite.com/vllm/ci/builds/21328#0197361e-3946-4915-953d-7a8baeae1137. I think this PR is ready to merge, please take a look, thanks :)
Resolves #16056.
Supports all features listed in the CPU doc except the FP8 KV cache.
Changes
- CPUWorker and CPUModelRunner, derived from Worker and GPUModelRunner to reduce code duplication.
- TorchSDPABackend with compatible interfaces for GPUModelRunner, such as reorder_batch and build.
- GPUModelRunner changes to avoid importing flash-attn explicitly and using the default nccl dist backend.
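A quick usage sketch of what this PR enables: driving the CPU backend with the V1 engine. VLLM_USE_V1 and VLLM_CPU_KVCACHE_SPACE are existing vLLM environment variables; the model name and values are just examples.

```python
# Example only: run the V1 engine on the CPU backend.
import os

os.environ["VLLM_USE_V1"] = "1"             # opt in to the V1 engine
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"  # CPU KV cache size, in GiB

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")        # example model, runs on CPU
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```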