[Performance] V1 Pooling Models E2E Performance Optimization #23162
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
…uses cuda sync Signed-off-by: wang.yuqi <noooop@126.com>
@noooop, nice work! There are two optimizations in this PR: the RoBERTa position ids and removing the split of the hidden states. Do you have benchmark results on how much each one individually changes the performance compared to the …
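For context on the first optimization: RoBERTa derives position ids from the padding mask. Below is a minimal sketch of computing them entirely on the device, based on the standard Hugging Face formula (with RoBERTa's default `padding_idx=1`); it is illustrative and not necessarily the exact code in this PR.

```python
import torch

def roberta_position_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # Non-padding tokens get increasing positions starting at padding_idx + 1;
    # padding tokens keep padding_idx. Everything stays on input_ids' device,
    # so no host/device round-trip (and no CUDA sync) is needed.
    mask = input_ids.ne(padding_idx).int()
    incremental = torch.cumsum(mask, dim=-1) * mask
    return incremental.long() + padding_idx
```

Keeping this computation on-device matters most when there are many small requests, since per-request host synchronization is pure overhead.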
Please expand the details section below for the full numbers.
These optimizations only show significant improvements when handling many small requests; they have almost no impact on large requests, since the backbone network latency dominates in such cases.
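The second optimization (avoiding the per-request split of the hidden states) can be illustrated with a hedged sketch: instead of `hidden_states.split(prompt_lens)` followed by a Python-level loop, mean pooling can be expressed as one segmented reduction. The function name and shapes below are illustrative assumptions, not the PR's actual code.

```python
import torch

def mean_pool_batched(hidden_states: torch.Tensor, prompt_lens: list[int]) -> torch.Tensor:
    # hidden_states: [total_tokens, hidden_size], the flattened batch.
    # A single index_add_ replaces a Python loop over per-request chunks,
    # which is where CPU overhead shows up for many small requests.
    lens = torch.as_tensor(prompt_lens, device=hidden_states.device)
    seg_ids = torch.repeat_interleave(
        torch.arange(len(prompt_lens), device=hidden_states.device), lens)
    sums = torch.zeros(len(prompt_lens), hidden_states.size(-1),
                       dtype=hidden_states.dtype, device=hidden_states.device)
    sums.index_add_(0, seg_ids, hidden_states)
    return sums / lens.unsqueeze(-1).to(hidden_states.dtype)
```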
By the way, since #22878 the pooling models' MTEB test uses enforce_eager, and mteb_test_embed_models has become less (so far, never) flaky. I'll reset the threshold back to MTEB_EMBED_TOL = 1e-4 next month. I'm not 100% sure the flakiness was caused by torch.compile, but it very likely was.
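For reference, tightening the threshold back to MTEB_EMBED_TOL = 1e-4 amounts to a check along these lines (a sketch; the helper name and scores are illustrative, and the actual test code in vLLM may differ):

```python
MTEB_EMBED_TOL = 1e-4  # the tolerance the comment plans to restore

def embed_score_matches(main_score: float, expected_score: float,
                        tol: float = MTEB_EMBED_TOL) -> bool:
    # A flaky run is one where the model's MTEB score drifts
    # beyond the tolerance from the recorded expected score.
    return abs(main_score - expected_score) <= tol
```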
Tests pass so this should be good to go, thanks!
Purpose
V1 Pooling Models E2E Performance Optimization
main: CPU 67.134 ms, CUDA 46.417 ms -> this PR: CPU 39.740 ms, CUDA 39.890 ms
benchmarks:
Result:
Frankly speaking, these optimizations only show significant improvements when handling many small requests; they have almost no impact on large requests, since the backbone network latency dominates in such cases.
Of course, a faster implementation is always cooler.
Detailed per-commit results:
6d8f55e: CPU: 67.134ms CUDA: 46.417ms -> CPU: 53.261ms CUDA: 42.820ms
Correct Testing
8ff2418: CPU: 53.261ms CUDA: 42.820ms -> CPU: 43.016ms CUDA: 42.516ms
13e44df: CPU: 43.016ms CUDA: 42.516ms -> CPU: 42.463ms CUDA: 40.629ms
876cb9a: CPU: 42.463ms CUDA: 40.629ms -> CPU: 39.934ms CUDA: 40.128ms
93d0a95: CPU: 39.934ms CUDA: 40.128ms -> CPU: 39.509ms CUDA: 40.094ms
f649899: CPU: 39.509ms CUDA: 40.094ms -> CPU: 39.317ms CUDA: 40.109ms
d7fa9a8: CPU: 39.317ms CUDA: 40.109ms -> CPU: 39.182ms CUDA: 39.981ms
2f10806: CPU: 39.182ms CUDA: 39.981ms -> CPU: 39.740ms CUDA: 39.890ms
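As a hedged sketch of why the numbers above report CPU and CUDA time separately: the GPU executes kernel launches asynchronously, so host wall-clock time alone under-reports device work. Assuming a PyTorch CUDA-events setup (the harness below is illustrative, not the profiler actually used for this PR):

```python
import time
import torch

def measure_cpu_and_cuda_ms(fn, iters: int = 100):
    # CPU time: host wall clock around the launch loop.
    # CUDA time: event-based GPU time, which keeps accumulating after the
    # host has already returned from the asynchronous kernel launches.
    if torch.cuda.is_available():
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start_evt.record()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        cpu_ms = (time.perf_counter() - t0) * 1000 / iters
        end_evt.record()
        torch.cuda.synchronize()
        cuda_ms = start_evt.elapsed_time(end_evt) / iters
        return cpu_ms, cuda_ms
    # CPU-only fallback: no separate device time to report.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000 / iters, 0.0
```

On this view, "CPU 39.740 ms, CUDA 39.890 ms" means the host is no longer the bottleneck: launch overhead has been driven down to roughly the device time.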
Test Plan
Keep CI green.
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.