[Performance] V1 Pooling Models E2E Performance Optimization #23162
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
…uses cuda sync Signed-off-by: wang.yuqi <noooop@126.com>
@noooop, nice work! There are two optimizations in this PR: the RoBERTa position ids and removing the split of the hidden states. Do you have benchmark results on how much each one individually changes the performance compared to the …
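For context on the first optimization: RoBERTa derives position ids from the padding mask. Below is a minimal sketch of computing them entirely on the device, based on the standard Hugging Face formula (with RoBERTa's default `padding_idx=1`); it is illustrative and not necessarily the exact code in this PR.

```python
import torch

def roberta_position_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # Non-padding tokens get increasing positions starting at padding_idx + 1;
    # padding tokens keep padding_idx. Everything stays on input_ids' device,
    # so no host/device round-trip (and no CUDA sync) is needed.
    mask = input_ids.ne(padding_idx).int()
    incremental = torch.cumsum(mask, dim=-1) * mask
    return incremental.long() + padding_idx
```

Keeping this computation on-device matters most when there are many small requests, since per-request host synchronization is pure overhead.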
Please expand the details section below for the full numbers.
These optimizations only show significant improvements when handling many small requests; they have almost no impact on large requests, since the backbone network latency dominates in such cases.
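The second optimization (avoiding the per-request split of the hidden states) can be illustrated with a hedged sketch: instead of `hidden_states.split(prompt_lens)` followed by a Python-level loop, mean pooling can be expressed as one segmented reduction. The function name and shapes below are illustrative assumptions, not the PR's actual code.

```python
import torch

def mean_pool_batched(hidden_states: torch.Tensor, prompt_lens: list[int]) -> torch.Tensor:
    # hidden_states: [total_tokens, hidden_size], the flattened batch.
    # A single index_add_ replaces a Python loop over per-request chunks,
    # which is where CPU overhead shows up for many small requests.
    lens = torch.as_tensor(prompt_lens, device=hidden_states.device)
    seg_ids = torch.repeat_interleave(
        torch.arange(len(prompt_lens), device=hidden_states.device), lens)
    sums = torch.zeros(len(prompt_lens), hidden_states.size(-1),
                       dtype=hidden_states.dtype, device=hidden_states.device)
    sums.index_add_(0, seg_ids, hidden_states)
    return sums / lens.unsqueeze(-1).to(hidden_states.dtype)
```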
By the way, since #22878 the pooling models' MTEB test uses enforce_eager, and mteb_test_embed_models has become less (so far, never) flaky. I'll reset the threshold back to MTEB_EMBED_TOL = 1e-4 next month. I'm not 100% sure the flakiness was caused by torch.compile, but it very likely was.
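For reference, tightening the threshold back to MTEB_EMBED_TOL = 1e-4 amounts to a check along these lines (a sketch; the helper name and scores are illustrative, and the actual test code in vLLM may differ):

```python
MTEB_EMBED_TOL = 1e-4  # the tolerance the comment plans to restore

def embed_score_matches(main_score: float, expected_score: float,
                        tol: float = MTEB_EMBED_TOL) -> bool:
    # A flaky run is one where the model's MTEB score drifts
    # beyond the tolerance from the recorded expected score.
    return abs(main_score - expected_score) <= tol
```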
Tests pass so this should be good to go, thanks!
Purpose
V1 Pooling Models E2E Performance Optimization
main: CPU 67.134 ms, CUDA 46.417 ms -> this PR: CPU 39.740 ms, CUDA 39.890 ms
benchmarks:
Result:
Frankly speaking, these optimizations only show significant improvements when handling many small requests; they have almost no impact on large requests, since the backbone network latency dominates in such cases.
Of course, a faster implementation is always cooler.
Detailed per-commit results:
6d8f55e: CPU: 67.134ms CUDA: 46.417ms -> CPU: 53.261ms CUDA: 42.820ms
Correct Testing
8ff2418: CPU: 53.261ms CUDA: 42.820ms -> CPU: 43.016ms CUDA: 42.516ms
13e44df: CPU: 43.016ms CUDA: 42.516ms -> CPU: 42.463ms CUDA: 40.629ms
876cb9a: CPU: 42.463ms CUDA: 40.629ms -> CPU: 39.934ms CUDA: 40.128ms
93d0a95: CPU: 39.934ms CUDA: 40.128ms -> CPU: 39.509ms CUDA: 40.094ms
f649899: CPU: 39.509ms CUDA: 40.094ms -> CPU: 39.317ms CUDA: 40.109ms
d7fa9a8: CPU: 39.317ms CUDA: 40.109ms -> CPU: 39.182ms CUDA: 39.981ms
2f10806: CPU: 39.182ms CUDA: 39.981ms -> CPU: 39.740ms CUDA: 39.890ms
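As a hedged sketch of why the numbers above report CPU and CUDA time separately: the GPU executes kernel launches asynchronously, so host wall-clock time alone under-reports device work. Assuming a PyTorch CUDA-events setup (the harness below is illustrative, not the profiler actually used for this PR):

```python
import time
import torch

def measure_cpu_and_cuda_ms(fn, iters: int = 100):
    # CPU time: host wall clock around the launch loop.
    # CUDA time: event-based GPU time, which keeps accumulating after the
    # host has already returned from the asynchronous kernel launches.
    if torch.cuda.is_available():
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start_evt.record()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        cpu_ms = (time.perf_counter() - t0) * 1000 / iters
        end_evt.record()
        torch.cuda.synchronize()
        cuda_ms = start_evt.elapsed_time(end_evt) / iters
        return cpu_ms, cuda_ms
    # CPU-only fallback: no separate device time to report.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000 / iters, 0.0
```

On this view, "CPU 39.740 ms, CUDA 39.890 ms" means the host is no longer the bottleneck: launch overhead has been driven down to roughly the device time.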
Test Plan
Keep CI green.
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.