
Conversation

@noooop (Collaborator) commented Oct 8, 2025

Improve all pooling tasks

These PRs mostly conflict with one another, so combining them into a series gives reviewers better context on what has changed and what still needs to be done afterward.

Purpose

FIX #26248

  1. Eliminate the overhead of converting tensor -> list[float] -> numpy in the OpenAI API server.
  2. Support float32, float16, bfloat16, fp8_e4m3, and fp8_e5m2 embed dtypes.
  3. fp8 works better than expected and may have practical value; please test thoroughly before use, and feedback is welcome.
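
For context, here is a minimal client-side sketch of what the new parameter enables, in the spirit of examples/online_serving/pooling/openai_embedding_embed_dtype_client.py. Everything beyond the embed_dtype parameter itself (server URL, exact response layout) is an assumption based on the OpenAI-compatible API:

```python
# Minimal sketch: request base64 embeddings in a non-fp32 dtype and decode
# them client-side. The server URL and response layout are assumptions;
# "embed_dtype" is the parameter added in this PR.
import base64

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "jinaai/jina-embeddings-v3",
        "input": "vLLM makes embedding serving easy.",
        "encoding_format": "base64",
        "embed_dtype": "float16",  # or float32 / bfloat16 / fp8_e4m3 / fp8_e5m2
    },
)
raw = base64.b64decode(resp.json()["data"][0]["embedding"])
# float16 maps directly onto a NumPy dtype; bfloat16/fp8 would need
# ml_dtypes or torch on the client side.
embedding = np.frombuffer(raw, dtype=np.float16).astype(np.float32)
print(embedding.shape)
```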

mteb test PTAL #17175
https://github.com/noooop/snippet/blob/main/benchmarks/test_mteb/test_embed_dtype.py


float32 ≈ float16 > bfloat16 > fp8_e4m3 >> fp8_e5m2

| model_name | float32 | float16 | bfloat16 | fp8_e4m3 | fp8_e5m2 |
|---|---|---|---|---|---|
| jinaai/jina-embeddings-v3 | 0.824336501 | 0.824339743 | 0.824335268 | 0.824326947 | 0.824323946 |
| BAAI/bge-m3 | 0.78734317 | 0.787339618 | 0.787350117 | 0.787262401 | 0.78773927 |
| intfloat/multilingual-e5-base | 0.779325776 | 0.779326888 | 0.779314874 | 0.779537205 | 0.779854464 |
| BAAI/bge-base-en | 0.779337092 | 0.779340641 | 0.779370285 | 0.779269983 | 0.779406319 |
| Alibaba-NLP/gte-multilingual-base | 0.775074375 | 0.775069325 | 0.77506051 | 0.775014809 | 0.774906984 |
| Qwen/Qwen3-Embedding-0.6B | 0.771163535 | 0.771168301 | 0.771179591 | 0.771177116 | 0.771265142 |
| thenlper/gte-large | 0.768076652 | 0.768071025 | 0.768082659 | 0.768333837 | 0.767567718 |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.758472692 | 0.75847403 | 0.758461189 | 0.758305555 | 0.758201921 |
| BAAI/bge-code-v1 | 0.757253707 | 0.757255983 | 0.757230987 | 0.75725761 | 0.756827524 |
| Alibaba-NLP/gte-modernbert-base | 0.748189414 | 0.748189165 | 0.748210671 | 0.748433599 | 0.744441598 |
| google/embeddinggemma-300m | 0.747381858 | 0.747379788 | 0.747383603 | 0.747316076 | 0.747721116 |
| intfloat/e5-small | 0.742285948 | 0.742284024 | 0.742277706 | 0.741824568 | 0.742915194 |
| nomic-ai/nomic-embed-text-v1 | 0.737569632 | 0.737570019 | 0.737581491 | 0.73745691 | 0.737700022 |
| nomic-ai/nomic-embed-text-v2-moe | 0.559459109 | 0.559442844 | 0.559431305 | 0.559320019 | 0.559509537 |
| Snowflake/snowflake-arctic-embed-xs | 0.714928682 | 0.714930751 | 0.714984003 | 0.714738386 | 0.712830927 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.712258007 | 0.712257842 | 0.712263053 | 0.712389646 | 0.711971434 |
| Snowflake/snowflake-arctic-embed-m-v2.0 | 0.706623072 | 0.706631128 | 0.706615263 | 0.706433834 | 0.706279522 |
| TencentBAC/Conan-embedding-v1 | 0.688612388 | 0.688609943 | 0.68862858 | 0.688503053 | 0.688422993 |
| Snowflake/snowflake-arctic-embed-m-long | 0.681144894 | 0.681143862 | 0.681128579 | 0.681404862 | 0.678821797 |
| Snowflake/snowflake-arctic-embed-m-v1.5 | 0.649088528 | 0.649089243 | 0.649064267 | 0.648947297 | 0.649799194 |

Even with fp8_e5m2, the quality gap is smaller than one might expect.
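
To put the payload savings in perspective, a quick back-of-the-envelope calculation (my own numbers, not from the PR):

```python
# Raw vs. base64 payload size for a 1024-dim embedding at each dtype.
# Base64 inflates data by roughly 4/3.
dim = 1024
itemsizes = {"float32": 4, "float16": 2, "bfloat16": 2,
             "fp8_e4m3": 1, "fp8_e5m2": 1}
for name, size in itemsizes.items():
    raw = dim * size
    print(f"{name:>8}: {raw:5d} B raw, ~{raw * 4 // 3:5d} B base64")
```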

Test Plan

tests/entrypoints/pooling/openai/test_embedding.py
tests/entrypoints/pooling/openai/test_pooling.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wang.yuqi <noooop@126.com>
@mergify mergify bot added the frontend label Oct 8, 2025
@noooop noooop changed the title [Model] FP16 Embedding Base64 [Model] Support FP16 Embedding Base64 (Still uses fp32 by default). Oct 8, 2025
@noooop noooop changed the title [Model] Support FP16 Embedding Base64 (Still uses fp32 by default). [Frontend] Support FP16 Embedding Base64 (Still uses fp32 by default). Oct 8, 2025
noooop added 2 commits October 9, 2025 12:16
Signed-off-by: wang.yuqi <noooop@126.com>
mergify bot commented Oct 9, 2025

Documentation preview: https://vllm--26414.org.readthedocs.build/en/26414/

@mergify mergify bot added the documentation label Oct 9, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop (Collaborator, Author) commented Oct 9, 2025

@uasan

examples/online_serving/pooling/openai_embedding_embed_dtype_client.py

Are you OK with this API?

Yes, this PR can even use fp8. The small-scale test results are quite good; a more detailed test will be provided tomorrow.

Signed-off-by: wang.yuqi <noooop@126.com>
@uasan
Copy link

uasan commented Oct 9, 2025

@noooop Yes, I'm quite satisfied with embed_dtype, thank you!

There is also an optional enhancement. Binary protocols, such as Postgres's, always expect big-endian numbers; this is the de facto network standard for almost all binary protocols. Models, however, typically operate in little-endian format, so a byte-order conversion is always necessary.

Adding an endian parameter would therefore be useful.
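
As a rough illustration of the conversion described here (an editorial sketch using NumPy, not code from this PR):

```python
# Sketch: embeddings are produced little-endian, but binary wire protocols
# like Postgres's expect big-endian floats, so a byte swap is needed.
import numpy as np

emb = np.random.rand(4).astype("<f4")    # little-endian float32
wire = emb.astype(">f4").tobytes()       # big-endian bytes for the protocol
back = np.frombuffer(wire, dtype=">f4")  # decoding on the receiving side
assert np.array_equal(emb, back)         # values survive the round trip
```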

@noooop noooop marked this pull request as ready for review October 9, 2025 15:29
@noooop (Collaborator, Author) commented Oct 9, 2025

cc @DarkLight1337 @maxdebayser

Ready for review

  • float32 ≈ float16 > bfloat16 > fp8_e4m3 >> fp8_e5m2. Do we need to support fp8 embed dtypes at all?
  • (I guess fp8 ue8m0 + scale & bias might be a better choice, from the perspective of model and KV-cache quantization; see the sketch below.)
  • Do we need to add an endian parameter?
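
An editorial sketch of a scale-only variant of the "scale & bias" idea mentioned above, using int8 as a stand-in since NumPy has no native fp8 type (illustrative only, not part of this PR):

```python
# Quantize with a per-vector scale so the 8-bit payload covers the vector's
# actual dynamic range; the scale travels alongside the payload.
import numpy as np

emb = np.random.randn(1024).astype(np.float32)
scale = np.abs(emb).max() / 127.0
q = np.round(emb / scale).astype(np.int8)   # 1 byte/dim + one float32 scale
deq = q.astype(np.float32) * scale
print("max abs error:", np.abs(emb - deq).max())
```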

@chatgpt-codex-connector (bot) left automated Codex review suggestions for this pull request.

@maxdebayser (Contributor) left a comment

Awesome. I've left a few comments, but this looks good to me.

@mergify mergify bot removed the needs-rebase label Oct 13, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop (Collaborator, Author) commented Oct 13, 2025

cc @DarkLight1337

Is there anything else that needs to be modified in this PR?

@DarkLight1337 (Member) left a comment

Nope, this LGTM now, thanks.

@noooop noooop enabled auto-merge (squash) October 13, 2025 17:13
@noooop noooop merged commit d2a7938 into vllm-project:main Oct 13, 2025
49 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: 1994 <1994@users.noreply.github.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
@noooop noooop deleted the embed_fp16 branch October 16, 2025 00:51
@uasan commented Oct 17, 2025

> hi @uasan, I'm not 100% clear about the use cases for endianness. Please describe them in another issue.
>
>   • How do you plan to load big-endian or little-endian data?
>   • Does it need to go through a proxy layer?
>   • Or do you want to integrate with an API or system that also requires converting binary to base64?

@noooop I described it here: #27063

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…e64 (Still uses fp32 by default). (vllm-project#26414)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Labels

documentation, frontend, ready

Development

Successfully merging this pull request may close: [Feature]: FP16 Embedding Base64 (#26248)

5 participants