Support embedding models in V1 #16188

Merged
merged 98 commits on Jun 19, 2025
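As context for the changes below, here is a minimal offline sketch of what this PR enables: running an embedding (pooling) model on the V1 engine. The model name and engine flags mirror the example defaults touched in this PR; the output field access is an assumption about the current embed API, not something this PR defines.

import os

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine before importing vllm

from vllm import LLM

llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",  # model from the example diff below
    task="embed",
    enforce_eager=True,
    max_model_len=1024,
)
outputs = llm.embed(["vLLM is a fast inference engine."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality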
Commits (98)
f36c4f9
Remove guardrails that prevent V1 from trying to run embedding models
maxdebayser Mar 24, 2025
acf4638
hack v1 flash_attn to support encoder_only
maxdebayser Apr 3, 2025
b13bbc0
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 3, 2025
8debea0
Revert changes to disable kv caching for encoder-only models
maxdebayser Apr 3, 2025
8d97b9c
Add pooling support in v1
maxdebayser Apr 5, 2025
d60b22b
First end-to-end working version of Bert embeddings in V1
maxdebayser Apr 7, 2025
6bebbb8
Support warmup for pooling models in V1
maxdebayser Apr 7, 2025
6dafd71
address review comments
maxdebayser Apr 7, 2025
e2724a2
address review comments
maxdebayser Apr 7, 2025
56ff6cd
remove debug prints
maxdebayser Apr 7, 2025
fc57edd
address review comments
maxdebayser Apr 7, 2025
64a0e62
Fix cross encoder models in V1 and enable tests for pooling models
maxdebayser Apr 8, 2025
4014d41
address review comments
maxdebayser Apr 8, 2025
87a95a8
Merge branch 'main' into v1_embeddings
maxdebayser Apr 8, 2025
902c129
address review comments
maxdebayser Apr 8, 2025
2c68855
re-enable large embedding models
maxdebayser Apr 8, 2025
8afd8f5
address review comments
maxdebayser Apr 8, 2025
7762976
Merge branch 'main' into v1_embeddings
maxdebayser Apr 8, 2025
d7537ae
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 8, 2025
a9e7747
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 9, 2025
17520bd
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 14, 2025
90c611a
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 15, 2025
dec2441
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 17, 2025
a5e83f4
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 23, 2025
187f69b
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 24, 2025
69a0332
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 29, 2025
a9f1721
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 29, 2025
4b066a3
fix merge problems
maxdebayser Apr 30, 2025
43a26dc
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 30, 2025
ca34513
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 30, 2025
bf3033d
Fix missing qwen embedding model param
maxdebayser Apr 30, 2025
67bf727
Make pooling params reach the pooling in V1
maxdebayser May 1, 2025
93b6361
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 1, 2025
d916b88
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 10, 2025
bad4211
fix merge problems
maxdebayser May 10, 2025
35d9bd9
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 11, 2025
dcc6100
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 12, 2025
a4f85b5
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 13, 2025
a5f328a
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 15, 2025
7c5be88
fix merge problem
maxdebayser May 15, 2025
29b75c9
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 4, 2025
6aa204c
backport changes from the other PR
maxdebayser Jun 4, 2025
e81470c
fix merge errors
maxdebayser Jun 4, 2025
20e7140
address review comments
maxdebayser Jun 4, 2025
6bc1e3d
address review comments
maxdebayser Jun 4, 2025
22825bd
simplify PR
maxdebayser Jun 4, 2025
c889b2e
fix mistake
maxdebayser Jun 4, 2025
24462e4
workaround qwen model test issue
maxdebayser Jun 6, 2025
b5f21f2
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 6, 2025
79d1b95
revert unecessary change
maxdebayser Jun 6, 2025
b3a0491
remove duplicated code
maxdebayser Jun 6, 2025
b4ab556
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 6, 2025
1a82e56
remove encoder model support to simplify PR
maxdebayser Jun 7, 2025
a66801b
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 9, 2025
660dd9c
fix several tests
maxdebayser Jun 9, 2025
808c996
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 9, 2025
cdd70c9
Fix test
maxdebayser Jun 9, 2025
0832115
disable bert test
maxdebayser Jun 9, 2025
10bbf74
fix tests
maxdebayser Jun 9, 2025
ee892aa
limit context length to fit test GPU
maxdebayser Jun 9, 2025
2e12eba
limit context length to fit test GPU
maxdebayser Jun 9, 2025
14fcf24
fix test
maxdebayser Jun 10, 2025
0624435
fix test
maxdebayser Jun 10, 2025
706fdb2
Merge branch 'main' into v1_embeddings
22quinn Jun 10, 2025
051f6d4
Fix _construct_cached_request_state
22quinn Jun 10, 2025
214cf06
Fix v1 tests
22quinn Jun 10, 2025
8193bd0
Merge pull request #1 from 22quinn/v1_embeddings
maxdebayser Jun 10, 2025
65b8377
fix test
maxdebayser Jun 10, 2025
33d7f74
Merge branch 'v1_embeddings' of github.com:maxdebayser/vllm into v1_e…
maxdebayser Jun 10, 2025
4ee822a
reduce max_model_len to fit in test gpu
maxdebayser Jun 10, 2025
7242731
fix test
maxdebayser Jun 10, 2025
a4f460b
fix test
maxdebayser Jun 10, 2025
35ca640
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 12, 2025
17f6177
fix test
maxdebayser Jun 12, 2025
3f0d42e
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 12, 2025
74d73cc
use torch.split
maxdebayser Jun 12, 2025
e6a66dc
enable cuda graphs
maxdebayser Jun 12, 2025
4cca774
fix unecessary config.py changes
maxdebayser Jun 12, 2025
8ef1982
fix error message
maxdebayser Jun 12, 2025
28d00d1
remove unused import
maxdebayser Jun 12, 2025
e634f60
fix docstring
maxdebayser Jun 12, 2025
053475c
revert unnecessary code changes
maxdebayser Jun 12, 2025
6228f64
remove debug prints
maxdebayser Jun 12, 2025
42c802a
fix refactoring bug
maxdebayser Jun 12, 2025
f771a19
fix refactoring bug
maxdebayser Jun 12, 2025
02c47ad
Fix default chunked prefill for pooling models
maxdebayser Jun 13, 2025
1fd252c
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 13, 2025
c5c0d97
Revert handling of case that can never happen
maxdebayser Jun 13, 2025
acfc9cc
fix small bug
maxdebayser Jun 13, 2025
225b808
fix small bugs
maxdebayser Jun 13, 2025
2b86c13
fix silly mistake
maxdebayser Jun 13, 2025
2983252
reduce memory usage for small ci gpus
maxdebayser Jun 13, 2025
58c556d
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 13, 2025
878d56a
enable chunked prefill by default for models that support it
maxdebayser Jun 14, 2025
2db273f
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 14, 2025
114af27
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 16, 2025
bc0219d
address review comments
maxdebayser Jun 16, 2025
221f013
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Jun 19, 2025
5 changes: 4 additions & 1 deletion examples/offline_inference/basic/embed.py
@@ -12,7 +12,10 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="intfloat/e5-mistral-7b-instruct", task="embed", enforce_eager=True
model="intfloat/e5-mistral-7b-instruct",
task="embed",
enforce_eager=True,
max_model_len=1024,
)
return parser.parse_args()

1 change: 1 addition & 0 deletions examples/offline_inference/vision_language_embedding.py
@@ -94,6 +94,7 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
engine_args = EngineArgs(
model="TIGER-Lab/VLM2Vec-Full",
task="embed",
max_model_len=4096,
trust_remote_code=True,
mm_processor_kwargs={"num_crops": 4},
limit_mm_per_prompt={"image": 1},
32 changes: 18 additions & 14 deletions tests/compile/test_basic_correctness.py
@@ -31,7 +31,7 @@ class TestSetting:
# basic llama model
TestSetting(
model="meta-llama/Llama-3.2-1B-Instruct",
model_args=[],
model_args=["--max-model-len", "2048"],
pp_size=2,
tp_size=2,
attn_backend="FLASHINFER",
@@ -41,7 +41,7 @@ class TestSetting:
# llama model with quantization
TestSetting(
model="TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ",
model_args=["--quantization", "gptq"],
model_args=["--quantization", "gptq", "--max-model-len", "2048"],
pp_size=1,
tp_size=1,
attn_backend="FLASH_ATTN",
@@ -51,7 +51,7 @@ class TestSetting:
# MoE model
TestSetting(
model="ibm/PowerMoE-3b",
model_args=[],
model_args=["--max-model-len", "2048"],
pp_size=1,
tp_size=2,
attn_backend="FLASH_ATTN",
@@ -61,23 +61,27 @@ class TestSetting:
# embedding model
TestSetting(
model="BAAI/bge-multilingual-gemma2",
model_args=["--task", "embed", "--dtype", "bfloat16"],
model_args=[
"--task", "embed", "--dtype", "bfloat16", "--max-model-len",
"2048"
],
pp_size=1,
tp_size=1,
attn_backend="FLASH_ATTN",
method="encode",
fullgraph=True,
),
# encoder-based embedding model (BERT)
TestSetting(
model="BAAI/bge-base-en-v1.5",
model_args=["--task", "embed"],
pp_size=1,
tp_size=1,
attn_backend="XFORMERS",
method="encode",
fullgraph=True,
),
# TODO: bert models are not supported in V1 yet
# # encoder-based embedding model (BERT)
# TestSetting(
# model="BAAI/bge-base-en-v1.5",
# model_args=["--task", "embed"],
# pp_size=1,
# tp_size=1,
# attn_backend="XFORMERS",
# method="encode",
# fullgraph=True,
# ),
# vision language model
TestSetting(
model="microsoft/Phi-3.5-vision-instruct",
3 changes: 3 additions & 0 deletions tests/conftest.py
@@ -145,13 +145,16 @@ def run_with_both_engines(request, monkeypatch):
# Automatically runs tests twice, once with V1 and once without
use_v1 = request.param
# Tests decorated with `@skip_v1` are only run without v1
skip_v0 = request.node.get_closest_marker("skip_v0")
skip_v1 = request.node.get_closest_marker("skip_v1")

if use_v1:
if skip_v1:
pytest.skip("Skipping test on vllm V1")
monkeypatch.setenv('VLLM_USE_V1', '1')
else:
if skip_v0:
pytest.skip("Skipping test on vllm V0")
monkeypatch.setenv('VLLM_USE_V1', '0')

yield
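A small sketch of how a test module is expected to opt into this fixture together with the new skip_v0 marker; the test names are illustrative only, while the fixture pattern itself is the one added to the entrypoint tests below.

import pytest


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Runs every test in the module twice: once with VLLM_USE_V1=1, once with 0.
    pass


@pytest.mark.skip_v1
def test_v0_only_behavior():
    ...  # skipped on the V1 pass


@pytest.mark.skip_v0
def test_v1_only_behavior():
    ...  # skipped on the V0 pass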
24 changes: 20 additions & 4 deletions tests/entrypoints/llm/test_encode.py
@@ -8,6 +8,8 @@
from vllm import LLM, PoolingParams, PoolingRequestOutput
from vllm.distributed import cleanup_dist_env_and_memory

from ...models.utils import check_embeddings_close

MODEL_NAME = "intfloat/multilingual-e5-small"

PROMPTS = [
@@ -27,6 +29,14 @@
]


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
# Simple autouse wrapper to run both engines for each test
# This can be promoted up to conftest.py to run for every
# test in a package
pass


@pytest.fixture(scope="module")
def llm():
# pytest caches the fixture so we use weakref.proxy to
@@ -46,9 +56,15 @@ def llm():
cleanup_dist_env_and_memory()


def assert_outputs_equal(o1: list[PoolingRequestOutput],
def assert_outputs_match(o1: list[PoolingRequestOutput],
o2: list[PoolingRequestOutput]):
assert [o.outputs for o in o1] == [o.outputs for o in o2]
check_embeddings_close(
embeddings_0_lst=[o.outputs.data for o in o1],
embeddings_1_lst=[o.outputs.data for o in o2],
name_0="hf",
name_1="vllm",
tol=1e-2,
)


@pytest.mark.skip_global_cleanup
@@ -63,7 +79,7 @@ def test_v1_v2_api_consistency_single_prompt_tokens(llm: LLM,

v2_output = llm.encode({"prompt_token_ids": prompt_token_ids},
pooling_params=pooling_params)
assert_outputs_equal(v1_output, v2_output)
assert_outputs_match(v1_output, v2_output)


@pytest.mark.skip_global_cleanup
@@ -80,7 +96,7 @@ def test_v1_v2_api_consistency_multi_prompt_tokens(llm: LLM):
} for p in TOKEN_IDS],
pooling_params=pooling_params,
)
assert_outputs_equal(v1_output, v2_output)
assert_outputs_match(v1_output, v2_output)


@pytest.mark.skip_global_cleanup
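The switch from exact equality to check_embeddings_close with tol=1e-2 is presumably because V0 and V1 produce numerically close but not bit-identical embeddings. A rough sketch of the idea (not the actual helper in tests/models/utils.py), assuming a cosine-distance tolerance:

import math


def embeddings_close(a: list[float], b: list[float], tol: float = 1e-2) -> bool:
    # Compare by cosine distance instead of exact element-wise equality.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm <= tol


assert embeddings_close([0.1, 0.2, 0.3], [0.1001, 0.2001, 0.2999])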
8 changes: 8 additions & 0 deletions tests/entrypoints/openai/test_embedding.py
@@ -21,6 +21,14 @@
DTYPE = "bfloat16"


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
# Simple autouse wrapper to run both engines for each test
# This can be promoted up to conftest.py to run for every
# test in a package
pass


@pytest.fixture(scope="module")
def server():
args = [
15 changes: 11 additions & 4 deletions tests/entrypoints/openai/test_pooling.py
@@ -7,6 +7,7 @@
import pytest
import requests

from tests.models.utils import check_embeddings_close
from vllm.entrypoints.openai.protocol import PoolingResponse
from vllm.transformers_utils.tokenizer import get_tokenizer

@@ -223,8 +224,11 @@ async def test_batch_base64_pooling(server: RemoteOpenAIServer,
np.frombuffer(base64.b64decode(data.data),
dtype="float32").tolist())

assert responses_float.data[0].data == decoded_responses_base64_data[0]
assert responses_float.data[1].data == decoded_responses_base64_data[1]
check_embeddings_close(
embeddings_0_lst=[d.data for d in responses_float.data],
embeddings_1_lst=decoded_responses_base64_data,
name_0="float32",
name_1="base64")

# Default response is float32 decoded from base64 by OpenAI Client
default_response = requests.post(
Expand All @@ -237,5 +241,8 @@ async def test_batch_base64_pooling(server: RemoteOpenAIServer,
default_response.raise_for_status()
responses_default = PoolingResponse.model_validate(default_response.json())

assert responses_float.data[0].data == responses_default.data[0].data
assert responses_float.data[1].data == responses_default.data[1].data
check_embeddings_close(
embeddings_0_lst=[d.data for d in responses_float.data],
embeddings_1_lst=[d.data for d in responses_default.data],
name_0="float32",
name_1="default")
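For reference, a client-side sketch of the base64 pooling path exercised above. The decoding follows the test; the endpoint path, payload fields, server address, and model name are assumptions based on the embeddings-style API rather than code from this PR.

import base64

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8000/pooling",  # placeholder server address
    json={
        "model": "intfloat/multilingual-e5-small",  # placeholder model
        "input": ["hello world"],
        "encoding_format": "base64",
    },
)
resp.raise_for_status()
for item in resp.json()["data"]:
    # Pooled outputs arrive base64-encoded; decode back to float32 vectors.
    vec = np.frombuffer(base64.b64decode(item["data"]), dtype="float32")
    print(vec.shape)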
8 changes: 8 additions & 0 deletions tests/entrypoints/openai/test_rerank.py
@@ -12,6 +12,14 @@
DTYPE = "bfloat16"


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
# Simple autouse wrapper to run both engines for each test
# This can be promoted up to conftest.py to run for every
# test in a package
pass


@pytest.fixture(scope="module")
def server():
args = ["--enforce-eager", "--max-model-len", "100", "--dtype", DTYPE]
9 changes: 9 additions & 0 deletions tests/entrypoints/openai/test_score.py
@@ -11,6 +11,15 @@

from ...utils import RemoteOpenAIServer


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
# Simple autouse wrapper to run both engines for each test
# This can be promoted up to conftest.py to run for every
# test in a package
pass


MODELS = [
{
"name": "BAAI/bge-reranker-v2-m3",
10 changes: 9 additions & 1 deletion tests/models/language/pooling/test_classification.py
@@ -6,6 +6,14 @@

from vllm.platforms import current_platform

# TODO: enable when float32 is supported by V1
# @pytest.fixture(autouse=True)
# def v1(run_with_both_engines):
# # Simple autouse wrapper to run both engines for each test
# # This can be promoted up to conftest.py to run for every
# # test in a package
# pass


@pytest.mark.parametrize(
"model",
@@ -29,7 +37,7 @@ def test_models(
# switch to use ROCm CK FA backend
monkeypatch.setenv("VLLM_USE_TRITON_FLASH_ATTN", "False")

with vllm_runner(model, dtype=dtype) as vllm_model:
with vllm_runner(model, max_model_len=512, dtype=dtype) as vllm_model:
vllm_outputs = vllm_model.classify(example_prompts)

with hf_runner(model,
34 changes: 27 additions & 7 deletions tests/models/language/pooling/test_embedding.py
@@ -8,6 +8,14 @@
from ...utils import check_embeddings_close


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
# Simple autouse wrapper to run both engines for each test
# This can be promoted up to conftest.py to run for every
# test in a package
pass


@pytest.mark.parametrize(
"model",
[
@@ -20,15 +28,27 @@
marks=[pytest.mark.core_model]),
pytest.param("intfloat/e5-mistral-7b-instruct",
marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
# the qwen models interfere with each other (see PR
# https://github.com/vllm-project/vllm/pull/18720).
# To avoid this problem, for now we skip v0 since it will be
# deprecated anyway.
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base",
marks=[pytest.mark.skip_v0]),
# [Encoder-only]
pytest.param("BAAI/bge-base-en-v1.5",
marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("sentence-transformers/all-MiniLM-L12-v2"),
pytest.param("intfloat/multilingual-e5-small"),
pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
marks=[
pytest.mark.core_model, pytest.mark.cpu_model,
pytest.mark.skip_v1
]),
pytest.param("sentence-transformers/all-MiniLM-L12-v2",
marks=[pytest.mark.skip_v1]),
pytest.param("intfloat/multilingual-e5-small",
marks=[pytest.mark.skip_v1]),
pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
marks=[pytest.mark.skip_v1]),
# [Cross-Encoder]
pytest.param("sentence-transformers/stsb-roberta-base-v2"),
pytest.param("sentence-transformers/stsb-roberta-base-v2",
marks=[pytest.mark.skip_v1]),
],
)
def test_models(
@@ -62,7 +82,7 @@ def test_models(

with vllm_runner(model,
task="embed",
max_model_len=None,
max_model_len=512,
**vllm_extra_kwargs) as vllm_model:
vllm_outputs = vllm_model.encode(example_prompts)

22 changes: 11 additions & 11 deletions tests/models/registry.py
@@ -265,8 +265,8 @@ def check_available_online(

_EMBEDDING_EXAMPLE_MODELS = {
# [Text-only]
"BertModel": _HfExamplesInfo("BAAI/bge-base-en-v1.5"),
"Gemma2Model": _HfExamplesInfo("BAAI/bge-multilingual-gemma2"),
"BertModel": _HfExamplesInfo("BAAI/bge-base-en-v1.5", v0_only=True),
"Gemma2Model": _HfExamplesInfo("BAAI/bge-multilingual-gemma2", v0_only=True), # noqa: E501
"GritLM": _HfExamplesInfo("parasail-ai/GritLM-7B-vllm"),
"GteModel": _HfExamplesInfo("Snowflake/snowflake-arctic-embed-m-v2.0",
trust_remote_code=True),
@@ -279,16 +279,16 @@ def check_available_online(
"LlamaModel": _HfExamplesInfo("llama", is_available_online=False),
"MistralModel": _HfExamplesInfo("intfloat/e5-mistral-7b-instruct"),
"ModernBertModel": _HfExamplesInfo("Alibaba-NLP/gte-modernbert-base",
trust_remote_code=True),
trust_remote_code=True, v0_only=True),
"NomicBertModel": _HfExamplesInfo("nomic-ai/nomic-embed-text-v2-moe",
trust_remote_code=True),
trust_remote_code=True, v0_only=True), # noqa: E501
"Qwen2Model": _HfExamplesInfo("ssmits/Qwen2-7B-Instruct-embed-base"),
"Qwen2ForRewardModel": _HfExamplesInfo("Qwen/Qwen2.5-Math-RM-72B"),
"Qwen2ForProcessRewardModel": _HfExamplesInfo("Qwen/Qwen2.5-Math-PRM-7B"),
"Qwen2ForSequenceClassification": _HfExamplesInfo("jason9693/Qwen2.5-1.5B-apeach"), # noqa: E501
"RobertaModel": _HfExamplesInfo("sentence-transformers/stsb-roberta-base-v2"), # noqa: E501
"RobertaForMaskedLM": _HfExamplesInfo("sentence-transformers/all-roberta-large-v1"), # noqa: E501
"XLMRobertaModel": _HfExamplesInfo("intfloat/multilingual-e5-small"),
"RobertaModel": _HfExamplesInfo("sentence-transformers/stsb-roberta-base-v2", v0_only=True), # noqa: E501
"RobertaForMaskedLM": _HfExamplesInfo("sentence-transformers/all-roberta-large-v1", v0_only=True), # noqa: E501
"XLMRobertaModel": _HfExamplesInfo("intfloat/multilingual-e5-small", v0_only=True), # noqa: E501
# [Multimodal]
"LlavaNextForConditionalGeneration": _HfExamplesInfo("royokong/e5-v"),
"Phi3VForCausalLM": _HfExamplesInfo("TIGER-Lab/VLM2Vec-Full",
@@ -300,10 +300,10 @@ def check_available_online(

_CROSS_ENCODER_EXAMPLE_MODELS = {
# [Text-only]
"BertForSequenceClassification": _HfExamplesInfo("cross-encoder/ms-marco-MiniLM-L-6-v2"), # noqa: E501
"RobertaForSequenceClassification": _HfExamplesInfo("cross-encoder/quora-roberta-base"), # noqa: E501
"XLMRobertaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-m3"), # noqa: E501
"ModernBertForSequenceClassification": _HfExamplesInfo("Alibaba-NLP/gte-reranker-modernbert-base"), # noqa: E501
"BertForSequenceClassification": _HfExamplesInfo("cross-encoder/ms-marco-MiniLM-L-6-v2", v0_only=True), # noqa: E501
"RobertaForSequenceClassification": _HfExamplesInfo("cross-encoder/quora-roberta-base", v0_only=True), # noqa: E501
"XLMRobertaForSequenceClassification": _HfExamplesInfo("BAAI/bge-reranker-v2-m3", v0_only=True), # noqa: E501
"ModernBertForSequenceClassification": _HfExamplesInfo("Alibaba-NLP/gte-reranker-modernbert-base", v0_only=True), # noqa: E501
}

_MULTIMODAL_EXAMPLE_MODELS = {
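The new v0_only flag marks example models that should, for now, only be exercised on the V0 engine. A hypothetical illustration of the intended gating; the helper name and mechanism are assumptions, not code from this PR.

import os

import pytest


def skip_if_v1_unsupported(info) -> None:
    # Hypothetical helper: skip the current test when a v0_only example model
    # would otherwise run under the V1 engine.
    if getattr(info, "v0_only", False) and os.environ.get("VLLM_USE_V1") == "1":
        pytest.skip("model is not supported by the V1 engine yet")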
1 change: 1 addition & 0 deletions tests/tokenization/test_detokenize.py
@@ -68,6 +68,7 @@ def _run_incremental_decode(tokenizer,
None,
params,
None,
None,
0.0,
None,
cache_salt=None,
1 change: 1 addition & 0 deletions tests/v1/core/test_kv_cache_utils.py
@@ -43,6 +43,7 @@ def make_request(request_id,
multi_modal_hashes=mm_hashes,
multi_modal_placeholders=mm_positions,
sampling_params=SamplingParams(max_tokens=17),
pooling_params=None,
eos_token_id=100,
lora_request=None,
cache_salt=cache_salt,
1 change: 1 addition & 0 deletions tests/v1/core/test_prefix_caching.py
@@ -39,6 +39,7 @@ def make_request(request_id,
multi_modal_placeholders=mm_positions,
sampling_params=SamplingParams(max_tokens=17,
prompt_logprobs=prompt_logprobs),
pooling_params=None,
eos_token_id=100,
lora_request=None,
cache_salt=cache_salt,