Support Cross encoder models #10400

Merged
merged 30 commits into from
Nov 25, 2024
Commits
30 commits
3091c09
Add support for cross encoders
maxdebayser Nov 17, 2024
4f4d4be
Merge branch 'main' into cross_encoder
maxdebayser Nov 17, 2024
eadc8ed
Add support for roberta models, including BAAI/bge-reranker-v2-m3
maxdebayser Nov 17, 2024
b6a0092
remove task cross_encoding
maxdebayser Nov 17, 2024
5b17c70
address review comments, fix bug, clean up diff
maxdebayser Nov 18, 2024
1f02dfa
add cpu support
maxdebayser Nov 18, 2024
7d63ed1
Add a score() method top the LLM entrypoint
maxdebayser Nov 18, 2024
61e72c8
raise exception in case of MistralTokenizer
maxdebayser Nov 18, 2024
ecc3d10
Add tests for the LLM.score() method
maxdebayser Nov 19, 2024
6e3e654
refactor common code
maxdebayser Nov 19, 2024
4a797e8
do the AMD lazy loading thing
maxdebayser Nov 19, 2024
a90c408
load activation functions only from pytorch
maxdebayser Nov 19, 2024
9821f01
positional-only arguments
maxdebayser Nov 19, 2024
1045485
Merge branch 'upstream_main' into cross_encoder
maxdebayser Nov 20, 2024
62bfeaf
Adds /v1/score route
flaviabeo Nov 19, 2024
4108ad8
Adds score unit tests
flaviabeo Nov 21, 2024
910142b
fix unused code
maxdebayser Nov 21, 2024
a0aceb1
fix access to None in error handler
maxdebayser Nov 21, 2024
913451b
fix test code
maxdebayser Nov 21, 2024
8402a62
Merge branch 'main' into cross_encoder
maxdebayser Nov 22, 2024
43ecdcc
use resolve_obj_by_qualname
maxdebayser Nov 22, 2024
9d00b14
verify cross encoder
maxdebayser Nov 22, 2024
250436a
add registry tests
maxdebayser Nov 22, 2024
7191140
Adds API usage example at the docs
flaviabeo Nov 22, 2024
a374f79
Merge test registry + fix example lint
flaviabeo Nov 22, 2024
09d4ca6
yapf disble - conflicts with isort
flaviabeo Nov 22, 2024
024837b
Add Cross Encoders score API docs for OpenAI compatible page
flaviabeo Nov 22, 2024
cbc7364
Merge branch 'upstream_main' into cross_encoder
maxdebayser Nov 22, 2024
bc9ddc1
Merge branch 'upstream_main' into cross_encoder
maxdebayser Nov 24, 2024
e1e2f40
Moves score API section upMoves score API section upMoves score API s…
flaviabeo Nov 24, 2024
142 changes: 142 additions & 0 deletions docs/source/serving/openai_compatible_server.md
@@ -44,6 +44,148 @@ We currently support the following OpenAI APIs:
- This enables multi-modal inputs to be passed to embedding models, see [Using VLMs](../models/vlm.rst).
- *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*

## Score API for Cross Encoder Models

vLLM supports *cross encoder models* at the **/v1/score** endpoint, which is not part of the standard OpenAI API. Documentation for this kind of model is available at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

A ***Cross Encoder*** takes exactly two sentences or texts as input and predicts either a score or a label for the pair. For example, it can predict the similarity of the pair on a scale of 0 … 1.
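
To make the 0 … 1 scale concrete, here is a minimal sketch of one common way a single raw model output (a logit) can be mapped into that range with a sigmoid. This is an assumption for illustration only: the text above does not say how the `/v1/score` endpoint post-processes model outputs, and the `sigmoid` helper and the sample logits are hypothetical.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to the open interval (0, 1), avoiding overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Hypothetical logits: strongly negative, neutral, strongly positive.
for logit in (-6.0, 0.0, 6.0):
    print(f"logit={logit:+.1f} -> score={sigmoid(logit):.4f}")
```

Large positive logits approach 1 and large negative logits approach 0, which is consistent with a sentence pair being judged a strong match or a clear mismatch.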

### Example of usage for a pair of a string and a list of texts

In this case, the model compares the first text to each of the texts in the list.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        0.001094818115234375
      ]
    },
    {
      "index": 1,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```
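
A response like this is typically used to rank the candidate texts. The following is a minimal sketch, using only the standard library, that pairs each entry in `data` with the text at the same `index` in `text_2` and sorts by score. The `rank_documents` helper is hypothetical and not part of vLLM; the JSON is the sample response from this example.

```python
import json
from typing import List, Tuple

# Sample response body, copied from the string-and-list example.
response_text = """
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {"index": 0, "object": "score", "score": [0.001094818115234375]},
    {"index": 1, "object": "score", "score": [1]}
  ],
  "usage": {}
}
"""

# The texts that were sent as "text_2" in the request.
documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]


def rank_documents(response: dict, docs: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical helper: attach scores to texts, best match first."""
    scored = [(docs[item["index"]], float(item["score"][0]))
              for item in response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_documents(json.loads(response_text), documents)
for doc, score in ranked:
    print(f"{score:.4f}  {doc}")
```

Each `score` field is a one-element list in the sample response, which is why the sketch reads `item["score"][0]`.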

### Example of usage for a pair of two lists of texts

In this case, the model compares the texts pairwise, matching the items at the same index in each list.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        1
      ]
    },
    {
      "index": 1,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```

### Example of usage for a pair of two strings

In this case, the model compares the two strings directly.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'
```

Response:

```json
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```
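
The three input shapes can be summarized as a pairing rule. The sketch below is an illustration of that rule as described in this section, not vLLM's implementation: `make_pairs` is a hypothetical helper, and the two error cases (a list for `text_1` with a string for `text_2`, and lists of different lengths) are assumptions, since the examples here do not cover them.

```python
from typing import List, Tuple, Union

TextInput = Union[str, List[str]]


def make_pairs(text_1: TextInput, text_2: TextInput) -> List[Tuple[str, str]]:
    """Hypothetical helper: build the (text_1, text_2) pairs to be scored."""
    if isinstance(text_1, str) and isinstance(text_2, str):
        # Two strings: a single pair.
        return [(text_1, text_2)]
    if isinstance(text_1, str):
        # String and list: the string is compared to every list item.
        return [(text_1, candidate) for candidate in text_2]
    if isinstance(text_2, str):
        # Assumption: this shape is not shown in the examples, so reject it.
        raise ValueError("list text_1 with string text_2 is not covered here")
    if len(text_1) != len(text_2):
        # Assumption: index-wise pairing needs lists of equal length.
        raise ValueError("text_1 and text_2 must have the same length")
    # Two lists: pair the items that share an index.
    return list(zip(text_1, text_2))


print(make_pairs("q", ["a", "b"]))
print(make_pairs(["q1", "q2"], ["a", "b"]))
print(make_pairs("q", "a"))
```

The number of pairs produced matches the number of entries in the `data` array of the corresponding response: two for the first two examples and one for the last.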

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
58 changes: 58 additions & 0 deletions examples/openai_cross_encoder_score.py
@@ -0,0 +1,58 @@
"""Examples Python client Score for Cross Encoder Models
"""

import argparse
import pprint

import requests


def post_http_request(prompt: dict, api_url: str) -> requests.Response:
    # Send the score request as a JSON body and return the raw HTTP response.
    headers = {"User-Agent": "Test Client"}
    response = requests.post(api_url, headers=headers, json=prompt)
    return response


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model", type=str, default="BAAI/bge-reranker-v2-m3")
    args = parser.parse_args()
    api_url = f"http://{args.host}:{args.port}/v1/score"

    model_name = args.model

    # Case 1: text_1 is a string, text_2 is a list.
    text_1 = "What is the capital of France?"
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt for text_1 is string and text_2 is a list:")
    pprint.pprint(prompt)
    print("Score Response:")
    # requests.Response has no `data` attribute; decode the JSON body instead.
    pprint.pprint(score_response.json())

    # Case 2: text_1 and text_2 are both lists, paired by index.
    text_1 = [
        "What is the capital of Brazil?", "What is the capital of France?"
    ]
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt for text_1 and text_2 are lists:")
    pprint.pprint(prompt)
    print("Score Response:")
    pprint.pprint(score_response.json())

    # Case 3: text_1 and text_2 are both strings.
    text_1 = "What is the capital of Brazil?"
    text_2 = "The capital of Brazil is Brasilia."
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt for text_1 and text_2 are strings:")
    pprint.pprint(prompt)
    print("Score Response:")
    pprint.pprint(score_response.json())
20 changes: 20 additions & 0 deletions tests/conftest.py
@@ -265,6 +265,7 @@ def __init__(
        model_kwargs: Optional[Dict[str, Any]] = None,
        is_embedding_model: bool = False,
        is_sentence_transformer: bool = False,
        is_cross_encoder: bool = False,
        skip_tokenizer_init: bool = False,
        auto_cls: Type[_BaseAutoModelClass] = AutoModelForCausalLM,
        postprocess_inputs: Callable[..., BatchEncoding] = identity,
@@ -282,6 +283,14 @@ def __init__(
                    device="cpu",
                    trust_remote_code=True,
                ).to(dtype=torch_dtype))
        elif is_cross_encoder:
            # Lazy init required for AMD CI
            from sentence_transformers import CrossEncoder
            self.model = CrossEncoder(model_name,
                                      device="cpu",
                                      trust_remote_code=True)
            self.model.model = self.wrap_device(self.model.model)\
                .to(dtype=torch_dtype)
        else:
            model_kwargs = model_kwargs if model_kwargs is not None else {}
            self.model = self.wrap_device(
@@ -625,6 +634,9 @@ def generate_encoder_decoder_greedy_logprobs_limit(
    def encode(self, prompts: List[str]) -> List[List[torch.Tensor]]:
        return self.model.encode(prompts)

    def predict(self, prompts: List[List[str]]) -> torch.Tensor:
        return self.model.predict(prompts, convert_to_tensor=True)

    def __enter__(self):
        return self

Expand Down Expand Up @@ -898,6 +910,14 @@ def encode(
        req_outputs = self.model.encode(inputs)
        return [req_output.outputs.embedding for req_output in req_outputs]

    def score(
        self,
        text_1: Union[str, List[str]],
        text_2: Union[str, List[str]],
    ) -> List[List[float]]:
        req_outputs = self.model.score(text_1, text_2)
        return [req_output.outputs.embedding for req_output in req_outputs]

    def __enter__(self):
        return self

93 changes: 93 additions & 0 deletions tests/entrypoints/openai/test_score.py
@@ -0,0 +1,93 @@
import pytest
import requests

from vllm.entrypoints.openai.protocol import ScoreResponse

from ...utils import RemoteOpenAIServer

MODEL_NAME = "BAAI/bge-reranker-v2-m3"


@pytest.fixture(scope="module")
def server():
    args = [
        "--enforce-eager",
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_str_text_2_list(server: RemoteOpenAIServer,
                                      model_name: str):
    text_1 = "What is the capital of France?"
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 2
    assert score.data[0].score[0] <= 0.01
    assert score.data[1].score[0] >= 0.9


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_list_text_2_list(server: RemoteOpenAIServer,
                                       model_name: str):
    text_1 = [
        "What is the capital of the United States?",
        "What is the capital of France?"
    ]
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 2
    assert score.data[0].score[0] <= 0.01
    assert score.data[1].score[0] >= 0.9


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_str_text_2_str(server: RemoteOpenAIServer,
                                     model_name: str):
    text_1 = "What is the capital of France?"
    text_2 = "The capital of France is Paris."

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
    assert score.data[0].score[0] >= 0.9