
[RFC]: Reward Modelling in OpenAI Compatible Server #8967

Closed
@noamgat

Description


Motivation.

Reward models are an important tool in NLP / AI workflows, especially in agentic flows that use them to verify the quality of intermediate outputs or to rank among several attempts at performing a single task.

vLLM just added support for a reward model in #8896 (comment).
Using it with the OpenAI Compatible Server currently requires a workaround: it piggybacks on the embedding endpoint.
The workaround requires the client to know which tokenizer the server uses, apply the chat template to the conversation itself, and send the resulting string to the embedding endpoint. This isn't ideal, and it breaks the decoupling between client and server.
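To make the coupling concrete, here is a rough sketch of what the client currently has to do. The model name and server URL are placeholders, and the exact response shape may differ:

```python
# Rough sketch of the current client-side workaround. MODEL and the URL are
# placeholders; the key point is that the client must load the same tokenizer
# as the server and apply the chat template itself.
import requests
from transformers import AutoTokenizer

MODEL = "my-org/my-reward-model"  # hypothetical model name
tokenizer = AutoTokenizer.from_pretrained(MODEL)

conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# The client has to know the server's chat template; this is the coupling problem.
prompt = tokenizer.apply_chat_template(conversation, tokenize=False)

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": MODEL, "input": prompt},
)
# The reward scores come back in the "embedding" field of the response.
scores = resp.json()["data"][0]["embedding"]
```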

A short discussion in the same issue led to the creation of this RFC.

Proposed Change.

The reason that no endpoint currently matches the needs of the reward model is as follows:

  • The embedding endpoint receives a string as the input, not a conversation.
  • The chat endpoint returns a string, not a series of numbers. Even if you ask for logprobs, they are returned after softmax has been applied, which is not a reversible process.

I see several ways to more elegantly support reward models in the OpenAI compatible server, and this RFC will hopefully be the discussion point for them.

Option 1:
Add a conversation object (List[Dict[str, str]]) as a potential input to the EmbeddingRequest class. It already supports a variety of input types:
input: Union[List[int], List[List[int]], str, List[str]]
Upon detecting that a conversation object was given, the OpenAI Compatible Server would apply the chat template using the tokenizer and proceed as if it had received str input.
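A minimal server-side sketch of what this could look like, assuming a new optional messages field on EmbeddingRequest (the field name and exact types are illustrative, not the actual vLLM definitions):

```python
# Server-side sketch of Option 1. The "messages" field name and these type
# hints are my guesses, not the actual vLLM definitions.
from typing import Dict, List, Optional, Union

from pydantic import BaseModel


class EmbeddingRequest(BaseModel):
    model: str
    # Existing input forms, as quoted above:
    input: Optional[Union[List[int], List[List[int]], str, List[str]]] = None
    # Proposed addition: a chat-style conversation.
    messages: Optional[List[Dict[str, str]]] = None


def resolve_input(request: EmbeddingRequest, tokenizer) -> str:
    """If a conversation was given, apply the chat template server-side and
    fall through to the existing string-input code path."""
    if request.messages is not None:
        return tokenizer.apply_chat_template(request.messages, tokenize=False)
    assert isinstance(request.input, str)  # other input forms handled elsewhere
    return request.input
```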

Option 2:
Add a way to get output logits instead of output logprobs from the chat endpoint. This could be either a new per-request parameter (similar to top_logprobs) or a server-side flag that overrides what is returned in that field (for example, a --return_logits_instead_of_logprobs flag for the OpenAI server).
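From the client's perspective this might look like the following; the return_logits parameter is purely hypothetical and only illustrates the idea of getting raw logits instead of post-softmax logprobs:

```python
# Client-side sketch of Option 2. "return_logits" is hypothetical; it does not
# exist today. The idea is that the chat endpoint would return raw logits in
# place of the post-softmax logprobs.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-org/my-reward-model",  # placeholder
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ],
        "max_tokens": 1,
        "logprobs": True,
        "return_logits": True,  # hypothetical per-request switch
    },
)
```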

Option 3:
Add a dedicated endpoint to vLLM.
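For illustration only, such an endpoint might be called like this (the /v1/reward path and response schema are invented here, not a proposal for the final API):

```python
# Sketch of Option 3: a dedicated endpoint. The "/v1/reward" path and the
# response shape are invented for illustration only.
import requests

resp = requests.post(
    "http://localhost:8000/v1/reward",
    json={
        "model": "my-org/my-reward-model",  # placeholder
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ],
    },
)
print(resp.json())  # e.g. {"scores": [...]}; the exact schema would need to be defined
```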

Option 4:
Do nothing. Since there is a /tokenize endpoint that also accepts a conversation, the sample code in #8896 (comment) could be changed to use the tokenize endpoint, receive the token list, and send that to the embeddings endpoint, which addresses the coupling problem.
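A sketch of that client flow, assuming the /tokenize response exposes the token ids (field names follow my reading of the current endpoints and may need adjusting):

```python
# Client-side sketch of Option 4: let the server apply its own chat template
# via /tokenize, then feed the resulting token ids to /v1/embeddings.
import requests

BASE = "http://localhost:8000"
MODEL = "my-org/my-reward-model"  # placeholder
conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Step 1: the server applies its chat template and returns token ids.
tok = requests.post(
    f"{BASE}/tokenize",
    json={"model": MODEL, "messages": conversation},
).json()

# Step 2: send the token ids straight to the embeddings endpoint
# (List[int] is already an accepted input form).
resp = requests.post(
    f"{BASE}/v1/embeddings",
    json={"model": MODEL, "input": tok["tokens"]},
)
scores = resp.json()["data"][0]["embedding"]
```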

I personally support Option 1, as it feels the least hacky of the bunch, and also does not require a whole lot of new code.

What do you think?

Feedback Period.

It's not my place to decide when a conclusion has been reached here, but I don't think it should take more than a couple of weeks.

CC List.

@simon-mo
@DarkLight1337
@zhuzilin

Any Other Things.

vLLM is awesome!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
