Description
Update: after incorporating feedback, the updated proposal is described in this comment: #17191 (comment)
Original RFC proposal (outdated):
Motivation
Addresses #16802 ("Support custom args in OpenAI (chat) completion requests") by adding an "extra" sampling params argument to all endpoints which trigger sampling (completion, chat, and transcription). This is ultimately a prerequisite for logits processor support (RFC: #13360, PR: #16728), since logits processors may require custom arguments which are not utilized by vLLM's core sampling logic.
Proposed Change.
Here it is proposed that when using the HTTP client, custom sampling arguments may be passed in as key/value pairs via the `extra_sampling_params` argument:

```python
extra_sampling_params: Optional[dict[str, Any]]
```

#13300 added an `extra_args` member to `SamplingParams`:

```python
extra_args: Optional[dict[str, Any]] = None
```

`protocol.py` defines a class type for each endpoint's requests. Currently, the arrival of a completion/chat/transcription request at a particular REST API endpoint causes a call to the `to_sampling_params()` method associated with an instance of the appropriate request class. This method constructs a `SamplingParams` instance from the request attributes using the `from_optional()` method; the proposed change is to pass `extra_sampling_params` to `extra_args` at that point:

```python
SamplingParams.from_optional(..., extra_args=extra_sampling_params)
```

In this way, the custom arguments stored in `SamplingParams.extra_args` will be available to logits processors downstream in the request processing pipeline.
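To make that wiring concrete, here is a minimal, self-contained sketch. The `SamplingParams` and `CompletionRequest` classes below are simplified stand-ins for illustration only, not the real vLLM definitions; only the `extra_sampling_params` → `extra_args` hand-off mirrors the proposal.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Simplified stand-in for vLLM's SamplingParams (illustration only).
@dataclass
class SamplingParams:
    temperature: float = 1.0
    ignore_eos: bool = False
    extra_args: Optional[dict[str, Any]] = None

    @classmethod
    def from_optional(cls, *, temperature=None, ignore_eos=None,
                      extra_args=None):
        return cls(
            temperature=1.0 if temperature is None else temperature,
            ignore_eos=bool(ignore_eos),
            extra_args=extra_args,
        )


# Simplified stand-in for a request class in protocol.py.
@dataclass
class CompletionRequest:
    prompt: str
    temperature: Optional[float] = None
    ignore_eos: Optional[bool] = None
    # Proposed new member holding custom key/value pairs.
    extra_sampling_params: Optional[dict[str, Any]] = None

    def to_sampling_params(self) -> SamplingParams:
        # Proposed change: forward the custom dict into extra_args.
        return SamplingParams.from_optional(
            temperature=self.temperature,
            ignore_eos=self.ignore_eos,
            extra_args=self.extra_sampling_params,
        )


req = CompletionRequest(prompt="Say this is a test",
                        ignore_eos=True,
                        extra_sampling_params={"custom_arg": 42})
params = req.to_sampling_params()
print(params.extra_args)  # {'custom_arg': 42}
```

Note that core sampling fields (`temperature`, `ignore_eos`) remain explicit parameters, while anything the core sampler does not recognize rides along opaquely in `extra_args`.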
For example,

```shell
curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Say this is a test", "ignore_eos": true, "extra_sampling_params": {"custom_arg": <value>}}'
```

results in a `SamplingParams` instance with `extra_args = {"custom_arg": <value>}`.
This RFC only applies to API endpoints which trigger sampling, summarized below (along with their associated request classes in `protocol.py`):

- /v1/completions (`CompletionRequest`)
- /v1/chat/completions (`ChatCompletionRequest`)
- /v1/audio/transcriptions (`TranscriptionRequest`)
The following API endpoints do not trigger sampling and are not part of this workstream (note that to save time in writing this RFC, I refer to the endpoints in terms of broad categories here):

- Embeddings (`EmbeddingCompletionRequest`, `EmbeddingChatRequest`)
- Rerank (`RerankRequest`)
- Tokenization/Detokenization (`TokenizationCompletionRequest`, `TokenizationChatRequest`, `DetokenizeRequest`)
- LoRA load (`LoadLoRAAdapterRequest`) and unload (`UnloadLoRAAdapterRequest`)
If you are using the OpenAI Python SDK (or a similar SDK in another language), the client-side completion/chat/transcription request method does not have an `extra_sampling_params` argument; `extra_sampling_params` will need to be passed in as a key/value pair in the `extra_body` dict argument of the request method. Note that the `extra_body` argument is not part of the server's REST API, and if you pass `extra_body` as an argument within an HTTP client request, the server will ignore it. `extra_body` is simply a "catch-all" argument supported by the Python SDK to handle "special" parameters. Internally, the SDK unpacks `extra_body` into REST API arguments; the server never sees the `extra_body` argument itself.
Under the proposed changes in this PR, the following SDK request exemplifies correct usage:

```python
completion = await client.completions.create(
    model=model_name,
    prompt="Hello, my name is",
    max_tokens=5,
    temperature=0.0,
    extra_body={"ignore_eos": True,
                "extra_sampling_params": {"custom_arg": True}})
```
- OpenAI-standard API arguments are set directly as arguments to `create()`
- Arguments such as `ignore_eos` are set in `extra_body` but not in `extra_sampling_params`, because `ignore_eos` is an argument defined explicitly in `protocol.py` and utilized by vLLM's core sampling functionality
- `custom_arg` (which is meant to represent a hypothetical custom argument for a logits processor) is not defined explicitly in any of the request types defined in `protocol.py` and is therefore packed within `extra_sampling_params`
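To illustrate the downstream consumer, a hypothetical logits processor might read its custom argument out of `extra_args`. The `logit_scale` parameter below is invented for illustration and is not part of any existing vLLM logits processor:

```python
from typing import Any, Optional


def scale_logits(logits: list[float],
                 extra_args: Optional[dict[str, Any]]) -> list[float]:
    """Hypothetical logits processor: scales logits by a custom
    'logit_scale' value taken from SamplingParams.extra_args."""
    if not extra_args:
        # No custom args supplied for this request: pass through.
        return logits
    scale = extra_args.get("logit_scale", 1.0)
    return [x * scale for x in logits]


# With the proposal in place, extra_args carries the client-supplied dict.
print(scale_logits([1.0, 2.0], {"logit_scale": 0.5}))  # [0.5, 1.0]
print(scale_logits([1.0, 2.0], None))                  # [1.0, 2.0]
```

The key point is that vLLM core never interprets `logit_scale`; it simply ferries the dict from the request to whichever processor knows what to do with it.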
Plan for rolling out extra sampling params:

PR #16862 is WIP and does not yet satisfy the specifications below, but will by the time it lands.

- In `protocol.py`, add an `extra_sampling_params` member to `CompletionRequest`, `ChatCompletionRequest`, and `TranscriptionRequest`.
- In each of these three request classes, `extra_sampling_params` is assigned to `SamplingParams.extra_args` inside of the `to_sampling_params()` method as described above.
- This PR is a prerequisite for near-term work on logits processor support.
- This PR does not introduce breaking changes.
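As a sanity check on the wire format, the JSON body from the earlier curl example decodes into a plain dict, with the custom arguments nested under the `extra_sampling_params` key. This is a minimal sketch with a concrete boolean substituted for the `<value>` placeholder:

```python
import json

# Hypothetical request body mirroring the curl example, with a concrete
# value substituted for the <value> placeholder.
body = '''{"model": "facebook/opt-125m",
           "prompt": "Say this is a test",
           "ignore_eos": true,
           "extra_sampling_params": {"custom_arg": true}}'''

request = json.loads(body)

# Known top-level fields stay top-level; custom args arrive as a nested
# dict destined for SamplingParams.extra_args.
extra = request.get("extra_sampling_params")
print(extra)  # {'custom_arg': True}
```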
Thoughts on alternative proposals
The core requirement is that custom sampling arguments are supported, in order to enable the logits processor workstream.
However, in discussions about the API surface area for sampling arguments, one additional proposal was that sampling arguments such as `ignore_eos`, which are not part of the OpenAI API specification but are part of the core vLLM sampling implementation (i.e. they are not "custom" logits processor arguments), should be grouped together under a catch-all dict argument (perhaps under `extra_sampling_params`, or perhaps under a separate dict argument). In other words, these would not be top-level arguments, as they currently are when using the HTTP client.
Here I suggest that this would add little benefit beyond stricter compliance with the OpenAI API specification, while adding unnecessary complexity and code changes.
Feedback Period.
1 week
CC List.
@njhill @comaniac @WoosukKwon @simon-mo
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.