[RFC]: Custom sampling params support in REST API #17191

Open
@afeldman-nm

Update: after incorporating feedback, the updated proposal is described in this comment: #17191 (comment)

Original RFC proposal (outdated):

Motivation

Addresses #16802 (“Support custom args in OpenAI (chat) completion requests”) by adding an “extra” sampling params argument to all endpoints which trigger sampling (completion, chat and transcription). This is ultimately a prerequisite for logits processor support (RFC: #13360, PR: #16728), since logits processors may require custom arguments which are not utilized by vLLM core sampling logic.

Proposed Change.

The proposal is that, when using the HTTP client, custom sampling arguments may be passed as key/value pairs via the extra_sampling_params argument:

extra_sampling_params: Optional[dict[str, Any]]

#13300 added an extra_args member to SamplingParams:

extra_args: Optional[dict[str, Any]] = None

protocol.py defines a class type for each endpoint’s requests. Currently, the arrival of a completion/chat/transcription request at a particular REST API endpoint causes a call to the to_sampling_params() method associated with an instance of the appropriate request class. This method constructs a SamplingParams instance from the request attributes using the from_optional() method; the proposed change is to pass extra_sampling_params to extra_args at that point:

SamplingParams.from_optional(..., extra_args=extra_sampling_params)

In this way, the custom arguments stored in SamplingParams.extra_args will be available to logits processors downstream in the request processing pipeline.
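
The flow described above can be sketched with stand-in classes. This is a minimal illustration and not vLLM's code: the SamplingParams and CompletionRequest dataclasses below are simplified assumptions that model only the fields discussed in this RFC, and the real classes carry many more fields.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SamplingParams:
    """Simplified stand-in for vLLM's SamplingParams (assumption: two fields only)."""
    ignore_eos: bool = False
    extra_args: Optional[dict[str, Any]] = None

    @classmethod
    def from_optional(
        cls,
        ignore_eos: Optional[bool] = None,
        extra_args: Optional[dict[str, Any]] = None,
    ) -> "SamplingParams":
        # Unset optional values fall back to defaults.
        return cls(ignore_eos=bool(ignore_eos), extra_args=extra_args)


@dataclass
class CompletionRequest:
    """Simplified stand-in for the request class in protocol.py."""
    model: str
    prompt: str
    ignore_eos: bool = False
    extra_sampling_params: Optional[dict[str, Any]] = None

    def to_sampling_params(self) -> SamplingParams:
        # The proposed change: forward extra_sampling_params as extra_args.
        return SamplingParams.from_optional(
            ignore_eos=self.ignore_eos,
            extra_args=self.extra_sampling_params,
        )


request = CompletionRequest(
    model="facebook/opt-125m",
    prompt="Say this is a test",
    ignore_eos=True,
    extra_sampling_params={"custom_arg": 1},
)
params = request.to_sampling_params()
print(params.extra_args)
```

In this sketch, ignore_eos travels as an ordinary request field, while anything the request class does not define explicitly rides along in extra_args unchanged.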

For example,

curl http://0.0.0.0:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "facebook/opt-125m", "prompt": "Say this is a test", "ignore_eos": true, "extra_sampling_params": {"custom_arg": <value>}}'

results in a SamplingParams instance with extra_args = {"custom_arg": <value>}.

This RFC only applies to API endpoints which trigger sampling, summarized below (along with their associated request classes in protocol.py):

  • /v1/completions (CompletionRequest)
  • /v1/chat/completions (ChatCompletionRequest)
  • /v1/audio/transcriptions (TranscriptionRequest)

The following API endpoints do not trigger sampling and are not part of this workstream (note that to save time in writing this RFC, I refer to the endpoints in terms of broad categories here):

  • Embeddings (EmbeddingCompletionRequest, EmbeddingChatRequest)
  • Rerank (RerankRequest)
  • Tokenization/Detokenization (TokenizationCompletionRequest, TokenizationChatRequest, DetokenizeRequest)
  • LoRA load (LoadLoRAAdapterRequest) and unload (UnloadLoRAAdapterRequest)

If you are using the OpenAI Python SDK (or a similar SDK in another language), the client-side completion/chat/transcription request method has no extra_sampling_params argument; instead, pass extra_sampling_params as a key-value pair inside the extra_body dict argument of the request method. extra_body is not part of the server’s REST API: it is a “catch-all” argument supported by the Python SDK to handle “special” parameters, and the SDK unpacks it into REST API arguments before sending the request, so the server never sees an extra_body argument. Conversely, if you pass extra_body as an argument within a raw HTTP client request, the server will ignore it.
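
The unpacking of extra_body can be pictured as a dictionary merge into the request body. The helper below, build_request_body, is hypothetical and only models the behavior described above; it is not the SDK's implementation.

```python
import json
from typing import Any, Optional


def build_request_body(
    standard_args: dict[str, Any],
    extra_body: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical model of how extra_body entries become top-level JSON fields."""
    body = dict(standard_args)
    # Each extra_body entry is promoted to a top-level key;
    # no "extra_body" key is ever sent to the server.
    body.update(extra_body or {})
    return body


body = build_request_body(
    {"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 5},
    extra_body={"ignore_eos": True, "extra_sampling_params": {"custom_arg": True}},
)
print(json.dumps(body, indent=2))
```

The printed body has ignore_eos and extra_sampling_params alongside model and prompt, which is the same shape an HTTP client would send directly.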

Under the changes proposed here, the following SDK request shows correct usage:

   completion = await client.completions.create(
       model=model_name,
       prompt="Hello, my name is",
       max_tokens=5,
       temperature=0.0,
       extra_body={
           "ignore_eos": True,
           "extra_sampling_params": {"custom_arg": True},
       },
   )
  • OpenAI-standard API arguments are set directly as arguments to create()
  • Arguments such as ignore_eos are set in extra_body but not in extra_sampling_params, because ignore_eos is an argument defined explicitly in protocol.py and utilized by vLLM’s core sampling functionality
  • custom_arg (which is meant to represent a hypothetical custom argument for a logits processor) is not defined explicitly in any of the request types defined in protocol.py and is therefore packed within extra_sampling_params
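
The three placement rules above can be expressed as a small routing function. This is only a sketch: build_create_kwargs is a hypothetical helper, and the two name sets are tiny illustrative subsets rather than complete lists of OpenAI or vLLM arguments.

```python
from typing import Any

# Assumption: illustrative subsets only, not exhaustive lists.
OPENAI_STANDARD_ARGS = {"model", "prompt", "max_tokens", "temperature"}
VLLM_REQUEST_ARGS = {"ignore_eos"}  # defined explicitly on the request classes


def build_create_kwargs(args: dict[str, Any]) -> dict[str, Any]:
    """Route each argument to create(), extra_body, or extra_sampling_params."""
    kwargs: dict[str, Any] = {}
    extra_body: dict[str, Any] = {}
    custom: dict[str, Any] = {}
    for name, value in args.items():
        if name in OPENAI_STANDARD_ARGS:
            kwargs[name] = value      # direct argument to create()
        elif name in VLLM_REQUEST_ARGS:
            extra_body[name] = value  # vLLM-specific but explicitly defined
        else:
            custom[name] = value      # unknown to the request class
    if custom:
        extra_body["extra_sampling_params"] = custom
    if extra_body:
        kwargs["extra_body"] = extra_body
    return kwargs


create_kwargs = build_create_kwargs(
    {"model": "facebook/opt-125m", "max_tokens": 5, "ignore_eos": True, "custom_arg": True}
)
print(create_kwargs)
```

The result could be splatted into the SDK call as client.completions.create(**create_kwargs), mirroring the request shown above.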

Plan for rolling out extra sampling params:

PR #16862 is WIP and does not yet satisfy the specifications below, but will by the time it lands:
  • In protocol.py, add an extra_sampling_params member to CompletionRequest, ChatCompletionRequest, and TranscriptionRequest.
  • In each of these three request classes, extra_sampling_params is assigned to SamplingParams.extra_args inside of the to_sampling_params() method as described above.
  • This PR is a prerequisite for near-term work on logits processor support.
  • This PR does not introduce breaking changes.

Thoughts on alternative proposals

The core requirement is that custom sampling arguments are supported, in order to enable the logits processor workstream.

However, in discussions about the API surface area for sampling arguments, one additional proposal concerned sampling arguments such as ignore_eos, which are not part of the OpenAI API specification but are part of the core vLLM sampling implementation (i.e. they are not “custom” logits processor arguments). The proposal was to group these together under a catch-all dict argument (perhaps extra_sampling_params, or perhaps a separate dict argument), so that they would no longer be top-level arguments, as they currently are when you use the HTTP client.

Here I suggest that this would add little benefit other than stricter compliance with the OpenAI API specification, and in exchange would add unnecessary complexity and code changes.

Feedback Period.

1 week

CC List.

@njhill @comaniac @WoosukKwon @simon-mo

CC @robertgshaw2-redhat

Any Other Things.

No response
