Description
Update: after incorporating feedback, the updated proposal is described in this comment: #17191 (comment)
Original RFC proposal (outdated):
Motivation
Addresses #16802 ("Support custom args in OpenAI (chat) completion requests") by adding an "extra" sampling params argument to all endpoints which trigger sampling (completion, chat, and transcription). This is ultimately a prerequisite for logits processor support (RFC: #13360, PR: #16728), since logits processors may require custom arguments which are not utilized by vLLM's core sampling logic.
Proposed Change.
Here it is proposed that when using the HTTP client, custom sampling arguments may be passed in as key/value pairs via the `extra_sampling_params` argument:

```python
extra_sampling_params: Optional[dict[str, Any]]
```

#13300 added an `extra_args` member to `SamplingParams`:

```python
extra_args: Optional[dict[str, Any]] = None
```

`protocol.py` defines a class type for each endpoint's requests. Currently, the arrival of a completion/chat/transcription request at a particular REST API endpoint causes a call to the `to_sampling_params()` method associated with an instance of the appropriate request class. This method constructs a `SamplingParams` instance from the request attributes using the `from_optional()` method; the proposed change is to pass `extra_sampling_params` to `extra_args` at that point:

```python
SamplingParams.from_optional(..., extra_args=extra_sampling_params)
```

In this way, the custom arguments stored in `SamplingParams.extra_args` will be available to logits processors downstream in the request processing pipeline.
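To make that wiring concrete, here is a minimal, self-contained sketch. The `SamplingParams` and `CompletionRequest` classes below are simplified stand-ins for illustration only, not the real vLLM definitions; only the `extra_sampling_params` → `extra_args` hand-off mirrors the proposal.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Simplified stand-in for vLLM's SamplingParams (illustration only).
@dataclass
class SamplingParams:
    temperature: float = 1.0
    ignore_eos: bool = False
    extra_args: Optional[dict[str, Any]] = None

    @classmethod
    def from_optional(cls, *, temperature=None, ignore_eos=None,
                      extra_args=None):
        return cls(
            temperature=1.0 if temperature is None else temperature,
            ignore_eos=bool(ignore_eos),
            extra_args=extra_args,
        )


# Simplified stand-in for a request class in protocol.py.
@dataclass
class CompletionRequest:
    prompt: str
    temperature: Optional[float] = None
    ignore_eos: Optional[bool] = None
    # Proposed new member holding custom key/value pairs.
    extra_sampling_params: Optional[dict[str, Any]] = None

    def to_sampling_params(self) -> SamplingParams:
        # Proposed change: forward the custom dict into extra_args.
        return SamplingParams.from_optional(
            temperature=self.temperature,
            ignore_eos=self.ignore_eos,
            extra_args=self.extra_sampling_params,
        )


req = CompletionRequest(prompt="Say this is a test",
                        ignore_eos=True,
                        extra_sampling_params={"custom_arg": 42})
params = req.to_sampling_params()
print(params.extra_args)  # {'custom_arg': 42}
```

Note that core sampling fields (`temperature`, `ignore_eos`) remain explicit parameters, while anything the core sampler does not recognize rides along opaquely in `extra_args`.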
For example,

```shell
curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Say this is a test", "ignore_eos": true, "extra_sampling_params": {"custom_arg": <value>}}'
```

results in a `SamplingParams` instance with `extra_args = {"custom_arg": <value>}`.
This RFC only applies to API endpoints which trigger sampling, summarized below (along with their associated request classes in `protocol.py`):

- /v1/completions (`CompletionRequest`)
- /v1/chat/completions (`ChatCompletionRequest`)
- /v1/audio/transcriptions (`TranscriptionRequest`)
The following API endpoints do not trigger sampling and are not part of this workstream (note that to save time in writing this RFC, I refer to the endpoints in terms of broad categories here):

- Embeddings (`EmbeddingCompletionRequest`, `EmbeddingChatRequest`)
- Rerank (`RerankRequest`)
- Tokenization/Detokenization (`TokenizationCompletionRequest`, `TokenizationChatRequest`, `DetokenizeRequest`)
- LoRA load (`LoadLoRAAdapterRequest`) and unload (`UnloadLoRAAdapterRequest`)
If you are using the OpenAI Python SDK (or a similar SDK in another language), the client-side completion/chat/transcription request method does not have an `extra_sampling_params` argument; `extra_sampling_params` will need to be passed in as a key/value pair in the `extra_body` dict argument of the request method. Note that the `extra_body` argument is not part of the server's REST API, and if you pass `extra_body` as an argument within an HTTP client request, the server will ignore it. `extra_body` is simply a "catch-all" argument supported by the Python SDK to handle "special" parameters. Internally, the SDK unpacks `extra_body` into REST API arguments; the server never sees the `extra_body` argument itself.
Under the proposed changes in this PR, the following SDK request exemplifies correct usage:

```python
completion = await client.completions.create(
    model=model_name,
    prompt="Hello, my name is",
    max_tokens=5,
    temperature=0.0,
    extra_body={"ignore_eos": True,
                "extra_sampling_params": {"custom_arg": True}})
```
- OpenAI-standard API arguments are set directly as arguments to `create()`
- Arguments such as `ignore_eos` are set in `extra_body` but not in `extra_sampling_params`, because `ignore_eos` is an argument defined explicitly in `protocol.py` and utilized by vLLM's core sampling functionality
- `custom_arg` (which is meant to represent a hypothetical custom argument for a logits processor) is not defined explicitly in any of the request types defined in `protocol.py` and is therefore packed within `extra_sampling_params`
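To illustrate the downstream consumer, a hypothetical logits processor might read its custom argument out of `extra_args`. The `logit_scale` parameter below is invented for illustration and is not part of any existing vLLM logits processor:

```python
from typing import Any, Optional


def scale_logits(logits: list[float],
                 extra_args: Optional[dict[str, Any]]) -> list[float]:
    """Hypothetical logits processor: scales logits by a custom
    'logit_scale' value taken from SamplingParams.extra_args."""
    if not extra_args:
        # No custom args supplied for this request: pass through.
        return logits
    scale = extra_args.get("logit_scale", 1.0)
    return [x * scale for x in logits]


# With the proposal in place, extra_args carries the client-supplied dict.
print(scale_logits([1.0, 2.0], {"logit_scale": 0.5}))  # [0.5, 1.0]
print(scale_logits([1.0, 2.0], None))                  # [1.0, 2.0]
```

The key point is that vLLM core never interprets `logit_scale`; it simply ferries the dict from the request to whichever processor knows what to do with it.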
Plan for rolling out extra sampling params:

PR #16862 is WIP and does not yet satisfy the specifications below, but will by the time it lands.

- In `protocol.py`, add an `extra_sampling_params` member to `CompletionRequest`, `ChatCompletionRequest`, and `TranscriptionRequest`.
- In each of these three request classes, `extra_sampling_params` is assigned to `SamplingParams.extra_args` inside of the `to_sampling_params()` method as described above.
- This PR is a prerequisite for near-term work on logits processor support.
- This PR does not introduce breaking changes.
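As a sanity check on the wire format, the JSON body from the earlier curl example decodes into a plain dict, with the custom arguments nested under the `extra_sampling_params` key. This is a minimal sketch with a concrete boolean substituted for the `<value>` placeholder:

```python
import json

# Hypothetical request body mirroring the curl example, with a concrete
# value substituted for the <value> placeholder.
body = '''{"model": "facebook/opt-125m",
           "prompt": "Say this is a test",
           "ignore_eos": true,
           "extra_sampling_params": {"custom_arg": true}}'''

request = json.loads(body)

# Known top-level fields stay top-level; custom args arrive as a nested
# dict destined for SamplingParams.extra_args.
extra = request.get("extra_sampling_params")
print(extra)  # {'custom_arg': True}
```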
Thoughts on alternative proposals
The core requirement is that custom sampling arguments are supported, in order to enable the logits processor workstream.
However, in discussions about the API surface area for sampling arguments, one additional proposal was that sampling arguments such as `ignore_eos`, which are not part of the OpenAI API specification but are part of the core vLLM sampling implementation (i.e. they are not "custom" logits processor arguments), should be grouped together under a catch-all dict argument (perhaps under `extra_sampling_params`, or perhaps under a separate dict argument). In other words, these would not be top-level arguments, as they currently are when using the HTTP client.
Here I suggest that this would add little benefit beyond stricter compliance with the OpenAI API specification, while adding unnecessary complexity and code changes.
Feedback Period.
1 week
CC List.
@njhill @comaniac @WoosukKwon @simon-mo
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.