I'm working on a use case that involves running the same dataset/prompt across multiple models (including some OpenAI and some open-source models). I would like to be able to do batch inference on many requests in a file that follows the OpenAI Batch file format.
This follows the spirit/pattern of the popular OpenAI API server interface for using vLLM.
It is easy to adapt existing code that calls web endpoints to generate these files, since the `body` field is essentially what you would pass to the web endpoint.
This format doesn't require the user to think about rate limits, parallelism, etc.
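For reference, each line of an OpenAI Batch input file is a JSON object whose `body` is an ordinary chat completions request (the model name and message below are just illustrative):

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}}
```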
I'll lay out an implementation plan here; I'm willing to contribute the implementation.
Interface
The primary interface would be via CLI command.
$ python -m vllm.entrypoints.openai_batch --help
Usage: openai_batch [OPTIONS]
Run offline inference on a file which conforms to the OpenAI Batch file format. https://platform.openai.com/docs/guides/batch/getting-started
Options:
--version Show the version and exit.
--help Show this message and exit.
-i, --input-file The path or URL to a single input file. Currently supports local file paths or the HTTP protocol (http or https). If a URL is specified, the file should be available via HTTP GET.
-o, --output-file The path or URL to a single output file. Currently supports local file paths or web (http or https) URLs. If a URL is specified, the file should be writable via HTTP PUT.
Exit status:
0 No problems occurred.
1 Generic error code.
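For example (the file names are placeholders; engine arguments such as the model would presumably be passed the same way as they are to the existing API server entrypoint):

```
$ python -m vllm.entrypoints.openai_batch -i requests.jsonl -o results.jsonl
```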
Implementation
This feature should be a fairly independent wrapper around the vLLM core, and the implementation shouldn't be very involved. I propose only a very minor cleanup to OpenAIServingChat, in which the OpenAIServingChat::create_chat_completion function signature changes as follows (the only change is that raw_request is removed):
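Something along these lines (a sketch only; the type annotation for is_aborted is my assumption, based on raw_request.is_disconnected being an async callable in the current code):

```python
from typing import Awaitable, Callable, Optional

from vllm.entrypoints.openai.protocol import ChatCompletionRequest


class OpenAIServingChat:
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        # Replaces raw_request: an async callable that reports whether the
        # client has gone away. Offline batch callers can simply omit it.
        is_aborted: Optional[Callable[[], Awaitable[bool]]] = None,
    ):
        ...
```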
The only caller of the function (api_server.py) will simply pass is_aborted = raw_request.is_disconnected.
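Roughly (a sketch; the handler shape follows the existing FastAPI route in api_server.py):

```python
# api_server.py (sketch)
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    # Pass the disconnect check explicitly instead of the whole raw_request.
    generator = await openai_serving_chat.create_chat_completion(
        request, is_aborted=raw_request.is_disconnected)
    ...
```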
The rest of the implementation will be (a rough sketch follows this list):
Create a new pydantic model for the request.
Load the local file or url into a list of request objects.
Submit the requests to the openai_serving_chat.
Write the outputs.
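A minimal sketch of how those steps could hang together, assuming a hypothetical BatchRequestInput pydantic model and local-file I/O only (URL handling via GET/PUT and the exact output-line schema are omitted for brevity):

```python
import asyncio
import json

from pydantic import BaseModel

from vllm.entrypoints.openai.protocol import ChatCompletionRequest


class BatchRequestInput(BaseModel):
    """One line of an OpenAI-Batch-format input file (hypothetical model)."""
    custom_id: str
    method: str
    url: str
    body: ChatCompletionRequest


async def run_batch(input_file: str, output_file: str, serving_chat) -> None:
    # 1) Load the local file into a list of request objects
    #    (URL input via HTTP GET is omitted from this sketch).
    with open(input_file, encoding="utf-8") as f:
        requests = [BatchRequestInput(**json.loads(line))
                    for line in f if line.strip()]

    # 2) Submit all requests to openai_serving_chat concurrently; the engine
    #    handles batching/scheduling. No is_aborted is needed offline.
    responses = await asyncio.gather(
        *(serving_chat.create_chat_completion(req.body) for req in requests))

    # 3) Write the outputs, one JSON object per line (assumes non-streaming
    #    pydantic responses; the real OpenAI output schema has more fields).
    with open(output_file, "w", encoding="utf-8") as f:
        for req, resp in zip(requests, responses):
            f.write(json.dumps({"custom_id": req.custom_id,
                                "response": resp.model_dump()}) + "\n")
```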
Alternatives
It's possible to simply instantiate the API server and manage parallelizing requests today, but that adds complication to user code: request parallelization is important and difficult to get right, and ensuring the HTTP server's port is available isn't always trivial in failure conditions.
Alternative APIs:
Python API: Nothing in this proposal prevents or makes it more difficult to introduce a Python API later, though there are additional interface decisions to make (async vs. synchronous, exclusive vs. shared engine, etc.).
REST API: The OpenAI REST API uses an input_file_id field. It's not obvious to me how implementing this field in the same way as OpenAI (with validation, etc.) should interact with the rest of the vLLM project, so I will leave it out of scope for now (perhaps others can chip in if this is a desired feature).
Additional context
No response