
[Feature]: Support the OpenAI Batch Chat Completions file format #4777

Closed
@wuisawesome

Description


🚀 The feature, motivation and pitch

I'm working on a use case that involves running the same dataset/prompt across multiple models (including some OpenAI models and some open source models). I would like to be able to do batch inference on many requests in a file that follows the OpenAI Batch file format.

  1. This follows the spirit/pattern of the popular openai api server interface for using vllm.
  2. It is easy to adapt existing code that calls web endpoints to generate these files, since the body field is essentially what you would pass to the web endpoint (see the example file sketched after this list).
  3. This format doesn't require the user to think about rate limits, parallelism, etc.
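
For reference, a minimal sketch of what such an input file contains (the file name and model name are illustrative, not part of the proposal): each line is a standalone JSON object whose body field is exactly the payload you would POST to /v1/chat/completions.

    import json

    # Illustrative only: build a two-request input file in the OpenAI Batch
    # file format. Each line is a JSON object whose "body" is exactly what
    # you would send to /v1/chat/completions.
    requests = [
        {
            "custom_id": "request-1",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Llama-2-7b-chat-hf",
                "messages": [{"role": "user", "content": "Hello!"}],
                "max_tokens": 64,
            },
        },
        {
            "custom_id": "request-2",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Llama-2-7b-chat-hf",
                "messages": [{"role": "user", "content": "What is vLLM?"}],
                "max_tokens": 64,
            },
        },
    ]

    with open("batch_input.jsonl", "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")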

I'll lay out an implementation plan here, which I'm willing to contribute an implementation for.

Interface

The primary interface would be via CLI command.

$ python -m vllm.entrypoints.openai_batch --help
Usage: openai_batch [OPTIONS]

  Run offline inference on a file which conforms to the OpenAI Batch file format. https://platform.openai.com/docs/guides/batch/getting-started

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
  -i, --input-file   The path or URL to a single input file. Currently supports
                     local file paths or the http protocol (http or https). If a
                     URL is specified, the file should be available via HTTP GET.
  -o, --output-file  The path or URL to a single output file. Currently supports
                     local file paths or web (http or https) URLs. If a URL is
                     specified, the file should be available via HTTP PUT.

Exit status:
  0 No problems occurred.
  1 Generic error code.
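
A typical invocation would then look like this (file names are hypothetical; only the two flags described above are assumed):

$ python -m vllm.entrypoints.openai_batch -i batch_input.jsonl -o batch_output.jsonl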

Implementation

The implementation of this feature should be a fairly independent wrapper around the vLLM core, and it shouldn't be very involved. I propose only a very minor cleanup to OpenAIServingChat, in which the OpenAIServingChat::create_chat_completion signature changes to the following (the only change is that the raw_request parameter is replaced by an optional is_aborted callback):

    async def create_chat_completion(
        self, request: ChatCompletionRequest,
        is_aborted: Optional[Callable[[], Awaitable[bool]]] = None,
    ) -> Union[ErrorResponse, AsyncGenerator[str, None], ChatCompletionResponse]:

The only caller of the function (api_server.py) will simply pass is_aborted = raw_request.is_disconnected.
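
Concretely, the call site in api_server.py would look roughly like the sketch below; it assumes the existing Starlette raw_request handle, whose is_disconnected is an async method and therefore satisfies Callable[[], Awaitable[bool]].

    # Sketch of the updated call site in api_server.py (non-streaming path).
    # Passing the bound method lets OpenAIServingChat poll for client
    # disconnects without holding a reference to the HTTP request itself.
    generator = await openai_serving_chat.create_chat_completion(
        request, is_aborted=raw_request.is_disconnected)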

The rest of the implementation (sketched below) will be:

  1. Create a new pydantic model for the request.
  2. Load the local file or url into a list of request objects.
  3. Submit the requests to the openai_serving_chat.
  4. Write the outputs.
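
A minimal sketch of these four steps, assuming pydantic v2 and leaving out construction of the OpenAIServingChat instance (BatchRequestInput is the proposed, not-yet-existing model; only non-streaming responses are handled):

    import asyncio
    import json

    from pydantic import BaseModel

    from vllm.entrypoints.openai.protocol import ChatCompletionRequest
    from vllm.entrypoints.openai.serving_chat import OpenAIServingChat


    class BatchRequestInput(BaseModel):
        """One line of an OpenAI Batch input file (step 1, proposed model)."""
        custom_id: str
        method: str
        url: str
        body: ChatCompletionRequest


    async def run_batch(serving_chat: OpenAIServingChat, input_file: str,
                        output_file: str) -> None:
        # 2. Load the local file into a list of request objects.
        with open(input_file) as f:
            requests = [
                BatchRequestInput.model_validate_json(line)
                for line in f if line.strip()
            ]

        # 3. Submit every request to the serving layer concurrently.
        responses = await asyncio.gather(*(
            serving_chat.create_chat_completion(req.body) for req in requests
        ))

        # 4. Write one output line per request, keyed by custom_id.
        with open(output_file, "w") as f:
            for req, resp in zip(requests, responses):
                f.write(json.dumps({
                    "custom_id": req.custom_id,
                    "response": resp.model_dump(),
                }) + "\n")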

Alternatives

It's possible today to simply instantiate the API server and manage request parallelization yourself, but this adds complication to user code: request parallelization is important and difficult to get right, and ensuring the HTTP server's port is available isn't always trivial under failure conditions.

Alternative APIs:

  • Python API: Nothing in this proposal prevents or makes it more difficult to introduce a Python API later, though there are additional interface decisions to make (async vs. synchronous, exclusive vs. shared engine, etc.).
  • REST API: The OpenAI REST API uses an input_file_id field. It's not obvious to me how implementing this field in the same way as OpenAI (with validation, etc.) should interact with the rest of the vLLM project, so I will leave it out of scope for now (perhaps others can chip in if this is a desired feature).

Additional context

No response
