I'm working on a use case that involves running the same dataset/prompt across multiple models (including some OpenAI and some open-source models). I would like to be able to do batch inference on many requests in a file that follows the OpenAI Batch file format.
This follows the spirit/pattern of the popular OpenAI API server interface for using vLLM.
It is easy to adapt existing code that calls web endpoints to generate these files, since the `body` field is essentially what you would pass to the web endpoint.
This format doesn't require the user to think about rate limits, parallelism, etc.
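For reference, each line of an OpenAI Batch input file is a JSON object whose `body` is an ordinary chat completions request (the model name and message below are just illustrative):

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}}
```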
I'll lay out an implementation plan here; I'm willing to contribute the implementation.
Interface
The primary interface would be via CLI command.
$ python -m vllm.entrypoints.openai_batch --help
Usage: openai_batch [OPTIONS]
Run offline inference on a file which conforms to the OpenAI Batch file format. https://platform.openai.com/docs/guides/batch/getting-started
Options:
--version Show the version and exit.
--help Show this message and exit.
-i, --input-file The path or URL to a single input file. Currently supports local file paths or the HTTP protocol (http or https). If a URL is specified, the file should be available via HTTP GET.
-o, --output-file The path or URL to a single output file. Currently supports local file paths or web (http or https) URLs. If a URL is specified, the file should be writable via HTTP PUT.
Exit status:
0 No problems occurred.
1 Generic error code.
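For example (the file names are placeholders; engine arguments such as the model would presumably be passed the same way as they are to the existing API server entrypoint):

```
$ python -m vllm.entrypoints.openai_batch -i requests.jsonl -o results.jsonl
```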
Implementation
This feature should be a fairly independent wrapper around the vLLM core, and the implementation shouldn't be very involved. I propose only a very minor cleanup to OpenAIServingChat, in which the OpenAIServingChat::create_chat_completion function signature changes as follows (the only change is that raw_request is removed):
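Something along these lines (a sketch only; the type annotation for is_aborted is my assumption, based on raw_request.is_disconnected being an async callable in the current code):

```python
from typing import Awaitable, Callable, Optional

from vllm.entrypoints.openai.protocol import ChatCompletionRequest


class OpenAIServingChat:
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        # Replaces raw_request: an async callable that reports whether the
        # client has gone away. Offline batch callers can simply omit it.
        is_aborted: Optional[Callable[[], Awaitable[bool]]] = None,
    ):
        ...
```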
The only caller of the function (api_server.py) will simply pass is_aborted = raw_request.is_disconnected.
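Roughly (a sketch; the handler shape follows the existing FastAPI route in api_server.py):

```python
# api_server.py (sketch)
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    # Pass the disconnect check explicitly instead of the whole raw_request.
    generator = await openai_serving_chat.create_chat_completion(
        request, is_aborted=raw_request.is_disconnected)
    ...
```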
The rest of the implementation will be (a rough sketch follows this list):
Create a new pydantic model for the request.
Load the local file or url into a list of request objects.
Submit the requests to the openai_serving_chat.
Write the outputs.
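A minimal sketch of how those steps could hang together, assuming a hypothetical BatchRequestInput pydantic model and local-file I/O only (URL handling via GET/PUT and the exact output-line schema are omitted for brevity):

```python
import asyncio
import json

from pydantic import BaseModel

from vllm.entrypoints.openai.protocol import ChatCompletionRequest


class BatchRequestInput(BaseModel):
    """One line of an OpenAI-Batch-format input file (hypothetical model)."""
    custom_id: str
    method: str
    url: str
    body: ChatCompletionRequest


async def run_batch(input_file: str, output_file: str, serving_chat) -> None:
    # 1) Load the local file into a list of request objects
    #    (URL input via HTTP GET is omitted from this sketch).
    with open(input_file, encoding="utf-8") as f:
        requests = [BatchRequestInput(**json.loads(line))
                    for line in f if line.strip()]

    # 2) Submit all requests to openai_serving_chat concurrently; the engine
    #    handles batching/scheduling. No is_aborted is needed offline.
    responses = await asyncio.gather(
        *(serving_chat.create_chat_completion(req.body) for req in requests))

    # 3) Write the outputs, one JSON object per line (assumes non-streaming
    #    pydantic responses; the real OpenAI output schema has more fields).
    with open(output_file, "w", encoding="utf-8") as f:
        for req, resp in zip(requests, responses):
            f.write(json.dumps({"custom_id": req.custom_id,
                                "response": resp.model_dump()}) + "\n")
```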
Alternatives
It's possible to simply instantiate the API server and manage parallelizing requests today, but that adds complication to user code: request parallelization is important and difficult to get right, and ensuring the HTTP server's port is available isn't always trivial in failure conditions.
Alternative APIs:
Python API: Nothing in this proposal prevents or makes it more difficult to introduce a Python API later, though there are additional interface decisions to make (async vs. synchronous, exclusive vs. shared engine, etc.).
REST API: The OpenAI REST API uses an input_file_id field. It's not obvious to me how implementing this field in the same way as OpenAI (with validation, etc.) should interact with the rest of the vLLM project, so I will leave it out of scope for now (perhaps others can chip in if this is a desired feature).
Additional context
No response