[Frontend] Support OpenAI batch file format #4794
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
# Offline Inference with the OpenAI Batch file format | ||
|
||
**NOTE:** This is a guide to performing batch inference using the OpenAI batch file format, **NOT** the complete Batch (REST) API. | ||
|
||
## File Format | ||
|
||
The OpenAI batch file format consists of a series of JSON objects, one per line. | ||
|
||
[See here for an example file.](https://github.com/vllm-project/vllm/blob/main/examples/openai_example_batch.jsonl) | ||
|
||
Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details. | ||
|
||
**NOTE:** We currently only support the `/v1/chat/completions` endpoint (embeddings and completions coming soon). | ||
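
As a minimal sketch of the format described above, the following Python snippet builds a two-request batch file using only the standard library. The helper name `make_request` and the output filename `my_batch.jsonl` are illustrative assumptions, not part of vLLM; the field names mirror the example file.

```python
import json

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def make_request(custom_id, system_prompt, user_prompt, max_tokens=1000):
    """Build one per-line request object (hypothetical helper)."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "max_tokens": max_tokens,
        },
    }


batch = [
    make_request("request-1", "You are a helpful assistant.", "Hello world!"),
    make_request("request-2", "You are an unhelpful assistant.", "Hello world!"),
]

# Write one JSON object per line, with no blank lines in between.
with open("my_batch.jsonl", "w") as f:
    for request in batch:
        f.write(json.dumps(request) + "\n")
```

Each `custom_id` should be unique within the batch, since it is what ties an output line back to its request.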
|
||
## Pre-requisites | ||
|
||
* Ensure you are using `vllm >= 0.4.3`. You can check by running `python -c "import vllm; print(vllm.__version__)"`. | ||
* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`. | ||
    - Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens) | ||
    - Install the token on your machine (Run `huggingface-cli login`). | ||
    - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions. | ||
|
||
|
||
## Example 1: Running with a local file | ||
|
||
### Step 1: Create your batch file | ||
|
||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory. | ||
|
||
``` | ||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl | ||
``` | ||
|
||
Once you've created your batch file, it should look like this: | ||
|
||
``` | ||
$ cat openai_example_batch.jsonl | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
``` | ||
|
||
### Step 2: Run the batch | ||
|
||
The batch running tool is designed to be used from the command line. | ||
|
||
You can run the batch with the following command, which writes its results to a file called `results.jsonl`: | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch -i openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
### Step 3: Check your results | ||
|
||
You should now have your results in `results.jsonl`. You can check them by running `cat results.jsonl`: | ||
|
||
``` | ||
$ cat results.jsonl | ||
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null} | ||
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null} | ||
``` | ||
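
To consume the output programmatically, each line can be parsed as JSON and matched back to its request through `custom_id`. The sketch below assumes the output shape shown above; the function name `summarize_results` is hypothetical, and the sample lines are abbreviated stand-ins with the same structure as `results.jsonl`.

```python
import json


def summarize_results(lines):
    """Map each custom_id to the assistant's reply, or to None on failure.

    `lines` is an iterable of JSON strings, one per request. Assumption:
    a failed request has a non-null `error` or a null `response`.
    """
    summary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        record = json.loads(line)
        if record.get("error") is not None or record.get("response") is None:
            summary[record["custom_id"]] = None
            continue
        choice = record["response"]["choices"][0]
        summary[record["custom_id"]] = choice["message"]["content"]
    return summary


# Abbreviated sample lines; real output carries more fields per record.
sample = [
    '{"id":"vllm-1","custom_id":"request-1","response":{"choices":[{"index":0,"message":{"role":"assistant","content":"Hello!"}}]},"error":null}',
    '{"id":"vllm-2","custom_id":"request-2","response":{"choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"}}]},"error":null}',
]
print(summarize_results(sample))
```

Because an open file is itself an iterable of lines, passing the file object from `open("results.jsonl")` to `summarize_results` works the same way.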
|
||
## Example 2: Using remote files | ||
|
||
The batch runner supports remote input and output urls that are accessible via http/https. | ||
|
||
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl`, you can run | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
## Example 3: Integrating with AWS S3 | ||
|
||
To integrate with cloud blob storage, we recommend using presigned urls. | ||
|
||
[Learn more about S3 presigned urls here] | ||
|
||
### Additional prerequisites | ||
|
||
* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). | ||
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3. | ||
    - [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). | ||
* The `boto3` python package (Run `pip install boto3`) to generate presigned urls. | ||
|
||
### Step 1: Upload your input file | ||
|
||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory. | ||
|
||
``` | ||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl | ||
``` | ||
|
||
Once you've created your batch file, it should look like this: | ||
|
||
``` | ||
$ cat openai_example_batch.jsonl | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
``` | ||
|
||
Now upload your batch file to your S3 bucket. | ||
|
||
``` | ||
aws s3 cp openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl | ||
``` | ||
|
||
|
||
### Step 2: Generate your presigned urls | ||
|
||
Presigned put urls can only be generated via the SDK. You can run the following python script to generate your presigned urls. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names. | ||
|
||
(The script is adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py) | ||
|
||
``` | ||
import boto3 | ||
from botocore.exceptions import ClientError | ||
|
||
def generate_presigned_url(s3_client, client_method, method_parameters, expires_in): | ||
    """ | ||
    Generate a presigned Amazon S3 URL that can be used to perform an action. | ||
|
||
    :param s3_client: A Boto3 Amazon S3 client. | ||
    :param client_method: The name of the client method that the URL performs. | ||
    :param method_parameters: The parameters of the specified client method. | ||
    :param expires_in: The number of seconds the presigned URL is valid for. | ||
    :return: The presigned URL. | ||
    """ | ||
    try: | ||
        url = s3_client.generate_presigned_url( | ||
            ClientMethod=client_method, Params=method_parameters, ExpiresIn=expires_in | ||
        ) | ||
    except ClientError: | ||
        raise | ||
    return url | ||
|
||
|
||
s3_client = boto3.client("s3") | ||
input_url = generate_presigned_url( | ||
    s3_client, "get_object", {"Bucket": "MY_BUCKET", "Key": "MY_INPUT_FILE.jsonl"}, 3600 | ||
) | ||
output_url = generate_presigned_url( | ||
    s3_client, "put_object", {"Bucket": "MY_BUCKET", "Key": "MY_OUTPUT_FILE.jsonl"}, 3600 | ||
) | ||
print(f"{input_url=}") | ||
print(f"{output_url=}") | ||
``` | ||
|
||
This script should print output similar to the following: | ||
|
||
``` | ||
input_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091' | ||
output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091' | ||
``` | ||
|
||
### Step 3: Run the batch runner using your presigned urls | ||
|
||
You can now run the batch runner, using the urls generated in the previous section. | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch \ | ||
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \ | ||
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \ | ||
    --model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
### Step 4: View your results | ||
|
||
Your results are now on S3. You can view them in your terminal by running | ||
|
||
``` | ||
aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl - | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
import subprocess | ||
import sys | ||
import tempfile | ||
|
||
from vllm.entrypoints.openai.protocol import BatchRequestOutput | ||
|
||
# ruff: noqa: E501 | ||
INPUT_BATCH = """{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}""" | ||
|
||
INVALID_INPUT_BATCH = """{"invalid_field": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}""" | ||
|
||
|
||
def test_e2e(): | ||
Review comment: Is there any way we can add a test that the input / output formats conform to the OpenAI API? |
Review comment: sure |
||
    with tempfile.NamedTemporaryFile( | ||
            "w") as input_file, tempfile.NamedTemporaryFile( | ||
                "r") as output_file: | ||
        input_file.write(INPUT_BATCH) | ||
        input_file.flush() | ||
        proc = subprocess.Popen([ | ||
            sys.executable, "-m", "vllm.entrypoints.openai.run_batch", "-i", | ||
            input_file.name, "-o", output_file.name, "--model", | ||
            "NousResearch/Meta-Llama-3-8B-Instruct" | ||
        ], ) | ||
        proc.communicate() | ||
        proc.wait() | ||
        assert proc.returncode == 0, f"{proc=}" | ||
|
||
        contents = output_file.read() | ||
        for line in contents.strip().split("\n"): | ||
            # Ensure that the output format conforms to the openai api. | ||
            # Validation should throw if the schema is wrong. | ||
            BatchRequestOutput.model_validate_json(line) | ||
|
||
|
||
def test_e2e_invalid_input(): | ||
    """ | ||
    Ensure that we fail when the input doesn't conform to the openai api. | ||
    """ | ||
    with tempfile.NamedTemporaryFile( | ||
            "w") as input_file, tempfile.NamedTemporaryFile( | ||
                "r") as output_file: | ||
        input_file.write(INVALID_INPUT_BATCH) | ||
        input_file.flush() | ||
        proc = subprocess.Popen([ | ||
            sys.executable, "-m", "vllm.entrypoints.openai.run_batch", "-i", | ||
            input_file.name, "-o", output_file.name, "--model", | ||
            "NousResearch/Meta-Llama-3-8B-Instruct" | ||
        ], ) | ||
        proc.communicate() | ||
        proc.wait() | ||
        assert proc.returncode != 0, f"{proc=}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -492,3 +492,44 @@ class ChatCompletionStreamResponse(OpenAIBaseModel): | |
    model: str | ||
    choices: List[ChatCompletionResponseStreamChoice] | ||
    usage: Optional[UsageInfo] = Field(default=None) | ||
|
||
|
||
class BatchRequestInput(OpenAIBaseModel): | ||
    """ | ||
    The per-line object of the batch input file. | ||
|
||
    NOTE: Currently only the `/v1/chat/completions` endpoint is supported. | ||
    """ | ||
|
||
    # A developer-provided per-request id that will be used to match outputs to | ||
    # inputs. Must be unique for each request in a batch. | ||
    custom_id: str | ||
|
||
    # The HTTP method to be used for the request. Currently only POST is | ||
    # supported. | ||
    method: str | ||
|
||
    # The OpenAI API relative URL to be used for the request. Currently | ||
    # /v1/chat/completions is supported. | ||
Review comment: Is there a reason we cannot support |
Review comment: [ This could be done in a follow up PR ] |
Review comment: Yes, we can support the other 2 endpoints in follow ups. |
||
    url: str | ||
|
||
    # The parameters of the request. | ||
    body: Union[ChatCompletionRequest, ] | ||
|
||
|
||
class BatchRequestOutput(OpenAIBaseModel): | ||
    """ | ||
    The per-line object of the batch output and error files | ||
    """ | ||
|
||
    id: str | ||
|
||
    # A developer-provided per-request id that will be used to match outputs to | ||
    # inputs. | ||
    custom_id: str | ||
|
||
    response: Optional[ChatCompletionResponse] | ||
|
||
    # For requests that failed with a non-HTTP error, this will contain more | ||
    # information on the cause of the failure. | ||
    error: Optional[Any] |
Review comment: These example links point to master, so they don't work yet, but they should work once this is merged.