[Frontend] Support OpenAI batch file format (vllm-project#4794)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
1 parent
ac8735f
commit 6643d72
Showing 7 changed files with 415 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
# Offline Inference with the OpenAI Batch file format | ||
|
||
**NOTE:** This is a guide to performing batch inference using the OpenAI batch file format, **NOT** the complete Batch (REST) API. | ||
|
||
## File Format | ||
|
||
The OpenAI batch file format consists of a series of JSON objects, one per line. | ||
|
||
[See here for an example file.](https://github.com/vllm-project/vllm/blob/main/examples/openai_example_batch.jsonl) | ||
|
||
Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details. | ||
|
||
**NOTE:** We currently support only the `/v1/chat/completions` endpoint (embeddings and completions coming soon). | ||
|
||
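As a rough illustration of the format, the sketch below parses a batch file line by line and checks for the four top-level keys that the example requests in this guide use (`custom_id`, `method`, `url`, `body`). The helper names `parse_batch_line` and `parse_batch_text` are hypothetical, and this is not how vLLM itself validates input; it only shows what "one JSON object per line" means in practice.

```python
import json

# Top-level keys used by the example requests in this guide.
REQUIRED_KEYS = ("custom_id", "method", "url", "body")


def parse_batch_line(line: str) -> dict:
    """Parse one line of a batch file and check its top-level keys."""
    request = json.loads(line)
    if not isinstance(request, dict):
        raise ValueError("each line must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in request]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return request


def parse_batch_text(text: str) -> list[dict]:
    """Parse a whole batch file, skipping blank lines."""
    return [parse_batch_line(line) for line in text.splitlines() if line.strip()]
```

Checking a file this way before starting a long run can catch a misspelled key early, since a malformed line is cheaper to find before the model is loaded.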
## Pre-requisites | ||
|
||
* Ensure you are using `vllm >= 0.4.3`. You can check by running `python -c "import vllm; print(vllm.__version__)"`. | ||
* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`. | ||
    - Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens) | ||
    - Install the token on your machine (Run `huggingface-cli login`). | ||
    - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions. | ||
|
||
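If you would rather check the version requirement from a script than by eye, the following is a minimal sketch. The helpers `version_tuple` and `meets_minimum` are hypothetical names, and the comparison looks only at the leading numeric release components, so a suffix such as `.post1` or `.dev1` is ignored.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "0.4.3") -> bool:
    """Compare release numbers only; suffixes such as '.post1' are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


# Usage (requires vllm to be installed):
#   import vllm
#   print(meets_minimum(vllm.__version__))
```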
|
||
## Example 1: Running with a local file | ||
|
||
### Step 1: Create your batch file | ||
|
||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory. | ||
|
||
``` | ||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl | ||
``` | ||
|
||
Once you've created your batch file, it should look like this: | ||
|
||
``` | ||
$ cat openai_example_batch.jsonl | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
``` | ||
|
||
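If you prefer to create your own batch file programmatically, the sketch below writes requests in the same shape as the example file. The function names `make_request` and `write_batch_file` are hypothetical, and the default system prompt and `max_tokens` value are simply copied from the example above.

```python
import json


def make_request(custom_id: str, model: str, user_prompt: str,
                 system_prompt: str = "You are a helpful assistant.",
                 max_tokens: int = 1000) -> dict:
    """Build one chat-completions request in the batch file format."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "max_tokens": max_tokens,
        },
    }


def write_batch_file(path: str, requests: list[dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for request in requests:
            f.write(json.dumps(request) + "\n")
```

For example, `write_batch_file("my_batch.jsonl", [make_request("request-1", "meta-llama/Meta-Llama-3-8B-Instruct", "Hello world!")])` produces a one-line batch file. Giving each request a distinct `custom_id` makes it possible to match results back to requests later.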
### Step 2: Run the batch | ||
|
||
The batch running tool is designed to be used from the command line. | ||
|
||
You can run the batch with the following command, which will write its results to a file called `results.jsonl`: | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch -i openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
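The same command can be launched from a Python script, much as the test file in this commit does with `subprocess`. The sketch below uses only the `-i`, `-o`, and `--model` flags shown above; the wrapper names `build_run_batch_command` and `run_batch` are hypothetical.

```python
import subprocess
import sys


def build_run_batch_command(input_file: str, output_file: str,
                            model: str) -> list[str]:
    """Assemble the run_batch command line as an argument list."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.run_batch",
        "-i", input_file,
        "-o", output_file,
        "--model", model,
    ]


def run_batch(input_file: str, output_file: str, model: str) -> int:
    """Run the batch and return the process exit code (0 on success)."""
    completed = subprocess.run(
        build_run_batch_command(input_file, output_file, model), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems, which matters once the input and output are long URLs with `&` characters in them.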
### Step 3: Check your results | ||
|
||
You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`: | ||
|
||
``` | ||
$ cat results.jsonl | ||
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null} | ||
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null} | ||
``` | ||
|
||
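Because each output line is itself a JSON object, the results are easy to post-process. The sketch below maps each `custom_id` to the assistant's reply, using only the fields visible in the sample output above (`custom_id`, `response`, `error`); the function name `summarize_results` is hypothetical.

```python
import json


def summarize_results(text: str) -> dict[str, str]:
    """Map each custom_id to the assistant reply, or to an error note."""
    summary = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("error") is not None:
            summary[record["custom_id"]] = f"error: {record['error']}"
            continue
        # Take the first choice, as in the sample output.
        choice = record["response"]["choices"][0]
        summary[record["custom_id"]] = choice["message"]["content"]
    return summary


# Usage:
#   with open("results.jsonl", encoding="utf-8") as f:
#       print(summarize_results(f.read()))
```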
## Example 2: Using remote files | ||
|
||
The batch runner supports remote input and output URLs that are accessible via HTTP/HTTPS. | ||
|
||
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl`, you can run: | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
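To make the local-versus-remote distinction concrete, here is a minimal sketch of how a path argument could be classified by its URL scheme. This is an illustration only, not the batch runner's actual logic, and `is_remote` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_remote(path_or_url: str) -> bool:
    """Return True for http:// and https:// URLs, False for anything else."""
    return urlparse(path_or_url).scheme in ("http", "https")
```

Note that under this rule an `s3://` URI does not count as an HTTP(S) URL.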
## Example 3: Integrating with AWS S3 | ||
|
||
To integrate with cloud blob storage, we recommend using presigned URLs. | ||
|
||
[Learn more about S3 presigned urls here] | ||
|
||
### Additional prerequisites | ||
|
||
* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). | ||
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use S3. | ||
    - [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html). | ||
* The `boto3` Python package (Run `pip install boto3`) to generate presigned URLs. | ||
|
||
### Step 1: Upload your input file | ||
|
||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory. | ||
|
||
``` | ||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_example_batch.jsonl | ||
``` | ||
|
||
Once you've created your batch file, it should look like this: | ||
|
||
``` | ||
$ cat openai_example_batch.jsonl | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
``` | ||
|
||
Now upload your batch file to your S3 bucket. | ||
|
||
``` | ||
aws s3 cp openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl | ||
``` | ||
|
||
|
||
### Step 2: Generate your presigned URLs | ||
|
||
Presigned PUT URLs can only be generated via the SDK. You can run the following Python script to generate your presigned URLs. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names. | ||
|
||
(The script is adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py) | ||
|
||
``` | ||
import boto3 | ||
from botocore.exceptions import ClientError | ||
def generate_presigned_url(s3_client, client_method, method_parameters, expires_in): | ||
    """ | ||
    Generate a presigned Amazon S3 URL that can be used to perform an action. | ||
    :param s3_client: A Boto3 Amazon S3 client. | ||
    :param client_method: The name of the client method that the URL performs. | ||
    :param method_parameters: The parameters of the specified client method. | ||
    :param expires_in: The number of seconds the presigned URL is valid for. | ||
    :return: The presigned URL. | ||
    """ | ||
    try: | ||
        url = s3_client.generate_presigned_url( | ||
            ClientMethod=client_method, Params=method_parameters, ExpiresIn=expires_in | ||
        ) | ||
    except ClientError: | ||
        raise | ||
    return url | ||
s3_client = boto3.client("s3") | ||
input_url = generate_presigned_url( | ||
    s3_client, "get_object", {"Bucket": "MY_BUCKET", "Key": "MY_INPUT_FILE.jsonl"}, 3600 | ||
) | ||
output_url = generate_presigned_url( | ||
    s3_client, "put_object", {"Bucket": "MY_BUCKET", "Key": "MY_OUTPUT_FILE.jsonl"}, 3600 | ||
) | ||
print(f"{input_url=}") | ||
print(f"{output_url=}") | ||
``` | ||
|
||
This script should produce output similar to the following (the key ID, signature, and expiry shown here are placeholders): | ||
|
||
``` | ||
input_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091' | ||
output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091' | ||
``` | ||
|
||
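Presigned URLs expire (the script above requests 3600 seconds), so a long-running batch can outlive its output URL. As a sketch, assuming the URL carries its expiry as a Unix timestamp in an `Expires` query parameter as in the sample output above, you could check the remaining lifetime before starting. Other signing schemes encode expiry differently, so treat the hypothetical `seconds_until_expiry` helper as illustrative only.

```python
import time
from typing import Optional
from urllib.parse import parse_qs, urlparse


def seconds_until_expiry(presigned_url: str,
                         now: Optional[float] = None) -> float:
    """Return seconds remaining before the URL's `Expires` timestamp."""
    query = parse_qs(urlparse(presigned_url).query)
    if "Expires" not in query:
        raise ValueError("no Expires parameter in URL")
    expires_at = int(query["Expires"][0])
    current = time.time() if now is None else now
    return expires_at - current
```

A negative result means the URL has already expired and a new one should be generated before running the batch.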
### Step 3: Run the batch runner using your presigned URLs | ||
|
||
You can now run the batch runner, using the URLs generated in the previous step. | ||
|
||
``` | ||
python -m vllm.entrypoints.openai.run_batch \ | ||
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \ | ||
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \ | ||
--model meta-llama/Meta-Llama-3-8B-Instruct | ||
``` | ||
|
||
### Step 4: View your results | ||
|
||
Your results are now on S3. You can view them in your terminal by running: | ||
|
||
``` | ||
aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl - | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
import subprocess | ||
import sys | ||
import tempfile | ||
|
||
from vllm.entrypoints.openai.protocol import BatchRequestOutput | ||
|
||
# ruff: noqa: E501 | ||
INPUT_BATCH = """{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}""" | ||
|
||
INVALID_INPUT_BATCH = """{"invalid_field": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}} | ||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "NousResearch/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}""" | ||
|
||
|
||
def test_e2e(): | ||
    with tempfile.NamedTemporaryFile( | ||
            "w") as input_file, tempfile.NamedTemporaryFile( | ||
                "r") as output_file: | ||
        input_file.write(INPUT_BATCH) | ||
        input_file.flush() | ||
        proc = subprocess.Popen([ | ||
            sys.executable, "-m", "vllm.entrypoints.openai.run_batch", "-i", | ||
            input_file.name, "-o", output_file.name, "--model", | ||
            "NousResearch/Meta-Llama-3-8B-Instruct" | ||
        ], ) | ||
        proc.communicate() | ||
        proc.wait() | ||
        assert proc.returncode == 0, f"{proc=}" | ||
|
||
        contents = output_file.read() | ||
        for line in contents.strip().split("\n"): | ||
            # Ensure that the output format conforms to the openai api. | ||
            # Validation should throw if the schema is wrong. | ||
            BatchRequestOutput.model_validate_json(line) | ||
|
||
|
||
def test_e2e_invalid_input(): | ||
    """ | ||
    Ensure that we fail when the input doesn't conform to the openai api. | ||
    """ | ||
    with tempfile.NamedTemporaryFile( | ||
            "w") as input_file, tempfile.NamedTemporaryFile( | ||
                "r") as output_file: | ||
        input_file.write(INVALID_INPUT_BATCH) | ||
        input_file.flush() | ||
        proc = subprocess.Popen([ | ||
            sys.executable, "-m", "vllm.entrypoints.openai.run_batch", "-i", | ||
            input_file.name, "-o", output_file.name, "--model", | ||
            "NousResearch/Meta-Llama-3-8B-Instruct" | ||
        ], ) | ||
        proc.communicate() | ||
        proc.wait() | ||
        assert proc.returncode != 0, f"{proc=}" |