[Frontend] Add OpenAI Vision API Support #5237

Merged
54 commits merged on Jun 7, 2024
The changes shown below are from 33 of the 54 commits.

Commits (54)
8ba11d4  initial (ywang96, Jun 2, 2024)
d361d20  iterate (ywang96, Jun 3, 2024)
fd5aba5  Merge branch 'main' into gpt4v-fe (ywang96, Jun 3, 2024)
730cda7  iterate (ywang96, Jun 3, 2024)
1c0b89d  iterate (ywang96, Jun 4, 2024)
520f5a0  iterate (ywang96, Jun 4, 2024)
3a57a6d  iterate (ywang96, Jun 4, 2024)
31b941b  adding test (ywang96, Jun 4, 2024)
9b3cf48  iterate (ywang96, Jun 4, 2024)
af94f8c  docstring (ywang96, Jun 4, 2024)
332dd10  remove unused lib (ywang96, Jun 4, 2024)
d52a907  revert hardcoded chat template (ywang96, Jun 4, 2024)
58746fc  address feedback (ywang96, Jun 4, 2024)
99d9197  update pytestmark (ywang96, Jun 4, 2024)
0b65271  apply asyncio mark (ywang96, Jun 4, 2024)
3a965d9  update doc (ywang96, Jun 4, 2024)
f9b9707  update test (ywang96, Jun 5, 2024)
04ebbf7  minor doc update (ywang96, Jun 5, 2024)
0cdd54f  minor doc update (ywang96, Jun 5, 2024)
82a0052  Clarify experiment support (ywang96, Jun 5, 2024)
dd01246  note regarding prompt format when using API server (ywang96, Jun 5, 2024)
e40da86  Merge branch 'main' into gpt4v-fe (ywang96, Jun 5, 2024)
088ad81  fix typo (ywang96, Jun 5, 2024)
daa7085  update template (ywang96, Jun 5, 2024)
1b32e2f  revert and update token count (ywang96, Jun 5, 2024)
c45b34e  update template (ywang96, Jun 5, 2024)
d6c1322  update (ywang96, Jun 5, 2024)
05fe635  update (ywang96, Jun 5, 2024)
938e5c9  template format (ywang96, Jun 5, 2024)
b9318bc  correct and add test for multi image (ywang96, Jun 5, 2024)
199ced7  fix test (ywang96, Jun 5, 2024)
9e686e0  Add unit test for `fetch_image` (DarkLight1337, Jun 5, 2024)
d9fbb17  Apply formatter (DarkLight1337, Jun 5, 2024)
2833ba0  address feedback (ywang96, Jun 6, 2024)
6c365bd  fix notes (ywang96, Jun 6, 2024)
26c38f1  use aiohttp (ywang96, Jun 6, 2024)
734e50b  fix test (ywang96, Jun 6, 2024)
0cd2931  test (ywang96, Jun 6, 2024)
561f07f  fix test (ywang96, Jun 6, 2024)
481fea8  update test (ywang96, Jun 6, 2024)
9585cc6  update fixture (ywang96, Jun 6, 2024)
7f9500d  fix field (ywang96, Jun 6, 2024)
32d1a25  fix field (ywang96, Jun 6, 2024)
1e665b7  format (ywang96, Jun 6, 2024)
cce804e  fix image loading (ywang96, Jun 6, 2024)
31b219c  revert change that merges fetch and parse (ywang96, Jun 6, 2024)
dcf8c8d  add encoded image fixture (ywang96, Jun 6, 2024)
a9a9712  Merge branch 'main' into gpt4v-fe (ywang96, Jun 6, 2024)
89a452a  update fetch image and remove unused fixture (ywang96, Jun 6, 2024)
4e3eca9  cleanup (ywang96, Jun 6, 2024)
afadfac  fix fixture (ywang96, Jun 6, 2024)
d3bae73  remove unused client close (ywang96, Jun 6, 2024)
a149368  add TODO and format (ywang96, Jun 6, 2024)
72d4bc4  address comment (ywang96, Jun 7, 2024)
60 changes: 59 additions & 1 deletion docs/source/models/vlm.rst
@@ -3,7 +3,7 @@
Using VLMs
==========

This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.
vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.

Engine Arguments
----------------
@@ -54,3 +54,61 @@ For now, we only support a single image per text prompt. To pass an image to the
print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

Online OpenAI Vision API Compatible Inference
----------------------------------------------

You can serve vision language models with vLLM's HTTP server, which is compatible with the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

.. note::
    Currently, vLLM supports only a **single** ``image_url`` input per ``messages`` list. Support for multi-image inputs
    will be added in the future.

Below is an example of how to launch the same ``llava-hf/llava-1.5-7b-hf`` model with the vLLM API server.

.. important::
    Since the OpenAI Vision API is based on the `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template
    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the
    HuggingFace Llava chat template, which you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
        --image-input-type pixel_values \
        --image-token-id 32000 \
        --image-input-shape 1,3,336,336 \
        --image-feature-size 576 \
        --chat-template template_llava.jinja

To consume the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response)

.. note::
    Prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server, since the prompt is
    processed automatically by the server.
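
The Vision API format also accepts images embedded directly in the request as base64 data URLs, which is the same path the new tests in this PR exercise. A minimal sketch, assuming the server launched above is running and a local file named `example.jpg` (hypothetical) exists:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical local file; any JPEG/PNG the model should describe.
with open("example.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                # Base64-encoded images are sent as data URLs.
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
            },
        ],
    }],
)
print("Chat response:", chat_response)
```
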
4 changes: 3 additions & 1 deletion docs/source/serving/openai_compatible_server.md
@@ -30,6 +30,8 @@ Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-refer
- Chat: `tools`, and `tool_choice`.
- Completions: `suffix`.

vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).

## Extra Parameters
vLLM supports a set of parameters that are not part of the OpenAI API.
In order to use them, you can pass them as extra parameters in the OpenAI client.
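
For example, vLLM-specific sampling parameters such as `top_k` or `repetition_penalty` can be supplied through the OpenAI client's generic `extra_body` argument; a minimal sketch against the server launched earlier in this PR:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{"role": "user", "content": "Say hello."}],
    # Parameters outside the OpenAI API go into extra_body and are forwarded
    # to vLLM as extra request fields.
    extra_body={"top_k": 10, "repetition_penalty": 1.05},
)
print(completion.choices[0].message.content)
```
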
@@ -120,4 +122,4 @@ It is the callers responsibility to prompt the model with the tool information,

vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.

Please refer to the OpenAI API reference documentation for more information.
Please refer to the OpenAI API reference documentation for more information.
23 changes: 23 additions & 0 deletions examples/template_llava.jinja
@@ -0,0 +1,23 @@
{%- if messages[0]['role'] == 'system' -%}
    {%- set system_message = messages[0]['content'] -%}
    {%- set messages = messages[1:] -%}
{%- else -%}
    {% set system_message = '' -%}
{%- endif -%}

{{ bos_token + system_message }}
{%- for message in messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {%- endif -%}

    {%- if message['role'] == 'user' -%}
        {{ 'USER: ' + message['content'] + '\n' }}
    {%- elif message['role'] == 'assistant' -%}
        {{ 'ASSISTANT: ' + message['content'] + eos_token + '\n' }}
    {%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
    {{ 'ASSISTANT:' }}
{% endif %}
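
A quick way to check what this template produces is to render it offline with the HuggingFace tokenizer's `apply_chat_template`; a rough sketch (the message content is illustrative, and the template path assumes a checkout of this repository):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

with open("examples/template_llava.jinja") as f:
    chat_template = f.read()

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat's in this image?"}],
    chat_template=chat_template,
    tokenize=False,
    add_generation_prompt=True,
)
# Expected to look roughly like:
# <s>USER: <image>
# What's in this image?
# ASSISTANT:
print(prompt)
```

With `add_generation_prompt=True`, the template appends the trailing `ASSISTANT:` cue, which is how the API server primes the model for a response.
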
274 changes: 274 additions & 0 deletions tests/entrypoints/test_openai_vision.py
@@ -0,0 +1,274 @@
from pathlib import Path

import openai
import pytest
import ray

from vllm.multimodal.utils import encode_image_base64, fetch_image

from ..utils import ServerRunner

MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
LLAVA_CHAT_TEMPLATE = (Path(__file__).parent.parent.parent /
                       "examples/template_llava.jinja")
assert LLAVA_CHAT_TEMPLATE.exists()
# Test different image extensions (JPG/PNG) and formats (gray/RGB/RGBA)
TEST_IMAGE_URLS = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
"https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png",
"https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Venn_diagram_rgb.svg/1280px-Venn_diagram_rgb.svg.png",
"https://upload.wikimedia.org/wikipedia/commons/0/0b/RGBA_comp.png",
]

pytestmark = pytest.mark.openai


@pytest.fixture(scope="module")
def server():
    ray.init()
    server_runner = ServerRunner.remote([
        "--model",
        MODEL_NAME,
        "--dtype",
        "bfloat16",
        "--max-model-len",
        "4096",
        "--enforce-eager",
        "--image-input-type",
        "pixel_values",
        "--image-token-id",
        "32000",
        "--image-input-shape",
        "1,3,336,336",
        "--image-feature-size",
        "576",
        "--chat-template",
        str(LLAVA_CHAT_TEMPLATE),
    ])
    ray.get(server_runner.ready.remote())
    yield server_runner
    ray.shutdown()


@pytest.fixture(scope="session")
def client():
    client = openai.AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc123",
    )
    yield client


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_single_chat_session_image(server, client: openai.AsyncOpenAI,
                                         model_name: str, image_url: str):
    messages = [{
        "role":
        "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            },
            {
                "type": "text",
                "text": "What's in this image?"
            },
        ],
    }]

    # test single completion
    chat_completion = await client.chat.completions.create(model=model_name,
                                                           messages=messages,
                                                           max_tokens=10,
                                                           logprobs=True,
                                                           top_logprobs=5)
    assert len(chat_completion.choices) == 1

    choice = chat_completion.choices[0]
    assert choice.finish_reason == "length"
    assert chat_completion.usage == openai.types.CompletionUsage(
        completion_tokens=10, prompt_tokens=596, total_tokens=606)

    message = choice.message
    message = chat_completion.choices[0].message
    assert message.content is not None and len(message.content) >= 10
    assert message.role == "assistant"
    messages.append({"role": "assistant", "content": message.content})

    # test multi-turn dialogue
    messages.append({"role": "user", "content": "express your result in json"})
    chat_completion = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=10,
    )
    message = chat_completion.choices[0].message
    assert message.content is not None and len(message.content) >= 0


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_single_chat_session_image_base64encoded(
        server, client: openai.AsyncOpenAI, model_name: str, image_url: str):

    image_encoded = encode_image_base64(fetch_image(image_url=image_url))
    messages = [{
        "role":
        "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_encoded}"
                }
            },
            {
                "type": "text",
                "text": "What's in this image?"
            },
        ],
    }]

    # test single completion
    chat_completion = await client.chat.completions.create(model=model_name,
                                                           messages=messages,
                                                           max_tokens=10,
                                                           logprobs=True,
                                                           top_logprobs=5)
    assert len(chat_completion.choices) == 1

    choice = chat_completion.choices[0]
    assert choice.finish_reason == "length"
    assert chat_completion.usage == openai.types.CompletionUsage(
        completion_tokens=10, prompt_tokens=596, total_tokens=606)

    message = choice.message
    message = chat_completion.choices[0].message
    assert message.content is not None and len(message.content) >= 10
    assert message.role == "assistant"
    messages.append({"role": "assistant", "content": message.content})

    # test multi-turn dialogue
    messages.append({"role": "user", "content": "express your result in json"})
    chat_completion = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=10,
    )
    message = chat_completion.choices[0].message
    assert message.content is not None and len(message.content) >= 0


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_chat_streaming_image(server, client: openai.AsyncOpenAI,
                                    model_name: str, image_url: str):
    messages = [{
        "role":
        "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            },
            {
                "type": "text",
                "text": "What's in this image?"
            },
        ],
    }]

    # test single completion
    chat_completion = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=10,
        temperature=0.0,
    )
    output = chat_completion.choices[0].message.content
    stop_reason = chat_completion.choices[0].finish_reason

    # test streaming
    stream = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=10,
        temperature=0.0,
        stream=True,
    )
    chunks = []
    finish_reason_count = 0
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.role:
            assert delta.role == "assistant"
        if delta.content:
            chunks.append(delta.content)
        if chunk.choices[0].finish_reason is not None:
            finish_reason_count += 1
    # finish reason should only return in last block
    assert finish_reason_count == 1
    assert chunk.choices[0].finish_reason == stop_reason
    assert delta.content
    assert "".join(chunks) == output


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_multi_image_input(server, client: openai.AsyncOpenAI,
                                 model_name: str, image_url: str):

    messages = [{
        "role":
        "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            },
            {
                "type": "text",
                "text": "What's in this image?"
            },
        ],
    }]

    with pytest.raises(openai.BadRequestError):  # test multi-image input
        await client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_tokens=10,
            temperature=0.0,
        )

    # the server should still work afterwards
    completion = await client.completions.create(
        model=model_name,
        prompt=[0, 0, 0, 0, 0],
        max_tokens=5,
        temperature=0.0,
    )
    completion = completion.choices[0].text
    assert completion is not None and len(completion) >= 0


if __name__ == "__main__":
    pytest.main([__file__])