
[Doc/Feature]: Llava 1.5 in OpenAI compatible server #3873

Closed
stikkireddy opened this issue Apr 5, 2024 · 10 comments
Labels
documentation — Improvements or additions to documentation
good first issue — Good for newcomers
help wanted — Extra attention is needed

Comments

@stikkireddy

📚 The doc issue

Hey vLLM team, it looks like support for LLaVA 1.5 has been added, but there are no docs or examples on how to use it via the API server. Are there any reference examples for using LLaVA via the OpenAI SDK?

Suggest a potential alternative/fix

No response

@stikkireddy stikkireddy added the documentation Improvements or additions to documentation label Apr 5, 2024
@simon-mo simon-mo added the help wanted Extra attention is needed label Apr 5, 2024
@simon-mo
Collaborator

simon-mo commented Apr 5, 2024

I believe the image input protocol has indeed not been implemented yet! This is more than documentation.

@simon-mo simon-mo changed the title [Doc]: Llava 1.5 documentation via OpenAI compatible server [Doc/Feature]: Llava 1.5 in OpenAI compatible server Apr 5, 2024
@simon-mo simon-mo added the good first issue Good for newcomers label Apr 5, 2024
@alsichcan

alsichcan commented Apr 9, 2024

The PR #3042, which introduced the LLaVA feature, does not appear to include support for the OpenAI-compatible server. Based on the documentation, it should be feasible to extend the existing OpenAI-compatible server (see the Image Input tab in the linked documentation) to support this feature without developing a dedicated server specifically for image inputs. However, it's important to note the distinctions between GPT-4V and LLaVA, in particular that LLaVA currently does not support multiple image inputs or the 'detail' parameter.

According to OpenAI Documentation,

GPT-4 with vision is currently available to all developers who have access to GPT-4 via the gpt-4-vision-preview model and the Chat Completions API which has been updated to support image inputs.

  • GPT-4 Turbo with vision may behave slightly differently than GPT-4 Turbo, due to a system message we automatically insert into the conversation
  • GPT-4 Turbo with vision is the same as the GPT-4 Turbo preview model and performs equally as well on text tasks but has vision capabilities added
  • Vision is just one of many capabilities the model has

Example of uploading base64-encoded images

import base64
import requests

# OpenAI API Key
api_key = "YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())
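
For comparison, here is a minimal sketch of what an equivalent request could look like against a locally hosted vLLM OpenAI-compatible server, assuming it were extended to accept image inputs as proposed above. The localhost URL, port, placeholder API key, and model name are illustrative assumptions, not currently supported behavior.

import base64
import requests

# Hypothetical local vLLM OpenAI-compatible endpoint; the URL, port, API key,
# and model name below are illustrative and assume image-input support exists.
base_url = "http://localhost:8000/v1"
api_key = "EMPTY"  # placeholder; a local server may not require a real key

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("path_to_your_image.jpg")

payload = {
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(
    f"{base_url}/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
    json=payload,
)
print(response.json())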

Please inform me if anyone is already working on implementing this feature.
If not, I'm willing to take on the task and aim to complete it by the end of April. (Hopefully)

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2024

Based on examples/llava_example.py, I have recently forked vllm-rocm to support image input by refactoring OpenAIServingChat. I have already verified that the model generates useful output when given OpenAI's quick start example.

Note: This change adds pillow as a dependency since it is used to read the image from bytes.

However, there is more work to be done:

  • The only model I have tested so far is llava-hf/llava-1.5-7b-hf, since vLLM has existing support for its LlavaForConditionalGeneration architecture. Unfortunately, their config does not provide a chat template, so you have to provide it via the command line (--chat-template examples/template_llava.jinja), which is quite inconvenient (see the usage sketch at the end of this comment).
  • We should at least add support for LlavaLlamaForCausalLM architecture which is adopted by the original author (liuhaotian/llava-v1.5-7b).
  • It is unclear whether this change enables the API to work with other VLMs.

UPDATE: I have created a new branch on my fork (openai-vision-api) that consolidates my changes so far. The original upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.
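
As a rough usage sketch for the setup described above: assuming the server is launched with the --chat-template examples/template_llava.jinja flag mentioned earlier (e.g. python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-1.5-7b-hf --chat-template examples/template_llava.jinja), a request through the official OpenAI Python SDK might look like this. The base URL, API key, and image URL below are placeholders, and the exact flags may differ on the openai-vision-api branch.

from openai import OpenAI

# Placeholder connection details for a locally hosted vLLM server; adjust
# base_url and api_key to match your deployment. The image URL is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/your_image.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)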

@stikkireddy
Author

Thankfully I only need LLaVA 😄!
@DarkLight1337 do you plan on pushing this back to vLLM along with the chat template?

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2024

Thankfully I only need LLaVA 😄! @DarkLight1337 do you plan on pushing this back to vLLM along with the chat template?

I'll create a PR once more testing has been done.

It would be great if we could compile a list of models that work/don't work with my implementation of this API. Currently, I assume that at most one image is provided since it appears that this is also the case for vLLM internals. How difficult would it be to support multiple images (possibly of different sizes)?

@simon-mo
Collaborator

simon-mo commented Apr 9, 2024

Do there exist models that support multiple image inputs?

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2024

GPT-4's API supports multiple images, so I guess their model can already handle such input.

Looking at open source models, I found that MMICL explicitly supports multiple images per text prompt. They use <imagej> as the token to represent the jth image. To accommodate this, we may need to add a config option to specify how to insert image tokens into the text prompt. Currently, we use <image> * image_feature_size to represent each image; it would be more convenient to follow the original models, which only use a single <image> token per image regardless of feature size.
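
To make the two conventions concrete, here is a small illustrative sketch; the function name, the example prompt, and the feature size of 576 are hypothetical choices for illustration, not part of vLLM's API.

# Illustrative only: expand_image_tokens and the value 576 are hypothetical.
def expand_image_tokens(prompt: str, image_feature_size: int,
                        image_token: str = "<image>") -> str:
    # Replace each single <image> placeholder (the convention of the original
    # models) with the repeated form described above, i.e. the token string
    # duplicated image_feature_size times per image.
    return prompt.replace(image_token, image_token * image_feature_size)

# A LLaVA-1.5-style prompt; 576 is used here purely as an example of a
# per-image feature count.
user_prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
expanded = expand_image_tokens(user_prompt, image_feature_size=576)
print(expanded.count("<image>"))  # prints 576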

@DarkLight1337
Member

DarkLight1337 commented Apr 10, 2024

I have opened a PR to support single-image input, with a POC using llava-hf/llava-1.5-7b-hf. Hopefully, this is enough to get the ball rolling.

We can deal with multi-image input further down the line.

NOTE: If you have previously checked out the upstream branch based on this issue, please note that my changes have been moved to the openai-vision-api branch; the upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.

@ywang96
Member

ywang96 commented May 25, 2024

FYI - this is WIP and we plan to have it in the next major release. See our plan here #4194 (comment)

@ywang96
Member

ywang96 commented Jun 7, 2024

Closing this as we merged #5237

@ywang96 ywang96 closed this as completed Jun 7, 2024