[Model] Initialize Phi-3-vision support #4986
Merged
Commits (38, all by Isotr0py):

251752c  init phi3v support
70e7017  make phi3v work
618a2cb  remove debug code
ffb32fb  remove dropout from Phi3ImageEmbedding
76e6f8e  clean code
58330ba  optimize code structure
b61c2be  Add Phi3VImagePixelInputs
1e3e18c  refactor image embedding
591a9b8  format phi3v_example
b8599fc  Merge branch 'vllm-project:main' into phi3v
2e5dc27  refactor phi3v
e7fe213  Merge branch 'vllm-project:main' into phi3v
1b1f3f2  remove phi3v feature inputs
1a9e43f  Merge branch 'vllm-project:main' into phi3v
8db0d25  refactor phi3v
7388bcd  deprecate phi3v image_input
3f3d2b8  add phi3v test
0705a62  fix phi3v test
5f21d3f  format code
fe4e594  Merge branch 'vllm-project:main' into phi3v
3b7f86a  add phi3_v to get_full_image_text_prompt and test marker
ced2c3d  add docs
1d82bb1  clear phi3v model implement
4739c45  ignore phi3v cpu test
59fe2c1  fix doc strings
38ed4d9  fix phi3v test flash_attn import
d2fbecf  fix phi3v test
d99b684  Merge branch 'main' into phi3v
2bbaecd  add torchvision to requirements-test.txt
ce62fad  increase phi3v max_model_len to 2048
4b242dd  Merge branch 'vllm-project:main' into phi3v
1d78590  decrease phi3v max_tokens to 8
7a3ba90  Merge branch 'vllm-project:main' into phi3v
b95672f  Merge branch 'vllm-project:main' into phi3v
da1392c  optimize image embedding and update requirements.txt
9c080be  remove changing input_ids to -1
0cd9d26  fix a typo and image embedding
e77bb76  update comment for phi3v
New file (57 lines): an example script running Phi-3-vision with vLLM.

import os
import subprocess

from PIL import Image

from vllm import LLM, SamplingParams
from vllm.multimodal.image import ImagePixelData


def run_phi3v():
    model_path = "microsoft/Phi-3-vision-128k-instruct"
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_model_len=4096,
        image_input_type="pixel_values",
        image_token_id=32044,
        image_input_shape="1,3,1008,1344",
        image_feature_size=1921,
        disable_image_processor=False,
    )

    image = Image.open("images/cherry_blossom.jpg")

    # single-image prompt
    prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n"  # noqa: E501
    prompt = prompt.replace("<|image_1|>", "<|image|>" * 1921 + "<s>")

    sampling_params = SamplingParams(temperature=0, max_tokens=64)

    outputs = llm.generate({
        "prompt": prompt,
        "sampling_params": sampling_params,
        "multi_modal_data": ImagePixelData(image),
    })
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    s3_bucket_path = "s3://air-example-data-2/vllm_opensource_llava/"
    local_directory = "images"

    # Make sure the local directory exists or create it
    os.makedirs(local_directory, exist_ok=True)

    # Use AWS CLI to sync the directory, assume anonymous access
    subprocess.check_call([
        "aws",
        "s3",
        "sync",
        s3_bucket_path,
        local_directory,
        "--no-sign-request",
    ])
    run_phi3v()
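The one non-obvious step in the example above is the prompt expansion: the single `<|image_1|>` placeholder is replaced by `image_feature_size` (1921, for a 1008x1344 input) copies of `<|image|>` followed by `<s>`, so the prompt reserves one text position per image feature. A minimal sketch of just that expansion, with no vLLM dependency (the token strings and feature size are taken from the example above; the helper name is made up for illustration):

```python
# Sketch of the prompt expansion used in the example above: each image
# placeholder becomes `feature_size` copies of <|image|> plus a trailing <s>.
IMAGE_FEATURE_SIZE = 1921  # feature size for a 1008x1344 image, per the config above


def expand_image_prompt(prompt: str, feature_size: int = IMAGE_FEATURE_SIZE) -> str:
    # One <|image|> placeholder per image feature position.
    return prompt.replace("<|image_1|>", "<|image|>" * feature_size + "<s>")


prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n"
expanded = expand_image_prompt(prompt)
print(expanded.count("<|image|>"))  # 1921
```

This mirrors the `prompt.replace(...)` line in the script; the expanded prompt is what `llm.generate` actually receives alongside the pixel data.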
New file (124 lines): the Phi-3-vision model correctness test.

from typing import List, Tuple

import pytest
from transformers import AutoTokenizer

from vllm.config import VisionLanguageConfig
from vllm.utils import is_cpu

from ..conftest import IMAGE_FILES

pytestmark = pytest.mark.llava

# The image token is placed before "user" on purpose so that the test can pass
HF_IMAGE_PROMPTS = [
    "<|user|>\n<|image_1|>\nWhat's the content of the image?<|end|>\n<|assistant|>\n",  # noqa: E501
    "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n",
]

assert len(HF_IMAGE_PROMPTS) == len(IMAGE_FILES)


def iter_phi3v_configs(model_name: str):
    image_hw_to_feature_size = {
        (1008, 1344): 1921,
    }

    for (h, w), f in image_hw_to_feature_size.items():
        for input_type, input_shape in [
            (VisionLanguageConfig.ImageInputType.PIXEL_VALUES, (1, 3, h, w)),
        ]:
            yield (model_name,
                   VisionLanguageConfig(image_input_type=input_type,
                                        image_feature_size=f,
                                        image_token_id=32044,
                                        image_input_shape=input_shape,
                                        image_processor=model_name,
                                        image_processor_revision=None))


model_and_vl_config = [
    *iter_phi3v_configs("microsoft/Phi-3-vision-128k-instruct"),
]


def vllm_to_hf_output(vllm_output: Tuple[List[int], str],
                      vlm_config: VisionLanguageConfig, model_id: str):
    """Sanitize vllm output to be comparable with hf output.

    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
    It also reduces `output_str` from "<image><image>bla" to "bla".
    """
    input_ids, output_str = vllm_output
    image_token_id = vlm_config.image_token_id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    image_token_str = tokenizer.decode(image_token_id)

    hf_input_ids = [
        input_id if input_id != image_token_id else 0
        for input_id in input_ids
    ]
    hf_output_str = output_str \
        .replace(image_token_str * vlm_config.image_feature_size, "") \
        .replace("<s>", " ").replace("<|user|>", "") \
        .replace("<|end|>\n<|assistant|>", " ")

    return hf_input_ids, hf_output_str


target_dtype = "half"
if is_cpu():
    target_dtype = "bfloat16"


# TODO: Add test for `tensor_parallel_size` [ref: PR #3883]
# Since we use _attn_implementation="eager" for hf_runner, there is a
# numeric difference for longer contexts and the test can't pass.
@pytest.mark.parametrize("model_and_config", model_and_vl_config)
@pytest.mark.parametrize("dtype", [target_dtype])
@pytest.mark.parametrize("max_tokens", [8])
def test_models(hf_runner, vllm_runner, hf_images, vllm_images,
                model_and_config, dtype: str, max_tokens: int) -> None:
    """Inference result should be the same between hf and vllm.

    All the image fixtures for the test are under tests/images.
    For the huggingface runner, we provide the PIL images as input.
    For the vllm runner, we provide MultiModalData objects and the
    corresponding vision language config as input.
    Note, the text input is also adjusted to abide by the vllm contract.
    The text output is sanitized to be comparable with hf.
    """
    model_id, vlm_config = model_and_config

    # use eager mode for hf runner, since phi3_v doesn't work with flash_attn
    hf_model_kwargs = {"_attn_implementation": "eager"}
    with hf_runner(model_id, dtype=dtype,
                   model_kwargs=hf_model_kwargs) as hf_model:
        hf_outputs = hf_model.generate_greedy(HF_IMAGE_PROMPTS,
                                              max_tokens,
                                              images=hf_images)

    vllm_image_prompts = [
        p.replace("<|image_1|>",
                  "<|image|>" * vlm_config.image_feature_size + "<s>")
        for p in HF_IMAGE_PROMPTS
    ]

    with vllm_runner(model_id,
                     max_model_len=2048,
                     dtype=dtype,
                     enforce_eager=True,
                     **vlm_config.as_cli_args_dict()) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(vllm_image_prompts,
                                                  max_tokens,
                                                  images=vllm_images)

    for i in range(len(HF_IMAGE_PROMPTS)):
        hf_output_ids, hf_output_str = hf_outputs[i]
        vllm_output_ids, vllm_output_str = vllm_to_hf_output(
            vllm_outputs[i], vlm_config, model_id)
        assert hf_output_str == vllm_output_str, (
            f"Test{i}:\nHF: {hf_output_str!r}\nvLLM: {vllm_output_str!r}")
        assert hf_output_ids == vllm_output_ids, (
            f"Test{i}:\nHF: {hf_output_ids}\nvLLM: {vllm_output_ids}")
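The token-id half of `vllm_to_hf_output` is easy to illustrate in isolation: every occurrence of the image token id (32044 for Phi-3-vision, per the config in this test) in the vLLM-side `input_ids` is mapped to 0 so the sequence can be compared element-wise with the HF tokenization. A toy sketch with made-up ids (the helper name is hypothetical; only the mapping rule comes from the test above):

```python
# Toy illustration of the id sanitization in vllm_to_hf_output: positions
# holding the image token id are replaced with 0; everything else is kept.
IMAGE_TOKEN_ID = 32044  # Phi-3-vision image token id from the test config


def sanitize_input_ids(input_ids, image_token_id=IMAGE_TOKEN_ID):
    # Keep ordinary token ids, zero out image-placeholder positions.
    return [tok if tok != image_token_id else 0 for tok in input_ids]


# Made-up sequence: BOS, three image placeholders, then text tokens.
ids = [1, 32044, 32044, 32044, 5, 6, 7]
print(sanitize_input_ids(ids))  # [1, 0, 0, 0, 5, 6, 7]
```

The string half of the sanitization works analogously, stripping the repeated decoded image-token string and the chat-template markers before the equality assertions.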