[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Status: Closed
DarkLight1337 wants to merge 60 commits
Changes from 1 commit
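For context, the GPT-4V-style image input targeted by the PR title is expressed through image_url content parts in a Chat Completions request. A minimal client-side sketch follows; the server URL, model name, and image URL are placeholders rather than values taken from this PR.

```python
# Hypothetical request against an OpenAI-compatible server using the
# GPT-4V-style image input format. base_url, model, and the image URL are
# placeholders; only the message structure is the point of this sketch.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/sample.jpg"},
            },
        ],
    }],
)
print(response.choices[0].message.content)
```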
Commits (60, all authored by DarkLight1337)
874a581  Add basic support for OpenAI image input API (Apr 8, 2024)
607434e  Update documentation (Apr 9, 2024)
aaa6bfe  Add tests for OpenAI image input API and image loader (Apr 9, 2024)
26e7b2a  Merge branch 'upstream' into openai-vision-api (Apr 11, 2024)
44829b5  Apply formatter (Apr 11, 2024)
bccb367  Place image before text for `llava-hf` model (Apr 11, 2024)
b9302e8  Internally enable customization of merging image with text prompt (Apr 11, 2024)
a44d7d1  Fix errors in CI/CD (Apr 11, 2024)
561ad49  Merge branch 'upstream' into openai-vision-api (Apr 12, 2024)
4479605  Fix some type errors along the way (Apr 12, 2024)
20852d9  Improve async behaviour of loading images (Apr 12, 2024)
ce770f4  Use discriminated union in prompt parsing (Apr 12, 2024)
6b016bc  Fix some type errors along the way (Apr 12, 2024)
7620354  Some more fixes (Apr 12, 2024)
7c3e6d9  Apply formatter (Apr 12, 2024)
e74b0a7  Merge branch 'upstream' into openai-vision-api (Apr 12, 2024)
9925dcb  Move `openai` to common requirements (Apr 12, 2024)
ceb4e35  Fix typo in `_parse_chat_message_image_input` (Apr 12, 2024)
7bdc84e  Refactor prompt parsing so that it can be shared between Chat Complet… (Apr 12, 2024)
a7d1098  Make code more readable (Apr 12, 2024)
8b9d636  Move assertion to a more appropriate place (Apr 12, 2024)
9754142  Merge branch 'openai-typing' into openai-vision-api (Apr 12, 2024)
c48c13a  Add code documentation (Apr 12, 2024)
3530362  Decompose `_validate_prompt_and_tokenize` (Apr 12, 2024)
b8feec9  Fix missing import due to renaming (Apr 12, 2024)
9cae113  Merge branch 'openai-typing' into openai-vision-api (Apr 12, 2024)
89d9086  Merge branch 'upstream' into openai-typing (Apr 13, 2024)
cc1a5b3  Fix bug when parsing array of tokens (Apr 13, 2024)
f9c1135  Add token array to batch completions testing (Apr 13, 2024)
ecc2d50  Merge branch 'openai-typing' into openai-vision-api (Apr 14, 2024)
f2e8180  Replace legacy `conint` with `Annotated` field (Apr 14, 2024)
ce04842  Merge branch 'openai-typing' into openai-vision-api (Apr 14, 2024)
cdbf08a  Load image processor from HuggingFace (Apr 14, 2024)
9a336ec  Merge branch 'upstream' into openai-vision-api (Apr 14, 2024)
5722dd8  Allow disabling image processor (Apr 14, 2024)
6e1fa67  Fix errors when running the example and tests (Apr 15, 2024)
7ce44da  Merge branch 'upstream' into openai-vision-api (Apr 15, 2024)
9804604  Merge branch 'upstream' into openai-vision-api (Apr 16, 2024)
21434df  Add test for loading image processor by revision (Apr 16, 2024)
a5907b0  Temporary patch for llava-1.5-13b to facilitate testing (Apr 16, 2024)
f08ff10  Merge branch 'upstream' into openai-vision-api (Apr 17, 2024)
c126646  Fix issue with pickling config when serving LLaVA with multiple GPUs (Apr 17, 2024)
49ba216  Merge branch 'upstream' into openai-vision-api (Apr 18, 2024)
11e9921  Add TODO to test (Apr 18, 2024)
7ae80a2  Try to avoid OOM by using `--enforce-eager` (Apr 18, 2024)
2610bea  Reduce number of models to test to avoid OOM (Apr 18, 2024)
5ad2b67  Try testing 13b model only (Apr 18, 2024)
696357b  Refactor image processing, `MultiModalData` and LLaVA model (Apr 18, 2024)
483b190  Fix image processing not working directly, due to tensor being passed (Apr 18, 2024)
3e22017  Merge branch 'upstream' into openai-vision-api (Apr 18, 2024)
0b6af35  Revert to using 7b model in testing (Apr 18, 2024)
e4c3502  Get LLaVA-Next to work with fixed-size images (Apr 18, 2024)
21aaf3d  Apply formatter and fix typo (Apr 18, 2024)
ac95b79  Fix input shape not being based on config value (Apr 18, 2024)
9a9a4e7  Allow config to specify other image size for LLaVA-NeXT (Apr 18, 2024)
176ad2c  Improve error message to show the expected `image_feature_size` (Apr 18, 2024)
91ea044  Fix dtype mismatch in `multi_modal_kwargs` (Apr 19, 2024)
cb19743  Fix LLaVA example and test w.r.t. image processing refactor (Apr 19, 2024)
019f473  Merge branch 'upstream' into openai-vision-api (Apr 19, 2024)
f882d99  Fix circular import and set return type (Apr 19, 2024)
Fix circular import and set return type
- These changes are propagated to the child PRs
DarkLight1337 committed Apr 19, 2024
commit f882d99e528fd55062ab7012918ba6a0067f1bb5
vllm/model_executor/models/llava.py (2 changes: 1 addition & 1 deletion)

@@ -40,7 +40,7 @@ def __init__(self, vision_hidden_size: int, text_hidden_size: int,
                                   text_hidden_size,
                                   bias=True)

-    def forward(self, image_features: torch.Tensor):
+    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
         hidden_states = self.linear_1(image_features)
         hidden_states = self.act(hidden_states)
         hidden_states = self.linear_2(hidden_states)
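The only change in this file is the explicit return annotation on forward. For orientation, a minimal sketch of a LLaVA-style projector module built around that method is shown below; the GELU activation and constructor wiring are assumptions, since the diff only exposes a few context lines.

```python
# Minimal sketch of a LLaVA-style multimodal projector, reconstructed around
# the annotated forward() above. The GELU activation and constructor wiring
# are assumptions; only the lines shown in the diff come from the PR.
import torch
import torch.nn as nn


class MultiModalProjectorSketch(nn.Module):
    """Maps vision-encoder features into the language model's hidden space."""

    def __init__(self, vision_hidden_size: int, text_hidden_size: int) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size, bias=True)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size, bias=True)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (num_image_tokens, vision_hidden_size)
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states
```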
vllm/sequence.py (18 changes: 9 additions & 9 deletions)

@@ -10,13 +10,13 @@
 from PIL import Image

 from vllm.block import LogicalTokenBlock
-from vllm.config import ModelConfig, VisionLanguageConfig
 from vllm.logger import init_logger
 from vllm.lora.request import LoRARequest
 from vllm.sampling_params import SamplingParams
 from vllm.transformers_utils.image_processor import cached_get_image_processor

 if TYPE_CHECKING:
+    from vllm.config import ModelConfig, VisionLanguageConfig
     from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics

 logger = init_logger(__name__)

@@ -385,8 +385,8 @@ class MultiModalData(ABC):

     @abstractmethod
     def get_input_kwargs(
-            self, model_config: ModelConfig,
-            vlm_config: VisionLanguageConfig) -> Dict[str, torch.Tensor]:
+            self, model_config: "ModelConfig",
+            vlm_config: "VisionLanguageConfig") -> Dict[str, torch.Tensor]:
         """Returns a dictionary which are passed as keyword arguments to
         :meth:`torch.nn.Module.forward`.
         """

@@ -401,8 +401,8 @@ def __init__(self, image: Image.Image) -> None:

         self.image = image

-    def _get_image_processor(self, model_config: ModelConfig,
-                              vlm_config: VisionLanguageConfig):
+    def _get_image_processor(self, model_config: "ModelConfig",
+                              vlm_config: "VisionLanguageConfig"):
         if vlm_config is None or vlm_config.image_processor is None:
             return None

@@ -413,8 +413,8 @@ def _get_image_processor(self, model_config: ModelConfig,
         )

     def get_input_kwargs(
-            self, model_config: ModelConfig,
-            vlm_config: VisionLanguageConfig) -> Dict[str, torch.Tensor]:
+            self, model_config: "ModelConfig",
+            vlm_config: "VisionLanguageConfig") -> Dict[str, torch.Tensor]:
         # Temporary patch to make LLaVA-NeXT usable
         _, _, h, w = vlm_config.image_input_shape
         image = self.image.resize((w, h))

@@ -444,8 +444,8 @@ def __init__(self, image_features: torch.Tensor) -> None:
         self.image_features = image_features

     def get_input_kwargs(
-            self, model_config: ModelConfig,
-            vlm_config: VisionLanguageConfig) -> Dict[str, torch.Tensor]:
+            self, model_config: "ModelConfig",
+            vlm_config: "VisionLanguageConfig") -> Dict[str, torch.Tensor]:
         return {"image_features": self.image_features}

