[Core] Registry for processing model inputs #5214

Merged (63 commits) on Jun 28, 2024
Changes shown from 1 commit.

Commits (63)
34bfa79
Introduce a higher level `INPUT_REGISTRY`
DarkLight1337 Jun 3, 2024
df2aa19
Move dummy data generation to input registry
DarkLight1337 Jun 3, 2024
c72d2b3
Update docs
DarkLight1337 Jun 3, 2024
d8c6488
Rename `process_input` to `map_input`
DarkLight1337 Jun 3, 2024
f18de48
Reorder arguments
DarkLight1337 Jun 3, 2024
653537d
Apply input processor
DarkLight1337 Jun 3, 2024
a2f5a3c
Remove `VisionLanguageConfig` from input mapper
DarkLight1337 Jun 3, 2024
378ad80
Fix bad use of `functools.partial`
DarkLight1337 Jun 3, 2024
7aa3778
Use default input processor
DarkLight1337 Jun 3, 2024
c774168
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 4, 2024
532f863
Fix wrong arguments
DarkLight1337 Jun 4, 2024
080d40c
Use pillow image instead of tensor to avoid bypassing the processor b…
DarkLight1337 Jun 5, 2024
662693a
Update interface of dummy data factory and input processor
DarkLight1337 Jun 5, 2024
9bc5fcc
Use `InputContext` to handle checked type cast of config types
DarkLight1337 Jun 5, 2024
29c3bb3
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
7bb6cbf
Add sanity check
DarkLight1337 Jun 6, 2024
ccf49c4
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
3482d32
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
8ea8468
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
be3d64f
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
2ff5be6
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 10, 2024
8e2ff86
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
553f684
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
b134dfc
Update name
DarkLight1337 Jun 11, 2024
7e33706
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
3fb622c
Remove `MULTIMODAL` convenience property as it was causing some (impo…
DarkLight1337 Jun 11, 2024
6a70e4f
Add docs
DarkLight1337 Jun 12, 2024
52a0116
Add docs
DarkLight1337 Jun 12, 2024
b7a8683
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 12, 2024
25f9949
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 13, 2024
fd7d954
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
49dac3e
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
0104218
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
18cc7e0
Set up dummy data factory for phi3v
DarkLight1337 Jun 18, 2024
2291617
Move dummy data factories to model files
DarkLight1337 Jun 18, 2024
adf5503
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
fecf1f0
Fix wrong feature size
DarkLight1337 Jun 18, 2024
086e0fe
Fix wrong feature size
DarkLight1337 Jun 18, 2024
c036b86
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 24, 2024
bfa5aa9
Remove redundant code
DarkLight1337 Jun 24, 2024
07e695d
Apply isort
DarkLight1337 Jun 24, 2024
7229b07
Move `DummyImageDataFactories` into CLIP model file
DarkLight1337 Jun 25, 2024
d9a4150
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
4b947ad
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
9e82a26
Clarify docs and add todo
DarkLight1337 Jun 26, 2024
6b19e6c
Expand docs
DarkLight1337 Jun 26, 2024
f451668
Add ref
DarkLight1337 Jun 26, 2024
1abb8a7
Add docs
DarkLight1337 Jun 26, 2024
698830f
Fix name
DarkLight1337 Jun 26, 2024
36ab12d
Fix and add links
DarkLight1337 Jun 26, 2024
bf3281c
modify llava_next
ywang96 Jun 27, 2024
56e2d3b
Update comment
DarkLight1337 Jun 27, 2024
d2f8c6d
Update docs
DarkLight1337 Jun 27, 2024
7c197d2
Use dynamic image feature size calculation
DarkLight1337 Jun 27, 2024
f5ffd3e
Fix phi3v not handling `image_sizes` correctly
DarkLight1337 Jun 27, 2024
66aad21
Apply formatter
DarkLight1337 Jun 27, 2024
f2e4633
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
a6e3162
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
ce06541
Fix config
DarkLight1337 Jun 27, 2024
7e80ecc
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
487d742
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
43350b8
update example
ywang96 Jun 28, 2024
57791de
update doc
ywang96 Jun 28, 2024
Move DummyImageDataFactories into CLIP model file
DarkLight1337 committed Jun 25, 2024
commit 7229b076acccfd9db13827711adb39720193dc48
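
This commit relocates the dummy-data helpers out of `vllm/multimodal/image.py` and into the CLIP model file, turning the `DummyImageDataFactories` classmethods into plain module-level functions. Model files then import them directly, as the diffs below show. A minimal sketch of the new import (the model files themselves use the equivalent relative form `from .clip import ...`):

# After this commit the helpers live beside the CLIP model code and are
# imported as plain functions rather than classmethods.
from vllm.model_executor.models.clip import (dummy_feature_data_for_clip,
                                             dummy_pixel_data_for_clip,
                                             dummy_seq_data_for_clip)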
vllm/model_executor/models/clip.py (69 changes: 65 additions, 4 deletions)

@@ -4,19 +4,80 @@

 import torch
 import torch.nn as nn
+from PIL import Image
 from transformers import CLIPVisionConfig
 from transformers.models.clip.modeling_clip import CLIPAttention

 from vllm.model_executor.layers.activation import get_act_fn
 from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                                RowParallelLinear)
-from vllm.model_executor.layers.quantization.base_config import (
-    QuantizationConfig)
+from vllm.model_executor.layers.quantization import QuantizationConfig
+from vllm.multimodal.image import ImageFeatureData, ImagePixelData
+from vllm.sequence import SequenceData


-def get_clip_num_patches(*, image_size: int, patch_size: int) -> int:
+def get_clip_patch_grid_length(*, image_size: int, patch_size: int) -> int:
     assert image_size % patch_size == 0
-    return (image_size // patch_size)**2
+    return image_size // patch_size
+
+
+def get_clip_num_patches(*, image_size: int, patch_size: int) -> int:
+    grid_length = get_clip_patch_grid_length(image_size=image_size,
+                                             patch_size=patch_size)
+    return grid_length * grid_length
+
+
+def get_clip_image_feature_size(hf_config: CLIPVisionConfig) -> int:
+    return get_clip_num_patches(image_size=hf_config.image_size,
+                                patch_size=hf_config.patch_size)
+
+
+def dummy_seq_data_for_clip(
+    hf_config: CLIPVisionConfig,
+    seq_len: int,
+    *,
+    image_token_id: int,
+    image_feature_size_override: Optional[int] = None,
+):
+    if image_feature_size_override is None:
+        image_feature_size = get_clip_image_feature_size(hf_config)
+    else:
+        image_feature_size = image_feature_size_override
+
+    token_ids = [image_token_id] * image_feature_size
+    token_ids += [0] * (seq_len - image_feature_size)
+    return SequenceData(token_ids)
+
+
+def dummy_pixel_data_for_clip(
+    hf_config: CLIPVisionConfig,
+    *,
+    image_width_override: Optional[int] = None,
+    image_height_override: Optional[int] = None,
+):
+    width = height = hf_config.image_size
+    if image_width_override is not None:
+        width = image_width_override
+    if image_height_override is not None:
+        height = image_height_override
+
+    image = Image.new("RGB", (width, height), color=0)
+    return ImagePixelData(image)
+
+
+def dummy_feature_data_for_clip(
+    hf_config: CLIPVisionConfig,
+    *,
+    image_feature_size_override: Optional[int] = None,
+):
+    if image_feature_size_override is None:
+        image_feature_size = get_clip_image_feature_size(hf_config)
+    else:
+        image_feature_size = image_feature_size_override
+
+    values = torch.zeros((1, image_feature_size, hf_config.hidden_size),
+                         dtype=torch.float16)
+    return ImageFeatureData(values)


 # Adapted from https://github.com/huggingface/transformers/blob/v4.39.0/src/transformers/models/clip/modeling_clip.py#L164 # noqa
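
The arithmetic in these helpers is easy to check by hand. A quick standalone sketch using the numbers for CLIP ViT-L/14 at 336 px (image_size=336, patch_size=14), the vision tower used by the models touched in this PR; the sequence length and image token id below are illustrative values, not taken from this diff:

# Standalone sketch of the helper arithmetic (no vLLM imports needed).
image_size, patch_size = 336, 14           # CLIP ViT-L/14-336
assert image_size % patch_size == 0
grid_length = image_size // patch_size     # get_clip_patch_grid_length -> 24
num_patches = grid_length * grid_length    # get_clip_num_patches       -> 576

# dummy_seq_data_for_clip then front-loads one image token per patch and
# zero-pads the rest of the sequence.
seq_len, image_token_id = 2048, 32000      # illustrative values
token_ids = [image_token_id] * num_patches + [0] * (seq_len - num_patches)
assert len(token_ids) == seq_len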
vllm/model_executor/models/llava.py (11 changes: 5 additions, 6 deletions)

@@ -18,9 +18,10 @@
 from vllm.model_executor.models.llama import LlamaModel
 from vllm.model_executor.sampling_metadata import SamplingMetadata
 from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalData
-from vllm.multimodal.image import DummyImageDataFactories
 from vllm.sequence import SamplerOutput

+from .clip import (dummy_feature_data_for_clip, dummy_pixel_data_for_clip,
+                   dummy_seq_data_for_clip)
 from .vlm_base import VisionLanguageModelBase

 _KEYS_TO_MODIFY_MAPPING = {

@@ -90,7 +91,7 @@ def dummy_data_for_llava(ctx: InputContext, seq_len: int):
     vision_config = hf_config.vision_config

     if isinstance(vision_config, CLIPVisionConfig):
-        seq_data = DummyImageDataFactories.dummy_seq_data_for_clip(
+        seq_data = dummy_seq_data_for_clip(
             vision_config,
             seq_len,
             image_token_id=hf_config.image_token_index,

@@ -100,11 +101,9 @@ def dummy_data_for_llava(ctx: InputContext, seq_len: int):
         ImageInputType = VisionLanguageConfig.ImageInputType
         mm_data: MultiModalData
         if image_input_type == ImageInputType.PIXEL_VALUES:
-            mm_data = DummyImageDataFactories.dummy_pixel_data_for_clip(
-                vision_config)
+            mm_data = dummy_pixel_data_for_clip(vision_config)
         elif image_input_type == ImageInputType.IMAGE_FEATURES:
-            mm_data = DummyImageDataFactories.dummy_feature_data_for_clip(
-                vision_config)
+            mm_data = dummy_feature_data_for_clip(vision_config)

         return seq_data, mm_data
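
For context, `dummy_data_for_llava` is not called by the model itself; it is registered through the `INPUT_REGISTRY` that this PR introduces, so the engine can build worst-case dummy inputs when profiling memory. A rough wiring sketch, with the decorator name assumed from the registry added earlier in this PR rather than taken from this diff:

# Illustrative registration sketch; not part of this commit's diff.
from vllm.inputs import INPUT_REGISTRY

@INPUT_REGISTRY.register_dummy_data(dummy_data_for_llava)  # factory defined above
class LlavaForConditionalGeneration(VisionLanguageModelBase):
    ...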
vllm/model_executor/models/llava_next.py (16 changes: 10 additions, 6 deletions)

@@ -22,10 +22,11 @@
 from vllm.model_executor.models.llama import LlamaModel
 from vllm.model_executor.sampling_metadata import SamplingMetadata
 from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalData
-from vllm.multimodal.image import (DummyImageDataFactories, ImagePixelData,
-                                   get_clip_num_patches)
+from vllm.multimodal.image import ImagePixelData
 from vllm.sequence import SamplerOutput

+from .clip import (dummy_feature_data_for_clip, dummy_pixel_data_for_clip,
+                   dummy_seq_data_for_clip, get_clip_patch_grid_length)
 from .llava import LlavaMultiModalProjector, merge_vision_embeddings
 from .vlm_base import VisionLanguageModelBase


@@ -93,7 +94,10 @@ def _get_llava_next_image_feature_size(
     vision_config = hf_config.vision_config

     if isinstance(vision_config, CLIPVisionConfig):
-        num_patches = get_clip_num_patches(vision_config)
+        num_patches = get_clip_patch_grid_length(
+            image_size=vision_config.image_size,
+            patch_size=vision_config.patch_size,
+        )
         base_feature_size = num_patches * num_patches

         num_patch_height, num_patch_width = get_anyres_image_grid_shape(

@@ -127,7 +131,7 @@ def dummy_data_for_llava_next(ctx: InputContext, seq_len: int):
         hf_config, input_height=dummy_height, input_width=dummy_width)

     if isinstance(vision_config, CLIPVisionConfig):
-        seq_data = DummyImageDataFactories.dummy_seq_data_for_clip(
+        seq_data = dummy_seq_data_for_clip(
             vision_config,
             seq_len,
             image_token_id=hf_config.image_token_index,

@@ -138,13 +142,13 @@ def dummy_data_for_llava_next(ctx: InputContext, seq_len: int):
         ImageInputType = VisionLanguageConfig.ImageInputType
         mm_data: MultiModalData
         if image_input_type == ImageInputType.PIXEL_VALUES:
-            mm_data = DummyImageDataFactories.dummy_pixel_data_for_clip(
+            mm_data = dummy_pixel_data_for_clip(
                 vision_config,
                 image_width_override=dummy_width,
                 image_height_override=dummy_height,
             )
         elif image_input_type == ImageInputType.IMAGE_FEATURES:
-            mm_data = DummyImageDataFactories.dummy_feature_data_for_clip(
+            mm_data = dummy_feature_data_for_clip(
                 vision_config,
                 image_feature_size_override=image_feature_size,
             )
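
The switch from `get_clip_num_patches(vision_config)` to `get_clip_patch_grid_length(...)` matters here because LLaVA-NeXT needs the per-side grid length, not the total patch count, before combining it with the anyres image grid. A small sketch of the base-feature-size step using ViT-L/14-336 numbers:

# Base feature size of the 336x336 CLIP tower, before the anyres grid
# from get_anyres_image_grid_shape scales it up.
from vllm.model_executor.models.clip import get_clip_patch_grid_length

grid_length = get_clip_patch_grid_length(image_size=336, patch_size=14)  # 24
base_feature_size = grid_length * grid_length                            # 576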
vllm/model_executor/models/phi3v.py (8 changes: 5 additions, 3 deletions)

@@ -36,9 +36,11 @@
 from vllm.model_executor.models.vlm_base import VisionLanguageModelBase
 from vllm.model_executor.sampling_metadata import SamplingMetadata
 from vllm.multimodal import MULTIMODAL_REGISTRY
-from vllm.multimodal.image import DummyImageDataFactories, ImagePixelData
+from vllm.multimodal.image import ImagePixelData
 from vllm.sequence import SamplerOutput

+from .clip import dummy_pixel_data_for_clip, dummy_seq_data_for_clip
+
 logger = init_logger(__name__)

 _KEYS_TO_MODIFY_MAPPING = {

@@ -275,13 +277,13 @@ class Phi3VImagePixelInputs(TypedDict):

 def dummy_data_for_phi3v(ctx: InputContext, seq_len: int):
-    seq_data = DummyImageDataFactories.dummy_seq_data_for_clip(
+    seq_data = dummy_seq_data_for_clip(
         CLIP_VIT_LARGE_PATCH14_336_CONFIG,
         seq_len,
         image_token_id=32044,
         image_feature_size_override=1921,
     )
-    mm_data = DummyImageDataFactories.dummy_pixel_data_for_clip(
+    mm_data = dummy_pixel_data_for_clip(
         CLIP_VIT_LARGE_PATCH14_336_CONFIG,
         image_width_override=1344,
         image_height_override=1008,
     )
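
Unlike the LLaVA models, phi3v passes hard-coded overrides: the fixed CLIP ViT-L/14-336 config, image token id 32044, a feature size of 1921, and a 1344x1008 dummy image. A pure-Python sketch of the dummy token sequence this produces (the sequence length here is an arbitrary example):

# Sketch of the dummy sequence dummy_seq_data_for_clip builds for phi3v:
# 1921 image tokens followed by zero padding up to the sequence length.
seq_len = 4096                  # illustrative value
image_token_id = 32044          # from the call above
image_feature_size = 1921       # override from the call above
token_ids = [image_token_id] * image_feature_size
token_ids += [0] * (seq_len - image_feature_size)
assert len(token_ids) == seq_len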
vllm/multimodal/image.py (74 changes: 3 additions, 71 deletions)

@@ -1,87 +1,19 @@
 from functools import lru_cache
-from typing import Dict, Optional, Type, Union
+from typing import Dict, Type, Union

 import torch
 from PIL import Image
 from transformers import CLIPVisionConfig

 from vllm.config import ModelConfig
 from vllm.inputs.registry import InputContext
 from vllm.logger import init_logger
-from vllm.model_executor.models.clip import get_clip_num_patches
-from vllm.sequence import SequenceData
 from vllm.transformers_utils.image_processor import get_image_processor

 from .base import MultiModalData, MultiModalPlugin

 logger = init_logger(__name__)

-_cached_get_image_processor = lru_cache(get_image_processor)
-
-
-def get_clip_image_feature_size(hf_config: CLIPVisionConfig) -> int:
-    return get_clip_num_patches(image_size=hf_config.image_size,
-                                patch_size=hf_config.patch_size)
-
-
-class DummyImageDataFactories:
-    """
-    Contains factories for dummy image data factories.
-
-    See Also:
-        :data:`vllm.inputs.registry.DummyDataFactory`
-    """
-
-    @classmethod
-    def dummy_seq_data_for_clip(
-        cls,
-        hf_config: CLIPVisionConfig,
-        seq_len: int,
-        *,
-        image_token_id: int,
-        image_feature_size_override: Optional[int] = None,
-    ):
-        if image_feature_size_override is None:
-            image_feature_size = get_clip_image_feature_size(hf_config)
-        else:
-            image_feature_size = image_feature_size_override
-
-        token_ids = [image_token_id] * image_feature_size
-        token_ids += [0] * (seq_len - image_feature_size)
-        return SequenceData(token_ids)
-
-    @classmethod
-    def dummy_pixel_data_for_clip(
-        cls,
-        hf_config: CLIPVisionConfig,
-        *,
-        image_width_override: Optional[int] = None,
-        image_height_override: Optional[int] = None,
-    ):
-        width = height = hf_config.image_size
-        if image_width_override is not None:
-            width = image_width_override
-        if image_height_override is not None:
-            height = image_height_override
-
-        image = Image.new("RGB", (width, height), color=0)
-        return ImagePixelData(image)
-
-    @classmethod
-    def dummy_feature_data_for_clip(
-        cls,
-        hf_config: CLIPVisionConfig,
-        *,
-        image_feature_size_override: Optional[int] = None,
-    ):
-        if image_feature_size_override is None:
-            image_feature_size = get_clip_image_feature_size(hf_config)
-        else:
-            image_feature_size = image_feature_size_override
-
-        values = torch.zeros((1, image_feature_size, hf_config.hidden_size),
-                             dtype=torch.float16)
-        return ImageFeatureData(values)
+cached_get_image_processor = lru_cache(get_image_processor)


 class ImagePixelData(MultiModalData):

@@ -120,7 +52,7 @@ def _get_hf_image_processor(self, model_config: ModelConfig):
         if vlm_config is None or vlm_config.image_processor is None:
             return None

-        return _cached_get_image_processor(
+        return cached_get_image_processor(
             vlm_config.image_processor,
             trust_remote_code=model_config.trust_remote_code,
             revision=vlm_config.image_processor_revision,
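
What remains in `image.py` after the move is the processor cache, renamed from `_cached_get_image_processor` to the public `cached_get_image_processor`, presumably so other modules can import it. The pattern is simply `functools.lru_cache` wrapped around a loader; a standalone sketch with an illustrative loader (not the vLLM API):

# Illustrative caching sketch: repeated requests for the same processor
# name and revision return the cached instance instead of reloading it.
from functools import lru_cache
from typing import Optional

def load_image_processor(name: str, revision: Optional[str] = None):
    print(f"loading {name} (revision={revision})")  # stand-in loader
    return object()

cached_load_image_processor = lru_cache(load_image_processor)
cached_load_image_processor("openai/clip-vit-large-patch14-336")  # loads
cached_load_image_processor("openai/clip-vit-large-patch14-336")  # cache hit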