[Core] Dynamic image size support for VLMs #5276

Merged Jul 3, 2024 (242 commits)
242 commits
34bfa79
Introduce a higher level `INPUT_REGISTRY`
DarkLight1337 Jun 3, 2024
df2aa19
Move dummy data generation to input registry
DarkLight1337 Jun 3, 2024
c72d2b3
Update docs
DarkLight1337 Jun 3, 2024
d8c6488
Rename `process_input` to `map_input`
DarkLight1337 Jun 3, 2024
f18de48
Reorder arguments
DarkLight1337 Jun 3, 2024
653537d
Apply input processor
DarkLight1337 Jun 3, 2024
a2f5a3c
Remove `VisionLanguageConfig` from input mapper
DarkLight1337 Jun 3, 2024
378ad80
Fix bad use of `functools.partial`
DarkLight1337 Jun 3, 2024
7aa3778
Use default input processor
DarkLight1337 Jun 3, 2024
c774168
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 4, 2024
532f863
Fix wrong arguments
DarkLight1337 Jun 4, 2024
080d40c
Use pillow image instead of tensor to avoid bypassing the processor b…
DarkLight1337 Jun 5, 2024
662693a
Update interface of dummy data factory and input processor
DarkLight1337 Jun 5, 2024
9bc5fcc
Use `InputContext` to handle checked type cast of config types
DarkLight1337 Jun 5, 2024
911cac7
Add input processor for injecting image tokens; fix docs
DarkLight1337 Jun 5, 2024
a38b347
Add new documentation pages
DarkLight1337 Jun 5, 2024
29c3bb3
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
9cfbcce
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
7bb6cbf
Add sanity check
DarkLight1337 Jun 6, 2024
ccf49c4
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
3482d32
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
8ea8468
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
be3d64f
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
2ff5be6
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 10, 2024
8e2ff86
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
553f684
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
b134dfc
Update name
DarkLight1337 Jun 11, 2024
1efa480
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
1a08444
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
7e33706
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
cfc31fd
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
3fb622c
Remove `MULTIMODAL` convenience property as it was causing some (impo…
DarkLight1337 Jun 11, 2024
da85ab2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
383bea1
Update docs
DarkLight1337 Jun 11, 2024
80a09f2
Remove double processing of image tokens
DarkLight1337 Jun 12, 2024
6a70e4f
Add docs
DarkLight1337 Jun 12, 2024
8322ecb
Add docs
DarkLight1337 Jun 12, 2024
52a0116
Add docs
DarkLight1337 Jun 12, 2024
c1733dd
Add docs
DarkLight1337 Jun 12, 2024
b7a8683
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 12, 2024
9fb5e72
Remove more instances of double processing; update docs
DarkLight1337 Jun 13, 2024
25f9949
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 13, 2024
03c7e65
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 13, 2024
3932b3f
Remove xfail
DarkLight1337 Jun 13, 2024
7fa877a
Fix missing image token in OpenAI API serving
DarkLight1337 Jun 13, 2024
092e550
Fix LLaVA-NeXT test
DarkLight1337 Jun 14, 2024
7a19862
Remove duplicate processing in async engine
DarkLight1337 Jun 14, 2024
fd7d954
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
49dac3e
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
b2c6832
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 15, 2024
0104218
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
18cc7e0
Set up dummy data factory for phi3v
DarkLight1337 Jun 18, 2024
2291617
Move dummy data factories to model files
DarkLight1337 Jun 18, 2024
adf5503
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
e5a94e4
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 18, 2024
9b0386d
Move input processors to model files
DarkLight1337 Jun 18, 2024
4e656e7
Set up input processor for phi3v
DarkLight1337 Jun 18, 2024
fecf1f0
Fix wrong feature size
DarkLight1337 Jun 18, 2024
086e0fe
Fix wrong feature size
DarkLight1337 Jun 18, 2024
8c26a18
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 19, 2024
81522fe
Fix wrong feature size
DarkLight1337 Jun 19, 2024
c036b86
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 24, 2024
f75e1ab
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
b24e8d9
Update validation
DarkLight1337 Jun 24, 2024
8569d35
Fix image feature calculation for phi3v
DarkLight1337 Jun 24, 2024
bfa5aa9
Remove redundant code
DarkLight1337 Jun 24, 2024
dc34121
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
07e695d
Apply isort
DarkLight1337 Jun 24, 2024
8a43a77
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
825401d
Apply yapf
DarkLight1337 Jun 24, 2024
4a0d4d1
Reduce `max_tokens` so that test still passes
DarkLight1337 Jun 25, 2024
8d22fe0
Fix vllm to hf output (+ rename)
DarkLight1337 Jun 25, 2024
2e1ee2f
Fix wrong arguments
DarkLight1337 Jun 25, 2024
7229b07
Move `DummyImageDataFactories` into CLIP model file
DarkLight1337 Jun 25, 2024
17800fd
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 25, 2024
50f994b
Move `input_processor_for_clip` into CLIP
DarkLight1337 Jun 25, 2024
838aa9b
Remove some magic numbers
DarkLight1337 Jun 25, 2024
e7a5564
Test multiscale inputs for LLaVA-NeXT
DarkLight1337 Jun 25, 2024
36e8001
Handle multiscale inputs (different number of patches per batch) in L…
DarkLight1337 Jun 25, 2024
39e6d42
Fix wrong feature size
DarkLight1337 Jun 26, 2024
0d7f18f
Apply formatter
DarkLight1337 Jun 26, 2024
8e5dc7c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
d9a4150
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
6849236
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
6d02491
Revert max_tokens
DarkLight1337 Jun 26, 2024
76ddea4
Add more tests for input mapper
DarkLight1337 Jun 26, 2024
4b20e66
Sanity check: Also test multiscale inputs for LLaVA-1.5
DarkLight1337 Jun 26, 2024
784af1a
Do not auto-convert image dtype to model's dtype
DarkLight1337 Jun 26, 2024
8e5fb12
Update prompts
DarkLight1337 Jun 26, 2024
4b947ad
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
e7397ee
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
865be7a
Fix mapper tests w.r.t. dtype change
DarkLight1337 Jun 26, 2024
9e82a26
Clarify docs and add todo
DarkLight1337 Jun 26, 2024
46391de
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
a4733f9
Remove TODO since vision config will be removed soon
DarkLight1337 Jun 26, 2024
6b19e6c
Expand docs
DarkLight1337 Jun 26, 2024
be326f2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
f451668
Add ref
DarkLight1337 Jun 26, 2024
5c0c8cf
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
3d7b795
Update docs
DarkLight1337 Jun 26, 2024
1abb8a7
Add docs
DarkLight1337 Jun 26, 2024
428d420
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
698830f
Fix name
DarkLight1337 Jun 26, 2024
ac9ea9a
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
334b1a9
Add `MultiModalInputs` to docs
DarkLight1337 Jun 26, 2024
36ab12d
Fix and add links
DarkLight1337 Jun 26, 2024
af01e97
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
c303421
Fix `is_multiscale` not provided anymore
DarkLight1337 Jun 26, 2024
0a0c0e3
Also test multiscale input for phi3v
DarkLight1337 Jun 26, 2024
60517a7
Revert max_tokens for phi3v as numerical error still persists
DarkLight1337 Jun 26, 2024
57df434
Improve error message
DarkLight1337 Jun 26, 2024
ffe0675
Log the full output for easier reference
DarkLight1337 Jun 26, 2024
4f7b210
[VLM] Remove support for pixel_values and image_features.
xwjiang2010 Jun 25, 2024
c7a2a66
Update xfail to be more efficient
DarkLight1337 Jun 26, 2024
598e0e3
Also xfail llava test
DarkLight1337 Jun 26, 2024
174ca90
address comments
xwjiang2010 Jun 26, 2024
5b3e9aa
remove image_input_type altogether.
xwjiang2010 Jun 26, 2024
b7acf3a
types
xwjiang2010 Jun 26, 2024
f22b219
format
xwjiang2010 Jun 26, 2024
f84d87a
Update comment
DarkLight1337 Jun 27, 2024
5dfb6fc
Update docs
DarkLight1337 Jun 27, 2024
bbeff03
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
bf3281c
modify llava_next
ywang96 Jun 27, 2024
56e2d3b
Update comment
DarkLight1337 Jun 27, 2024
d2f8c6d
Update docs
DarkLight1337 Jun 27, 2024
7c197d2
Use dynamic image feature size calculation
DarkLight1337 Jun 27, 2024
f5ffd3e
Fix phi3v not handling `image_sizes` correctly
DarkLight1337 Jun 27, 2024
66aad21
Apply formatter
DarkLight1337 Jun 27, 2024
d1c68c0
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
5f32d53
Add see also
DarkLight1337 Jun 27, 2024
15df4ef
Update examples prompt format
DarkLight1337 Jun 27, 2024
f2e4633
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
095e008
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
a6e3162
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
28922af
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
ce06541
Fix config
DarkLight1337 Jun 27, 2024
cdcc2d4
Fix config
DarkLight1337 Jun 27, 2024
4212abf
Update docs
DarkLight1337 Jun 27, 2024
07c08e3
Update docs
DarkLight1337 Jun 27, 2024
f3f5854
Fix `MultiModalInputs` not working in Python 3.8
DarkLight1337 Jun 27, 2024
bebf9e7
Fix `_ImageAssets` not working in Python 3.8
DarkLight1337 Jun 27, 2024
7e80ecc
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
487d742
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
36f72b6
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
43350b8
update example
ywang96 Jun 28, 2024
57791de
update doc
ywang96 Jun 28, 2024
b2b1e11
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
fbc5f70
Update docs
DarkLight1337 Jun 28, 2024
4292ccb
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
5d23a96
Apply formatter
DarkLight1337 Jun 28, 2024
78064e0
Fix OpenAI server not working for phi3v
DarkLight1337 Jun 28, 2024
4cb809c
Preemptively handle upcoming models
DarkLight1337 Jun 28, 2024
754e238
Add more models
DarkLight1337 Jun 28, 2024
9edb53c
Update feature size for dummy data
DarkLight1337 Jun 28, 2024
91d6c1e
Merge branch 'main' of https://github.com/vllm-project/vllm into remo…
xwjiang2010 Jun 28, 2024
f84b793
format
xwjiang2010 Jun 28, 2024
a934663
ExternalMultiModalDataDict
xwjiang2010 Jun 28, 2024
2144d3a
mention schema
xwjiang2010 Jun 28, 2024
2795b16
Use a less strict check
DarkLight1337 Jun 29, 2024
86ffd60
Fix phi3v test
DarkLight1337 Jun 29, 2024
f339dd1
Update default length as the dummy image feature size is increased
DarkLight1337 Jun 29, 2024
59a7a4c
Raise full error if output is completely different
DarkLight1337 Jun 29, 2024
62952e1
Fix phi3v not using input processor
DarkLight1337 Jun 29, 2024
0ce3ecb
Move size factors outside
DarkLight1337 Jun 29, 2024
b43e8c3
Apply formatter
DarkLight1337 Jun 29, 2024
9023794
Fix some outputs not being checked
DarkLight1337 Jun 29, 2024
fc5549c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
f6c8061
Also test no image
DarkLight1337 Jun 30, 2024
15cc847
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
235c8a9
Batch by size factors
DarkLight1337 Jun 30, 2024
b98d924
Factor out xfail code
DarkLight1337 Jun 30, 2024
2c2558b
Fix unused args
DarkLight1337 Jun 30, 2024
ec28eca
Check logprobs instead of xfailing
DarkLight1337 Jun 30, 2024
5a337f5
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
2eb3490
Fix different scales not being in the same batch
DarkLight1337 Jun 30, 2024
6301a52
Apply suggestions from code review
DarkLight1337 Jun 30, 2024
14f10fc
Add link
DarkLight1337 Jun 30, 2024
7c335c3
Use `self.multi_modal_projector` directly
DarkLight1337 Jun 30, 2024
33c860e
Allow users to send image token formatted prompt directly
DarkLight1337 Jun 30, 2024
e03bc57
Factor out the code for placeholder token IDs
DarkLight1337 Jun 30, 2024
b270ac3
Remove `-rx` flag
DarkLight1337 Jun 30, 2024
3161221
Fix distributed tests
DarkLight1337 Jun 30, 2024
85d108a
Fix string mismatch warning
DarkLight1337 Jun 30, 2024
d648e32
Relax phi3v test; add TODO for llava tests
DarkLight1337 Jun 30, 2024
fde5f26
Fix distributed tests
DarkLight1337 Jun 30, 2024
d432934
address comments
xwjiang2010 Jul 1, 2024
83cfada
Merge branch 'main' of https://github.com/vllm-project/vllm into remo…
xwjiang2010 Jul 1, 2024
ab347bc
format
xwjiang2010 Jul 1, 2024
404700f
rm ctx
xwjiang2010 Jul 1, 2024
6a4014e
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 1, 2024
95a1fc5
Fix distributed test
DarkLight1337 Jul 1, 2024
1e87823
Update docs about prompt formatting
DarkLight1337 Jul 1, 2024
55ab3e4
Remove unused parameter
DarkLight1337 Jul 1, 2024
21da5b8
Remove unused import
DarkLight1337 Jul 1, 2024
525fe8f
Fix distributed test
DarkLight1337 Jul 1, 2024
04ebb68
rm ImageData and MultiModalData
xwjiang2010 Jul 1, 2024
31b8b09
rm external
xwjiang2010 Jul 1, 2024
a4b5617
comments
xwjiang2010 Jul 1, 2024
045674d
fix dist gpu test.
xwjiang2010 Jul 1, 2024
c8fa150
address comments
xwjiang2010 Jul 2, 2024
58ab8e9
Further avoid cuda init
DarkLight1337 Jul 2, 2024
6975caa
Add warnings for repeated image tokens
DarkLight1337 Jul 2, 2024
b1f1813
docs
xwjiang2010 Jul 2, 2024
b8b636d
Update vllm/multimodal/base.py
xwjiang2010 Jul 2, 2024
2c1d291
format
xwjiang2010 Jul 2, 2024
b6401d3
Reword
DarkLight1337 Jul 2, 2024
0f6f64c
Merge branch 'remove_image_features_2' of https://github.com/xwjiang2…
DarkLight1337 Jul 2, 2024
89f1103
Remove useless test
DarkLight1337 Jul 2, 2024
47fbdba
Unify test API between HfRunner and VllmRunner
DarkLight1337 Jul 2, 2024
c1c5a4d
Fix import error
DarkLight1337 Jul 2, 2024
fde4b25
Fix attribute error
DarkLight1337 Jul 2, 2024
4278fed
fix import error
ywang96 Jul 2, 2024
d9a2908
update llava next example
ywang96 Jul 2, 2024
d61e8af
Merge branch 'remove_image_features_2' of https://github.com/xwjiang2…
DarkLight1337 Jul 2, 2024
abd56fc
Update comments
DarkLight1337 Jul 2, 2024
ce2516e
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
38042ab
Remove some unnecessary deferred imports
DarkLight1337 Jul 2, 2024
7a6d895
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
9a49d2c
Use more precise type annotation
DarkLight1337 Jul 2, 2024
ac6f4fa
Fix wrong feature size
DarkLight1337 Jul 2, 2024
3f95778
Fix wrong image
DarkLight1337 Jul 2, 2024
90e80c4
Remove unnecessary lazy import
DarkLight1337 Jul 2, 2024
ea622c7
Check for conflicting kwargs in `map_input`
DarkLight1337 Jul 2, 2024
18740c2
Avoid unnecessary processing
DarkLight1337 Jul 2, 2024
a0db2c7
Update doc
DarkLight1337 Jul 2, 2024
526a871
Avoid cuda init
DarkLight1337 Jul 2, 2024
a5174da
Remove unused logger
DarkLight1337 Jul 2, 2024
6cf34e4
Remove unnecessary deferred imports
DarkLight1337 Jul 2, 2024
feff395
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
aacb5d0
Fix typo
DarkLight1337 Jul 2, 2024
13f43bd
Address comments
DarkLight1337 Jul 2, 2024
00e9e39
Add comment
DarkLight1337 Jul 2, 2024
288bfb9
Merge branch 'main' into mm-image-tokenizer-2
ywang96 Jul 2, 2024
284fca8
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 3, 2024
a231eaf
Update XPU runner's multimodal logic
DarkLight1337 Jul 3, 2024
ec74121
Fix unused import
DarkLight1337 Jul 3, 2024
d16d3c8
Fix feature size calculation
DarkLight1337 Jul 3, 2024
aaa0f1f
Add extra image to test
DarkLight1337 Jul 3, 2024
cc540c3
Support multimodal data for neuron and tpu
DarkLight1337 Jul 3, 2024
48489ef
Fix broadcasting
DarkLight1337 Jul 3, 2024
2adc41f
Fix OpenVINO model runner for multimodal data
DarkLight1337 Jul 3, 2024
0e6845f
Cleanup
DarkLight1337 Jul 3, 2024
Support multimodal data for neuron and tpu
DarkLight1337 committed Jul 3, 2024
commit cc540c3e2b6069587f2afe70c0e1c55f3a2a8fd6
37 changes: 31 additions & 6 deletions vllm/worker/neuron_model_runner.py
@@ -1,5 +1,6 @@
 from dataclasses import dataclass
-from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
+from typing import (TYPE_CHECKING, Any, Dict, List, Mapping, Optional, Tuple,
+                    Union)

 import torch
 from torch import nn
@@ -9,6 +10,8 @@
 from vllm.logger import init_logger
 from vllm.model_executor import SamplingMetadata
 from vllm.model_executor.model_loader.neuron import get_neuron_model
+from vllm.multimodal import (MULTIMODAL_REGISTRY, BatchedTensors,
+                             MultiModalInputs)
 from vllm.sequence import (IntermediateTensors, SamplerOutput,
                            SequenceGroupMetadata)
 from vllm.utils import is_pin_memory_available, make_tensor_with_pad
@@ -29,6 +32,7 @@ class ModelInputForNeuron(ModelRunnerInputBase):
     input_positions: Optional[torch.Tensor] = None
     input_block_ids: Optional[torch.Tensor] = None
     sampling_metadata: Optional["SamplingMetadata"] = None
+    multi_modal_kwargs: Optional[Mapping[str, BatchedTensors]] = None

     def as_broadcastable_tensor_dict(
             self) -> Dict[str, Union[int, torch.Tensor]]:
@@ -65,6 +69,10 @@ def __init__(
         self.device = self.device_config.device
         self.pin_memory = is_pin_memory_available()

+        # Multi-modal data support
+        self.multi_modal_input_mapper = MULTIMODAL_REGISTRY \
+            .create_input_mapper(self.model_config)
+
         # Lazy initialization.
         self.model: nn.Module  # initialize after load_model.

@@ -76,13 +84,15 @@ def load_model(self) -> None:
     def _prepare_prompt(
         self,
         seq_group_metadata_list: List[SequenceGroupMetadata],
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, List[int]]:
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, List[int], Mapping[
+            str, BatchedTensors]]:
         assert len(seq_group_metadata_list) > 0
         input_tokens: List[List[int]] = []
         input_positions: List[List[int]] = []
         input_block_ids: List[int] = []

         seq_lens: List[int] = []
+        multi_modal_inputs_list: List[MultiModalInputs] = []
         for seq_group_metadata in seq_group_metadata_list:
             assert seq_group_metadata.is_prompt
             seq_ids = list(seq_group_metadata.seq_data.keys())
@@ -102,6 +112,12 @@ def _prepare_prompt(
             assert len(block_table) == 1
             input_block_ids.append(block_table[0])

+            mm_data = seq_group_metadata.multi_modal_data
+            if mm_data:
+                # Process multi-modal data
+                mm_kwargs = self.multi_modal_input_mapper(mm_data)
+                multi_modal_inputs_list.append(mm_kwargs)
+
         max_seq_len = max(seq_lens)
         assert max_seq_len > 0
         input_tokens = make_tensor_with_pad(input_tokens,
@@ -118,7 +134,11 @@ def _prepare_prompt(
                                        dtype=torch.long,
                                        device=self.device)

-        return input_tokens, input_positions, input_block_ids, seq_lens
+        multi_modal_kwargs = MultiModalInputs.batch(multi_modal_inputs_list,
+                                                    device=self.device)
+
+        return (input_tokens, input_positions, input_block_ids, seq_lens,
+                multi_modal_kwargs)

     def _prepare_decode(
         self,
@@ -184,8 +204,9 @@ def prepare_model_input(
         is_prompt = seq_group_metadata_list[0].is_prompt
         # Prepare input tensors.
         if is_prompt:
-            (input_tokens, input_positions, input_block_ids,
-             seq_lens) = self._prepare_prompt(seq_group_metadata_list)
+            (input_tokens, input_positions, input_block_ids, seq_lens,
+             multi_modal_kwargs
+             ) = self._prepare_prompt(seq_group_metadata_list)
         else:
             (input_tokens, input_positions,
              input_block_ids) = self._prepare_decode(seq_group_metadata_list)
@@ -203,7 +224,8 @@ def prepare_model_input(
         return ModelInputForNeuron(input_tokens=input_tokens,
                                    input_positions=input_positions,
                                    input_block_ids=input_block_ids,
-                                   sampling_metadata=sampling_metadata)
+                                   sampling_metadata=sampling_metadata,
+                                   multi_modal_kwargs=multi_modal_kwargs)

     @torch.inference_mode()
     def execute_model(
@@ -217,10 +239,13 @@ def execute_model(
             raise ValueError(
                 "NeuronModelRunner does not support multi-step execution.")

+        multi_modal_kwargs = model_input.multi_modal_kwargs or {}
+
         hidden_states = self.model(
             input_ids=model_input.input_tokens,
             positions=model_input.input_positions,
             input_block_ids=model_input.input_block_ids,
+            **multi_modal_kwargs,
         )

         # Compute the logits.
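The neuron runner changes above follow a collect-then-batch pattern: each prompt's multi-modal data is mapped to keyword arguments, the per-prompt results are gathered in a list, and the list is merged once per batch. A minimal, dependency-free sketch of that merge step follows. `batch_kwargs` is a hypothetical stand-in written for illustration; it is not vLLM's `MultiModalInputs.batch`, which takes a `device` argument and whose exact behavior is not shown in this diff.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def batch_kwargs(inputs_list: List[Mapping[str, Any]]) -> Dict[str, List[Any]]:
    """Merge per-prompt keyword dicts into one dict of per-key lists.

    Hypothetical helper for illustration only (not part of vLLM).
    """
    batched: Dict[str, List[Any]] = defaultdict(list)
    for inputs in inputs_list:
        for key, value in inputs.items():
            batched[key].append(value)
    return dict(batched)


# Prompts without multi-modal data add nothing to the list, so a text-only
# batch yields an empty dict that is safe to unpack with ** into a call.
per_prompt = [
    {"pixel_values": [0.1, 0.2]},
    {"pixel_values": [0.3, 0.4]},
]
merged = batch_kwargs(per_prompt)
empty = batch_kwargs([])
```

Under these assumptions, `merged` holds one list entry per prompt for each key, and `empty` is `{}`.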
45 changes: 40 additions & 5 deletions vllm/worker/tpu_model_runner.py
@@ -1,5 +1,5 @@
 import time
-from typing import List, Optional, Tuple
+from typing import List, Mapping, Optional, Tuple

 import numpy as np
 import torch
@@ -12,6 +12,8 @@
 from vllm.logger import init_logger
 from vllm.model_executor.model_loader import get_model
 from vllm.model_executor.sampling_metadata import SamplingMetadata
+from vllm.multimodal import (MULTIMODAL_REGISTRY, BatchedTensors,
+                             MultiModalInputs)
 from vllm.sequence import (CompletionSequenceGroupOutput, Logprob,
                            SamplerOutput, SequenceGroupMetadata,
                            SequenceOutput)
@@ -66,6 +68,10 @@ def __init__(
             False,
         )

+        # Multi-modal data support
+        self.multi_modal_input_mapper = MULTIMODAL_REGISTRY \
+            .create_input_mapper(self.model_config)
+
     def load_model(self) -> None:
         self.device = self.device_config.device

@@ -193,12 +199,14 @@ def warmup_model(
     def _prepare_prompt(
         self,
         seq_group_metadata_list: List[SequenceGroupMetadata],
-    ):
+    ) -> Tuple[torch.Tensor, torch.Tensor, AttentionMetadata, torch.Tensor,
+               Mapping[str, BatchedTensors]]:
         assert len(seq_group_metadata_list) > 0
         input_tokens: List[List[int]] = []
         input_positions: List[List[int]] = []
         prompt_lens: List[int] = []
         slot_mapping: List[List[int]] = []
+        multi_modal_inputs_list: List[MultiModalInputs] = []

         for seq_group_metadata in seq_group_metadata_list:
             assert seq_group_metadata.is_prompt
@@ -224,6 +232,11 @@ def _prepare_prompt(
                 slot = block_number * self.block_size + block_offset
                 slot_mapping[-1].append(slot)

+            mm_data = seq_group_metadata.multi_modal_data
+            if mm_data:
+                mm_kwargs = self.multi_modal_input_mapper(mm_data)
+                multi_modal_inputs_list.append(mm_kwargs)
+
         assert len(prompt_lens) > 0
         num_prefills = len(prompt_lens)
         num_prefill_tokens = sum(prompt_lens)
@@ -261,17 +274,24 @@ def _prepare_prompt(
             block_tables=None,
             context_lens=None,
         )
-        return input_tokens, input_positions, attn_metadata, prompt_lens
+
+        multi_modal_kwargs = MultiModalInputs.batch(multi_modal_inputs_list,
+                                                    device=self.device)
+
+        return (input_tokens, input_positions, attn_metadata, prompt_lens,
+                multi_modal_kwargs)

     def _prepare_decode(
         self,
         seq_group_metadata_list: List[SequenceGroupMetadata],
-    ):
+    ) -> Tuple[torch.Tensor, torch.Tensor, AttentionMetadata, torch.Tensor,
+               Mapping[str, BatchedTensors]]:
         assert len(seq_group_metadata_list) > 0
         input_tokens: List[List[int]] = []
         input_positions: List[List[int]] = []
         slot_mapping: List[List[int]] = []
         context_lens: List[int] = []
+        multi_modal_inputs_list: List[MultiModalInputs] = []

         batch_idx = 0
         for seq_group_metadata in seq_group_metadata_list:
@@ -297,6 +317,11 @@ def _prepare_decode(
                 slot = block_number * self.block_size + block_offset
                 slot_mapping.append([slot])

+            mm_data = seq_group_metadata.multi_modal_data
+            if mm_data:
+                mm_kwargs = self.multi_modal_input_mapper(mm_data)
+                multi_modal_inputs_list.append(mm_kwargs)
+
         batch_size = _get_padded_batch_size(batch_idx)
         num_paddings = batch_size - batch_idx
         input_tokens = input_tokens + [[0]] * num_paddings
@@ -330,7 +355,12 @@ def _prepare_decode(
             block_tables=block_tables,
             context_lens=context_lens,
         )
-        return input_tokens, input_positions, attn_metadata, input_lens
+
+        multi_modal_kwargs = MultiModalInputs.batch(multi_modal_inputs_list,
+                                                    device=self.device)
+
+        return (input_tokens, input_positions, attn_metadata, input_lens,
+                multi_modal_kwargs)

     def _prepare_sample(
         self,
@@ -483,6 +513,7 @@ def forward(
         kv_caches: List[Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]],
         attn_metadata: AttentionMetadata,
         input_lens: torch.Tensor,
+        multi_modal_kwargs: Optional[Mapping[str, BatchedTensors]],
         t: torch.Tensor,
         p: torch.Tensor,
         num_samples: int,
@@ -496,6 +527,8 @@ def forward(
                 memory profiling at initialization.
             attn_metadata: The Pallas attention metadata.
             input_lens: The actual input lengths of shape [batch_size].
+            multi_modal_kwargs: Keyword arguments from multi-modal data to
+                pass to the model.
             t: The sampling temperature of shape [batch_size].
             p: The top-p probability of shape [batch_size].
         """
@@ -535,11 +568,13 @@ def forward(
             slot_mapping = slot_mapping.flatten()
             attn_metadata.slot_mapping = slot_mapping

+        multi_modal_kwargs = multi_modal_kwargs or {}
         hidden_states = self.model(
             token_ids,
             position_ids,
             kv_caches,
             attn_metadata,
+            **multi_modal_kwargs,
         )
         hidden_states = hidden_states.flatten(0, 1)
         logits = self.model.compute_logits(hidden_states, sampling_metadata)
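Both runners normalize the optional mapping with `or {}` before unpacking it into the model call, since unpacking `None` with `**` raises a `TypeError`. The sketch below illustrates that idiom with a toy callable. `toy_model` and `run_forward` are hypothetical names used only for this example and do not correspond to vLLM classes or functions.

```python
from typing import Any, Dict, List, Mapping, Optional


def toy_model(token_ids: List[int], **mm_kwargs: Any) -> Dict[str, Any]:
    """Hypothetical model stub that reports what it received."""
    return {"num_tokens": len(token_ids), "mm_keys": sorted(mm_kwargs)}


def run_forward(
    token_ids: List[int],
    multi_modal_kwargs: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    # None and an empty mapping both become {}, so ** unpacking never fails.
    multi_modal_kwargs = multi_modal_kwargs or {}
    return toy_model(token_ids, **multi_modal_kwargs)


text_only = run_forward([1, 2, 3])
with_image = run_forward([1, 2, 3], {"pixel_values": [0.5]})
```

With this shape, text-only requests and multi-modal requests share one call site, and the model only sees extra keyword arguments when multi-modal data is present.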