Expand inputs in processors for VLMs #30962

Merged · 44 commits · Aug 13, 2024
Changes from 1 commit
Commits (44, all by zucchini-nlp)
050657f · let it be · May 20, 2024
a67087e · draft · May 22, 2024
1e2b873 · should not have changed · May 22, 2024
70145d4 · add warnings · May 29, 2024
16a6787 · Merge remote-tracking branch 'upstream/main' into vlm_processors · May 29, 2024
8472035 · fix & add tests · May 29, 2024
13af9e8 · fix tests · May 29, 2024
41d086f · ipnuts embeds cannot be passed with pixels · May 29, 2024
bf59ed6 · more updates · Jun 7, 2024
020e7ed · paligemma ready! · Jun 10, 2024
3e0455c · minor typos · Jun 10, 2024
674f16e · update blip-2 · Jun 10, 2024
42ae646 · fix tests & raise error · Jun 10, 2024
b5259f2 · Merge branch 'main' into vlm_processors · Jun 10, 2024
a6c50de · docstring · Jun 10, 2024
4766e2e · add blip2 test · Jun 10, 2024
d46df90 · Merge branch 'main' into vlm_processors · Jun 10, 2024
f74297b · tmp · Jun 17, 2024
5fc8565 · add image seq length to config · Jun 18, 2024
1b4674a · update docstring · Jun 18, 2024
c3c130b · Merge branch 'main' into vlm_processors · Jun 18, 2024
8438875 · delete · Jun 18, 2024
bf9e637 · fix tests · Jun 18, 2024
db1fa4f · fix blip · Jun 18, 2024
246b06a · fix paligemma · Jun 21, 2024
222bf9a · merge `main` · Jul 18, 2024
5486215 · out-of-place scatter · Jul 18, 2024
78c4484 · add llava-next-video · Jul 18, 2024
d60624e · Update src/transformers/models/blip_2/modeling_blip_2.py · Aug 5, 2024
1973b39 · remove tmp · Aug 5, 2024
a6e380f · merge `main` · Aug 5, 2024
8e88d8b · codestyle · Aug 5, 2024
689eed9 · nits · Aug 6, 2024
28e8054 · more nits · Aug 6, 2024
637e514 · remove overriding in tests · Aug 6, 2024
be939d8 · comprehension when merging video · Aug 6, 2024
232eb7c · fix-copies · Aug 6, 2024
385a617 · revert changes for embeds test · Aug 6, 2024
4831a7e · fix tests after making comprehension · Aug 6, 2024
85fbff9 · Update src/transformers/models/blip_2/processing_blip_2.py · Aug 8, 2024
119178f · Update src/transformers/models/blip_2/processing_blip_2.py · Aug 8, 2024
2451911 · more updates · Aug 8, 2024
414031e · fix tests · Aug 8, 2024
8cfad20 · Merge remote-tracking branch 'upstream/main' into vlm_processors · Aug 9, 2024
out-of-place scatter
zucchini-nlp committed Jul 18, 2024
commit 548621566df47cf720c1c0e38f09d53fa27b39c7
3 changes: 2 additions & 1 deletion src/transformers/models/blip_2/modeling_blip_2.py
@@ -1774,7 +1774,8 @@ def forward(
  # otherwise we expand manually by concating
  if hasattr(self.config, "image_token_index"):
      special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
-     inputs_embeds[special_image_mask] = language_model_inputs.flatten()
+     language_model_inputs = language_model_inputs.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, language_model_inputs)
  else:
      logger.warning_once(
          "Expanding inputs for image tokens in BLIP-2 should be done in processing. "
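The change above replaces the in-place boolean-mask assignment with an out-of-place `masked_scatter`, casting the projected features to the embeddings' device and dtype first, presumably to avoid mutating `inputs_embeds` in place (the commit message simply calls it "out-of-place scatter"). A minimal, self-contained sketch of the pattern, with made-up shapes and a made-up placeholder id rather than the real BLIP-2 tensors:

```python
import torch

# Made-up setup: batch of 2 prompts, 6 tokens each, hidden size 4,
# token id 32000 as the image placeholder, 2 images with 2 patch tokens each.
image_token_index = 32000
input_ids = torch.tensor([[1, 32000, 32000, 5, 6, 7],
                          [1, 2, 32000, 32000, 8, 9]])
inputs_embeds = torch.randn(2, 6, 4)
image_features = torch.randn(2, 2, 4)

# Boolean mask over every embedding element that belongs to an image placeholder.
special_image_mask = (input_ids == image_token_index).unsqueeze(-1).expand_as(inputs_embeds)

# Old pattern (in-place, mutates inputs_embeds):
#   inputs_embeds[special_image_mask] = image_features.flatten()
# New pattern (out-of-place): cast first, then scatter into a fresh tensor.
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

# masked_scatter consumes the source element by element, so the number of masked
# elements must equal image_features.numel() (here 4 placeholder tokens * 4 dims = 16).
assert int(special_image_mask.sum()) == image_features.numel()
```

Both forms write the same values to the same positions; the out-of-place version just returns a new tensor instead of modifying the existing one.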
2 changes: 0 additions & 2 deletions src/transformers/models/blip_2/processing_blip_2.py
@@ -56,8 +56,6 @@ class Blip2Processor(ProcessorMixin):
  image_processor_class = "BlipImageProcessor"
  tokenizer_class = "AutoTokenizer"

-
- # Copied from transformers.models.blip.processing_blip.BlipProcessor.__init__
  def __init__(self, image_processor, tokenizer, num_query_tokens=None, **kwargs):
      tokenizer.return_token_type_ids = False
      self.current_processor = image_processor
3 changes: 2 additions & 1 deletion src/transformers/models/llava/modeling_llava.py
@@ -501,7 +501,8 @@ def forward(
      special_image_mask = (
          (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
      )
-     inputs_embeds[special_image_mask] = image_features.flatten()
+     image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

  outputs = self.language_model(
      attention_mask=attention_mask,
4 changes: 2 additions & 2 deletions src/transformers/models/llava/processing_llava.py
@@ -45,10 +45,10 @@ class LlavaProcessor(ProcessorMixin):
      vision_feature_select_strategy (`str`, *optional*):
          The feature selection strategy used to select the vision feature from the vision backbone.
          Shoudl be same as in model's config
-     image_token (`str`, *optional*, defaults to `"<image>"`):
-         Special token used to denote image location.
      chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
          in a chat into a tokenizable string.
+     image_token (`str`, *optional*, defaults to `"<image>"`):
+         Special token used to denote image location.
  """

  attributes = ["image_processor", "tokenizer"]
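The docstring reshuffle above is cosmetic, but the `image_token` argument (together with the `image_seq_length` that an earlier commit in this PR adds to the config) is what lets the processor expand image placeholders itself, which is what the new modeling-side warnings point to. A rough, hypothetical sketch of that idea, not the actual `LlavaProcessor` code:

```python
# Hypothetical helper: repeat each "<image>" placeholder image_seq_length times so the
# tokenized prompt already reserves one position per image patch embedding.
# The default of 576 is only an assumption for illustration.
def expand_image_tokens(text: str, image_token: str = "<image>", image_seq_length: int = 576) -> str:
    return text.replace(image_token, image_token * image_seq_length)

prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
print(expand_image_tokens(prompt, image_seq_length=4))
# USER: <image><image><image><image>
# What is shown in this image? ASSISTANT:
```

With the prompt expanded this way, the forward pass only has to scatter the image features into the already-reserved positions, which is exactly what the `masked_scatter` branch does.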
3 changes: 2 additions & 1 deletion src/transformers/models/llava_next/modeling_llava_next.py
@@ -867,7 +867,8 @@ def forward(
      special_image_mask = (
          (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
      )
-     inputs_embeds[special_image_mask] = image_features.flatten()
+     image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

  outputs = self.language_model(
      attention_mask=attention_mask,
4 changes: 2 additions & 2 deletions src/transformers/models/llava_next/processing_llava_next.py
@@ -46,10 +46,10 @@ class LlavaNextProcessor(ProcessorMixin):
      vision_feature_select_strategy (`str`, *optional*):
          The feature selection strategy used to select the vision feature from the vision backbone.
          Shoudl be same as in model's config
-     image_token (`str`, *optional*, defaults to `"<image>"`):
-         Special token used to denote image location.
      chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
          in a chat into a tokenizable string.
+     image_token (`str`, *optional*, defaults to `"<image>"`):
+         Special token used to denote image location.
  """

  attributes = ["image_processor", "tokenizer"]
3 changes: 2 additions & 1 deletion src/transformers/models/paligemma/modeling_paligemma.py
@@ -414,7 +414,8 @@ def forward(
f"Got {image_tokens_in_text} image tokens in the text but {image_features.shape[0] * image_features.shape[1]} "
"tokens from image embeddings."
)
inputs_embeds[special_image_mask] = image_features.flatten()
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

causal_mask = self._update_causal_mask(
attention_mask, token_type_ids, inputs_embeds, cache_position, is_training
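PaliGemma adds a sanity check before the scatter: the prompt must contain exactly one placeholder token per image-embedding token, otherwise text and image features would be misaligned. A standalone sketch of that check, using assumed shapes and an assumed placeholder id rather than the actual PaliGemma code:

```python
import torch

image_token_index = 257152                                         # assumed placeholder id
input_ids = torch.tensor([[image_token_index] * 4 + [2, 10, 11]])  # 4 image + 3 text tokens
image_features = torch.randn(1, 4, 8)                              # 1 image x 4 tokens x hidden size 8

image_tokens_in_text = int((input_ids == image_token_index).sum())
tokens_from_embeddings = image_features.shape[0] * image_features.shape[1]
if image_tokens_in_text != tokens_from_embeddings:
    raise ValueError(
        f"Got {image_tokens_in_text} image tokens in the text but "
        f"{tokens_from_embeddings} tokens from image embeddings."
    )
```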
6 changes: 4 additions & 2 deletions src/transformers/models/video_llava/modeling_video_llava.py
@@ -614,13 +614,15 @@ def forward(
      special_image_mask = (
          (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
      )
-     inputs_embeds[special_image_mask] = image_features.flatten()
+     image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

  if video_outputs is not None:
      special_image_mask = (
          (input_ids == self.config.video_token_index).unsqueeze(-1).expand_as(inputs_embeds)
      )
-     inputs_embeds[special_image_mask] = video_features.flatten()
+     video_features = video_features.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, video_features)

  outputs = self.language_model(
      attention_mask=attention_mask,
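Video-LLaVA applies the same rewrite twice, once for image tokens and once for video tokens. That works because `masked_scatter` fills the `True` positions in row-major order with the source elements in order, the same alignment the old `flatten()` assignment relied on, so the two modalities never overwrite each other as long as their placeholder ids differ. A tiny demo with made-up ids and values:

```python
import torch

embeds = torch.zeros(1, 4, 2)                   # 1 sequence, 4 tokens, hidden size 2
ids = torch.tensor([[7, 100, 200, 7]])          # 100 = image token, 200 = video token (assumed ids)
img_feat = torch.full((1, 1, 2), 1.0)           # one image "patch" embedding
vid_feat = torch.full((1, 1, 2), 2.0)           # one video "frame" embedding

img_mask = (ids == 100).unsqueeze(-1).expand_as(embeds)
vid_mask = (ids == 200).unsqueeze(-1).expand_as(embeds)
embeds = embeds.masked_scatter(img_mask, img_feat)
embeds = embeds.masked_scatter(vid_mask, vid_feat)
print(embeds[0, 1], embeds[0, 2])               # tensor([1., 1.]) tensor([2., 2.])
```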
3 changes: 2 additions & 1 deletion src/transformers/models/vipllava/modeling_vipllava.py
@@ -498,7 +498,8 @@ def forward(
      special_image_mask = (
          (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
      )
-     inputs_embeds[special_image_mask] = image_features.flatten()
+     image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+     inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

  outputs = self.language_model(
      attention_mask=attention_mask,