Add Video Llava #29733

Merged: 57 commits, merged May 15, 2024
Changes from 1 commit

Commits (57)
dce6678
add model draft
zucchini-nlp Mar 19, 2024
72626df
update docstring
zucchini-nlp Mar 20, 2024
8cca731
add tests
zucchini-nlp Mar 20, 2024
4ea4f70
support image and video as input
zucchini-nlp Mar 20, 2024
c36819d
update for better handling of mixed input and clean-up a bit
zucchini-nlp Mar 21, 2024
c1a8fd5
bug when mixed inputs & add tests
zucchini-nlp Apr 8, 2024
c591c75
Update README.md
zucchini-nlp Apr 8, 2024
5ff8d18
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp Apr 8, 2024
a6bc68d
link to abstract of paper in README
zucchini-nlp Apr 8, 2024
eb309ed
fix test
zucchini-nlp Apr 8, 2024
2f46f6c
fix-copies
zucchini-nlp Apr 8, 2024
6b51b7e
Merge branch 'main' into video_llava
zucchini-nlp Apr 8, 2024
e112958
make tests happy
zucchini-nlp Apr 8, 2024
5cb6163
skip docstest for now
zucchini-nlp Apr 10, 2024
930147d
do not run doctest for now
zucchini-nlp Apr 18, 2024
24ec2b3
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp Apr 18, 2024
142bfc0
Update src/transformers/models/video_llava/processing_video_llava.py
zucchini-nlp Apr 22, 2024
fdec895
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
e83251c
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
4fcfe72
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
327030d
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
33289a5
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 22, 2024
dfef75a
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 22, 2024
ebf1042
address review comments
zucchini-nlp Apr 22, 2024
aa1b278
failing tests
zucchini-nlp Apr 22, 2024
7802922
Fix vocab_size in common tests for VLMs
zucchini-nlp Apr 23, 2024
9fce414
codestyle
zucchini-nlp Apr 23, 2024
e8b4569
Merge branch 'huggingface:main' into video_llava
zucchini-nlp Apr 23, 2024
bb1cc26
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp Apr 29, 2024
e2e92b2
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp Apr 29, 2024
5c77fff
Update src/transformers/models/video_llava/modeling_video_llava.py
zucchini-nlp Apr 29, 2024
99518cb
Update src/transformers/models/video_llava/modeling_video_llava.py
zucchini-nlp Apr 29, 2024
451fd72
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
95a9a01
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
347fa8c
Update src/transformers/models/video_llava/image_processing_video_lla…
zucchini-nlp Apr 30, 2024
3e2f1b4
Update docs/source/en/model_doc/video_llava.md
zucchini-nlp Apr 30, 2024
3cd1222
Update src/transformers/models/video_llava/processing_video_llava.py
zucchini-nlp Apr 30, 2024
242703a
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
9c1a10d
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
b4145e1
Update tests/models/video_llava/test_modeling_video_llava.py
zucchini-nlp Apr 30, 2024
5803d5a
PR suggestions
zucchini-nlp Apr 30, 2024
975d959
fix-copies
zucchini-nlp Apr 30, 2024
7f30e3b
Merge branch 'main' into video_llava
zucchini-nlp Apr 30, 2024
6bdad81
Merge branch 'huggingface:main' into video_llava
zucchini-nlp May 1, 2024
a817f31
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp May 8, 2024
dba80e2
Update src/transformers/models/video_llava/configuration_video_llava.py
zucchini-nlp May 8, 2024
6b3eafb
Merge remote-tracking branch 'upstream/main' into video_llava
zucchini-nlp May 8, 2024
ba4e125
add full example in docs
zucchini-nlp May 8, 2024
6cc8af1
clean-up with new model-id
zucchini-nlp May 10, 2024
885a5ae
[run-slow] video_llava
zucchini-nlp May 10, 2024
377aafe
update docstring
zucchini-nlp May 10, 2024
637b197
Merge branch 'main' into video_llava
zucchini-nlp May 10, 2024
a411347
[run-slow] video_llava
zucchini-nlp May 10, 2024
0d83eaf
Merge branch 'huggingface:main' into video_llava
zucchini-nlp May 14, 2024
8134039
remove all achive maps
zucchini-nlp May 15, 2024
8e15514
fix some tests
zucchini-nlp May 15, 2024
5d1e976
test was supposed to be skipped for llava :)
zucchini-nlp May 15, 2024
Commit 8e15514e09bbbf6f076bafad05f98cabfd2db3ce ("fix some tests")
zucchini-nlp committed May 15, 2024
src/transformers/models/video_llava/modeling_video_llava.py (12 additions, 9 deletions)
```diff
@@ -311,9 +311,6 @@ def _merge_input_ids_with_visual_features(
         final_embedding = torch.zeros(
             batch_size, max_seq_len, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
         )
-        final_attention_mask = torch.zeros(
-            batch_size, max_seq_len, dtype=attention_mask.dtype, device=inputs_embeds.device
-        )
         final_input_ids = torch.full(
             (batch_size, max_seq_len), self.pad_token_id, dtype=input_ids.dtype, device=inputs_embeds.device
         )
```
```diff
@@ -325,12 +322,10 @@ def _merge_input_ids_with_visual_features(
             non_image_indices.to(target_device),
             text_to_overwrite.to(target_device),
         )
-        attention_mask = attention_mask.to(target_device)
 
         # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
         # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
         final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
-        final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
         final_input_ids[batch_indices, text_to_overwrite] = input_ids[batch_indices, non_image_indices]
         if labels is not None:
             final_labels = torch.full(
```
```diff
@@ -354,8 +349,18 @@ def _merge_input_ids_with_visual_features(
         )
 
         final_embedding[image_to_overwrite] = visual_features.contiguous().reshape(-1, embed_dim).to(target_device)
-        final_attention_mask |= image_to_overwrite
-        position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
+
+        if attention_mask is not None:
+            final_attention_mask = torch.zeros(
+                batch_size, max_seq_len, dtype=attention_mask.dtype, device=inputs_embeds.device
+            )
+            attention_mask = attention_mask.to(target_device)
+            final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
+            final_attention_mask |= image_to_overwrite
+            position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
+        else:
+            final_attention_mask = None
+            position_ids = None
 
         return final_embedding, final_attention_mask, final_labels, position_ids, final_input_ids
```
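In effect, this hunk makes the attention mask optional in the merge step: when the caller passes no mask, the method now returns `None` for both `final_attention_mask` and `position_ids` instead of building them from a mask that does not exist. Below is a minimal standalone sketch of that control flow; the shapes and index choices are toy values for illustration, not the model's real tensors:

```python
import torch

def merge_mask(attention_mask, batch_size=2, max_seq_len=6):
    # Pretend positions 0 and 5 hold the two text tokens and 1..4 hold
    # visual tokens (toy stand-in for image_to_overwrite in the real method).
    image_to_overwrite = torch.zeros(batch_size, max_seq_len, dtype=torch.bool)
    image_to_overwrite[:, 1:5] = True

    if attention_mask is not None:
        final_attention_mask = torch.zeros(batch_size, max_seq_len, dtype=attention_mask.dtype)
        # Text positions keep their original mask values...
        final_attention_mask[:, 0] = attention_mask[:, 0]
        final_attention_mask[:, 5] = attention_mask[:, 1]
        # ...and every visual position is switched on.
        final_attention_mask |= image_to_overwrite
        # Position ids follow the cumulative mask, as in the diff above.
        position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_(final_attention_mask == 0, 1)
    else:
        # No mask supplied: propagate None instead of fabricating a mask.
        final_attention_mask, position_ids = None, None
    return final_attention_mask, position_ids

print(merge_mask(torch.ones(2, 2, dtype=torch.long)))  # mask and position ids
print(merge_mask(None))                                 # (None, None)
```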
```diff
@@ -544,8 +549,6 @@ def forward(
                     labels,
                     num_frames=8,
                 )
-                if labels is None:
-                    labels = torch.full_like(attention_mask, self.config.ignore_index).to(torch.long)
             else:
                 # In case input_ids.shape[1] == 1 & past_key_values != None, we are in the case of
                 # generation with cache
```

Review thread on the `num_frames=8` argument:

Collaborator: Will it always be 8?

Member (author): Yes, VideoLlava was trained with 8 video frames and has to be used with 8 frames. I will add this to the "usage tips" section of the model docs page.

Collaborator: In that case, we should validate this at the start of the call and raise an exception if the input isn't the correct shape.

Collaborator: Has this been added? Skimming, I didn't spot it, but I might have just missed it.

Member (author): It was added in `_get_vision_features()`; after that point we can never know how many frames we have.
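For reference, a minimal sketch of the kind of frame-count check the author describes. The function name, argument name, and exact message here are hypothetical; the thread only says such a check was added in the PR's `_get_vision_features()`, so the real implementation may differ:

```python
import torch

NUM_FRAMES = 8  # per the thread, Video-LLaVA was trained with exactly 8 frames per clip

def check_video_frames(pixel_values_videos: torch.Tensor) -> torch.Tensor:
    # Assumed layout: (batch, frames, channels, height, width).
    if pixel_values_videos.ndim != 5 or pixel_values_videos.shape[1] != NUM_FRAMES:
        raise ValueError(
            f"Video-LLaVA expects {NUM_FRAMES} frames per video, got input of shape "
            f"{tuple(pixel_values_videos.shape)}."
        )
    return pixel_values_videos

# A clip with the right shape passes; any other frame count raises.
check_video_frames(torch.randn(1, 8, 3, 224, 224))
```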
tests/models/video_llava/test_modeling_video_llava.py (1 addition, 1 deletion)

```diff
@@ -113,7 +113,7 @@ def __init__(
         self.num_attention_heads = text_config["num_attention_heads"]
         self.is_training = is_training
 
-        self.batch_size = 3
+        self.batch_size = 5
         self.num_channels = 3
         self.image_size = 224
         self.encoder_seq_length = 2044
```