Add Video Llava #29733

Merged
Merged 57 commits into main from video_llava on May 15, 2024

Commits (57); the diff shown below covers changes from 1 commit only (ebf1042, "address review comments")
dce6678  add model draft (zucchini-nlp, Mar 19, 2024)
72626df  update docstring (zucchini-nlp, Mar 20, 2024)
8cca731  add tests (zucchini-nlp, Mar 20, 2024)
4ea4f70  support image and video as input (zucchini-nlp, Mar 20, 2024)
c36819d  update for better handling of mixed input and clean-up a bit (zucchini-nlp, Mar 21, 2024)
c1a8fd5  bug when mixed inputs & add tests (zucchini-nlp, Apr 8, 2024)
c591c75  Update README.md (zucchini-nlp, Apr 8, 2024)
5ff8d18  Merge remote-tracking branch 'upstream/main' into video_llava (zucchini-nlp, Apr 8, 2024)
a6bc68d  link to abstract of paper in README (zucchini-nlp, Apr 8, 2024)
eb309ed  fix test (zucchini-nlp, Apr 8, 2024)
2f46f6c  fix-copies (zucchini-nlp, Apr 8, 2024)
6b51b7e  Merge branch 'main' into video_llava (zucchini-nlp, Apr 8, 2024)
e112958  make tests happy (zucchini-nlp, Apr 8, 2024)
5cb6163  skip docstest for now (zucchini-nlp, Apr 10, 2024)
930147d  do not run doctest for now (zucchini-nlp, Apr 18, 2024)
24ec2b3  Merge remote-tracking branch 'upstream/main' into video_llava (zucchini-nlp, Apr 18, 2024)
142bfc0  Update src/transformers/models/video_llava/processing_video_llava.py (zucchini-nlp, Apr 22, 2024)
fdec895  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 22, 2024)
e83251c  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 22, 2024)
4fcfe72  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 22, 2024)
327030d  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 22, 2024)
33289a5  Update tests/models/video_llava/test_modeling_video_llava.py (zucchini-nlp, Apr 22, 2024)
dfef75a  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 22, 2024)
ebf1042  address review comments (zucchini-nlp, Apr 22, 2024)
aa1b278  failing tests (zucchini-nlp, Apr 22, 2024)
7802922  Fix vocab_size in common tests for VLMs (zucchini-nlp, Apr 23, 2024)
9fce414  codestyle (zucchini-nlp, Apr 23, 2024)
e8b4569  Merge branch 'huggingface:main' into video_llava (zucchini-nlp, Apr 23, 2024)
bb1cc26  Update src/transformers/models/video_llava/configuration_video_llava.py (zucchini-nlp, Apr 29, 2024)
e2e92b2  Update src/transformers/models/video_llava/configuration_video_llava.py (zucchini-nlp, Apr 29, 2024)
5c77fff  Update src/transformers/models/video_llava/modeling_video_llava.py (zucchini-nlp, Apr 29, 2024)
99518cb  Update src/transformers/models/video_llava/modeling_video_llava.py (zucchini-nlp, Apr 29, 2024)
451fd72  Update docs/source/en/model_doc/video_llava.md (zucchini-nlp, Apr 30, 2024)
95a9a01  Update docs/source/en/model_doc/video_llava.md (zucchini-nlp, Apr 30, 2024)
347fa8c  Update src/transformers/models/video_llava/image_processing_video_lla… (zucchini-nlp, Apr 30, 2024)
3e2f1b4  Update docs/source/en/model_doc/video_llava.md (zucchini-nlp, Apr 30, 2024)
3cd1222  Update src/transformers/models/video_llava/processing_video_llava.py (zucchini-nlp, Apr 30, 2024)
242703a  Update tests/models/video_llava/test_modeling_video_llava.py (zucchini-nlp, Apr 30, 2024)
9c1a10d  Update tests/models/video_llava/test_modeling_video_llava.py (zucchini-nlp, Apr 30, 2024)
b4145e1  Update tests/models/video_llava/test_modeling_video_llava.py (zucchini-nlp, Apr 30, 2024)
5803d5a  PR suggestions (zucchini-nlp, Apr 30, 2024)
975d959  fix-copies (zucchini-nlp, Apr 30, 2024)
7f30e3b  Merge branch 'main' into video_llava (zucchini-nlp, Apr 30, 2024)
6bdad81  Merge branch 'huggingface:main' into video_llava (zucchini-nlp, May 1, 2024)
a817f31  Update src/transformers/models/video_llava/configuration_video_llava.py (zucchini-nlp, May 8, 2024)
dba80e2  Update src/transformers/models/video_llava/configuration_video_llava.py (zucchini-nlp, May 8, 2024)
6b3eafb  Merge remote-tracking branch 'upstream/main' into video_llava (zucchini-nlp, May 8, 2024)
ba4e125  add full example in docs (zucchini-nlp, May 8, 2024)
6cc8af1  clean-up with new model-id (zucchini-nlp, May 10, 2024)
885a5ae  [run-slow] video_llava (zucchini-nlp, May 10, 2024)
377aafe  update docstring (zucchini-nlp, May 10, 2024)
637b197  Merge branch 'main' into video_llava (zucchini-nlp, May 10, 2024)
a411347  [run-slow] video_llava (zucchini-nlp, May 10, 2024)
0d83eaf  Merge branch 'huggingface:main' into video_llava (zucchini-nlp, May 14, 2024)
8134039  remove all achive maps (zucchini-nlp, May 15, 2024)
8e15514  fix some tests (zucchini-nlp, May 15, 2024)
5d1e976  test was supposed to be skipped for llava :) (zucchini-nlp, May 15, 2024)
address review comments
zucchini-nlp committed Apr 22, 2024
commit ebf1042ad19793b35b1ba8f4935dce3d0e1c2fb0
21 changes: 20 additions & 1 deletion docs/source/en/model_doc/video_llava.md
@@ -44,7 +44,26 @@ for the LLM*

Tips:

<INSERT TIPS ABOUT MODEL HERE>
- We advise users to use `padding_side="left"` for batched generation, as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.

- Note that the model has not been explicitly trained to process multiple images/videos in the same prompt. Although this is technically possible, you may experience inaccurate results.

- For better results, we recommend that users prompt the model with the correct prompt format (a complete, copy-pasteable sketch follows these tips):


```bash
"USER: <video>\n<prompt> ASSISTANT:"
```

For a multi-turn conversation:

```bash
"USER: <video>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
Collaborator: Could you expand this to show a full example using the model? Users typically want to just copy-paste.


- Note that video inputs should contain exactly 8 frames, since the model was trained in that setting.
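
Putting these tips together, below is a minimal end-to-end sketch. It assumes `av` and `huggingface_hub` are installed, reuses the sample video from the `raushan-testing-hf/videos-test` dataset referenced elsewhere in this PR, and uses the checkpoint id from the current docstring (the model id is renamed later in the PR), so treat it as an illustration rather than the final documentation example:

```python
import av
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor


def read_video_pyav(container, indices):
    """Decode the frames at `indices` from a PyAV container into an array of shape (num_frames, height, width, 3)."""
    frames = []
    container.seek(0)
    for i, frame in enumerate(container.decode(video=0)):
        if i > indices[-1]:
            break
        if i in indices:
            frames.append(frame)
    return np.stack([frame.to_ndarray(format="rgb24") for frame in frames])


model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B")
processor.tokenizer.padding_side = "left"  # recommended above for batched generation

video_path = hf_hub_download(
    repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset"
)
container = av.open(video_path)

# sample exactly 8 uniformly spaced frames, matching the training setting
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt")

generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=60)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```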



This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA).
3 changes: 3 additions & 0 deletions src/transformers/image_utils.py
@@ -65,6 +65,9 @@
] # noqa


VideoInput = Union[np.ndarray, "torch.Tensor", List[np.ndarray], List["torch.Tensor"]] # noqa


class ChannelDimension(ExplicitEnum):
FIRST = "channels_first"
LAST = "channels_last"
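In practice, the new `VideoInput` alias covers both a single array of stacked frames and a list of per-frame arrays; a quick illustrative sketch with synthetic data:

```python
import numpy as np

# both of these satisfy the new VideoInput alias (torch.Tensor equivalents work the same way)
video_as_array = np.zeros((8, 224, 224, 3), dtype=np.uint8)       # one array of 8 stacked RGB frames
video_as_frames = [np.zeros((224, 224, 3), dtype=np.uint8)] * 8   # a list of 8 per-frame arrays
```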
src/transformers/models/video_llava/configuration_video_llava.py
@@ -51,11 +51,9 @@ class VideoLlavaConfig(PretrainedConfig):
The activation function used by the multimodal projector.
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
The feature selection strategy used to select the vision feature from the CLIP backbone.
Can be either "full" to select all features or "default" to select features without `CLS`.
vision_feature_layer (`int`, *optional*, defaults to -2):
The index of the layer to select the vision feature.
vocab_size (`int`, *optional*, defaults to 32000):
Vocabulary size of the VideoLlava model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`~VideoLlavaForConditionalGeneration`]

Example:

@@ -91,7 +89,6 @@ def __init__(
projector_hidden_act="gelu",
vision_feature_select_strategy="default",
vision_feature_layer=-2,
vocab_size=32000,
**kwargs,
):
self.ignore_index = ignore_index
@@ -100,7 +97,6 @@ def __init__(
self.projector_hidden_act = projector_hidden_act
self.vision_feature_select_strategy = vision_feature_select_strategy
Collaborator: We should verify it's one of the two valid types here.

Member (author): `self.vision_feature_select_strategy` is checked a few lines above, in `self._get_vision_features()`: we try to get the feature and raise a ValueError if the strategy is not one of the valid options.
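
For context, a rough sketch of the kind of check being described; the method name follows the comment above, but the body is illustrative rather than the PR's actual code:

```python
# illustrative only: feature selection plus validation of vision_feature_select_strategy
def _get_vision_features(self, pixel_values, vision_feature_layer, vision_feature_select_strategy):
    outputs = self.image_tower(pixel_values, output_hidden_states=True)
    selected_features = outputs.hidden_states[vision_feature_layer]
    if vision_feature_select_strategy == "default":
        selected_features = selected_features[:, 1:]  # drop the CLS token
    elif vision_feature_select_strategy != "full":
        raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")
    return selected_features
```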

self.vision_feature_layer = vision_feature_layer
self.vocab_size = vocab_size

self.vision_config = vision_config

@@ -120,14 +116,12 @@ def __init__(
vocab_size=32000,
projection_dim=768,
)
self.vocab_size = self.vocab_size

self.text_config = text_config

if isinstance(self.text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
self.vocab_size = self.text_config.vocab_size
elif text_config is None:
self.text_config = CONFIG_MAPPING["llama"]()

35 changes: 13 additions & 22 deletions src/transformers/models/video_llava/image_processing_video_llava.py
Collaborator: We should add tests for the image processor, in particular to test that it correctly handles just images, just videos, and image + video inputs.

Member (author): Added tests, but there is one thing to note. If we call the ImageProcessor class directly, it requires the `images` argument to be present. A workaround is to pass `images=None` explicitly to VideoLlavaImageProcessor, which I did for the tests.

I could override `__call__` to make the `images` argument default to `None` so that it is optional, but I am not sure how good an idea overriding `__call__` is. Also, I do not think many people call the image processor directly.

Collaborator: If the image processor takes both images and videos as input, and only one of them is required, then setting `images=None` seems reasonable.
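
For reference, a minimal sketch of the call pattern under discussion, assuming a default-constructed `VideoLlavaImageProcessor` and synthetic frames (shapes and the printed layout are illustrative):

```python
import numpy as np
from transformers import VideoLlavaImageProcessor

image_processor = VideoLlavaImageProcessor()  # default sizes; a real checkpoint would use from_pretrained

# one synthetic video given as a list of 8 RGB frames of shape (height, width, channels)
video = [np.random.randint(0, 255, (360, 640, 3), dtype=np.uint8) for _ in range(8)]

# video-only call: `images` is passed explicitly as None, as done in the new tests
batch = image_processor(images=None, videos=[video], return_tensors="np")
print(batch["pixel_values_videos"].shape)  # expected layout: (batch, num_frames, channels, height, width)
```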

@@ -31,6 +31,7 @@
ChannelDimension,
ImageInput,
PILImageResampling,
VideoInput,
infer_channel_dimension_format,
is_scaled_image,
is_valid_image,
@@ -40,7 +41,7 @@
validate_kwargs,
validate_preprocess_arguments,
)
from ...utils import TensorType, is_torch_available, is_vision_available, logging
from ...utils import TensorType, is_vision_available, logging


logger = logging.get_logger(__name__)
@@ -50,7 +51,7 @@
import PIL


def make_batched_videos(videos) -> List[List[ImageInput]]:
def make_batched_videos(videos) -> List[VideoInput]:
if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0]):
return videos

@@ -205,8 +206,8 @@ def resize(

def preprocess(
self,
images: ImageInput,
videos: VideoInput,
images: List[ImageInput],
videos: List[VideoInput],
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
@@ -288,30 +289,20 @@ def preprocess(
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

if not isinstance(visual_inputs, list):
visual_inputs = [visual_inputs]

images, videos = [], []
for visual in visual_inputs:
if not isinstance(visual, PIL.Image.Image) and len(visual.shape) == 4:
videos.append(visual)
else:
images.append(visual)

if len(images) > 0:
if images is not None:
images = make_list_of_images(images)
elif len(videos) > 0:
if videos is not None:
videos = make_batched_videos(videos)

validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

if not valid_images(videos) or not valid_images(images):
if (videos is not None and not valid_images(videos)) or (images is not None and not valid_images(images)):
raise ValueError(
"Invalid input type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)

if len(videos) > 0:
if videos is not None:
pixel_values_videos = [
[
self._preprocess_image(
@@ -335,7 +326,7 @@ def preprocess(
for video in videos
]

if len(images) > 0:
if images is not None:
pixel_values_images = [
self._preprocess_image(
image=image,
@@ -356,20 +347,20 @@ def preprocess(
for image in images
]

if len(images) > 0 and len(videos) > 0:
if images is not None and videos is not None:
encoded_outputs = BatchFeature(
data={
"pixel_values_videos": pixel_values_videos,
"pixel_values_images": pixel_values_images,
},
tensor_type=return_tensors,
)
elif len(images) > 0:
elif images is not None:
encoded_outputs = BatchFeature(
data={"pixel_values_images": pixel_values_images},
tensor_type=return_tensors,
)
elif len(videos) > 0:
elif videos is not None:
encoded_outputs = BatchFeature(
data={"pixel_values_videos": pixel_values_videos},
tensor_type=return_tensors,
37 changes: 31 additions & 6 deletions src/transformers/models/video_llava/modeling_video_llava.py
@@ -127,7 +127,6 @@ class VideoLlavaPreTrainedModel(PreTrainedModel):
config_class = VideoLlavaConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["CLIPAttention"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn_2 = True

@@ -172,10 +171,14 @@ def _supports_sdpa(self):
[`PreTrainedTokenizer.__call__`] for details.

[What are input IDs?](../glossary#input-ids)
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)):
pixel_values_images (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
The tensors corresponding to the input images. Pixel values can be obtained using
[`AutoImageProcessor`]. See [`VideoLlavaImageProcessor.__call__`] for details ([`VideoLlavaProcessor`] uses
[`VideoLlavaImageProcessor`] for processing images).
pixel_values_videos (`torch.FloatTensor` of shape `(batch_size, num_frames, num_channels, image_size, image_size)`):
The tensors corresponding to the input videos. Pixel values can be obtained using
[`AutoImageProcessor`]. See [`VideoLlavaImageProcessor.__call__`] for details ([`VideoLlavaProcessor`] uses
[`VideoLlavaImageProcessor`] for processing videos).
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

@@ -431,20 +434,42 @@ def forward(
>>> from PIL import Image
>>> import requests
>>> import numpy as np
>>> from decord import VideoReader
>>> import av
>>> from huggingface_hub import hf_hub_download
>>> from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration


>>> def read_video_pyav(container, indices):
... '''
... Decode the video with PyAV decoder.
... Args:
... container (`av.container.input.InputContainer`): PyAV container.
... indices (`List[int]`): List of frame indices to decode.
... Returns:
... result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
... '''
... frames = []
... container.seek(0)
... start_index = indices[0]
... end_index = indices[-1]
... for i, frame in enumerate(container.decode(video=0)):
... if i > end_index:
... break
... if i >= start_index and i in indices:
... frames.append(frame)
... return np.stack([x.to_ndarray(format="rgb24") for x in frames])

>>> model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B")
>>> processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B")

>>> prompt = "USER: <image><image><image><image><image><image><image><image>Why is this video funny? ASSISTANT:"
>>> video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
>>> vr = VideoReader(uri=video_path, height=224, width=224)
>>> container = av.open(video_path)

>>> # sample uniformly 8 frames from the video
>>> indices = np.arange(0, len(vr), len(vr) / 8).astype(int)
>>> frames = vr.get_batch(indices).asnumpy()
>>> total_frames = container.streams.video[0].frames
>>> indices = np.arange(0, total_frames, total_frames / 8).astype(int)
>>> clip = read_video_pyav(container, indices)

>>> inputs = processor(text=prompt, visual_inputs=clip, return_tensors="pt")

11 changes: 5 additions & 6 deletions src/transformers/models/video_llava/processing_video_llava.py
@@ -50,7 +50,8 @@ def __init__(self, image_processor=None, tokenizer=None):
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
visual_inputs: ImageInput = None,
images: ImageInput = None,
videos: ImageInput = None,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length=None,
@@ -64,7 +65,7 @@ def __call__(
of the above two methods for more information.

Args:
text (`str`, `List[str]`, `List[List[str]]`):
text (`TextInput`, `PreTokenizedInput`, `List[TextInput]`, `List[PreTokenizedInput]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
@@ -106,10 +107,8 @@ def __call__(
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if visual_inputs is not None:
image_kwargs = self.image_processor(
visual_inputs=visual_inputs, images=None, return_tensors=return_tensors
)
if images is not None or videos is not None:
image_kwargs = self.image_processor(images=images, videos=videos, return_tensors=return_tensors)
else:
image_kwargs = {}
text_inputs = self.tokenizer(
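For completeness, a hedged sketch of the new processor calling convention with one image prompt and one video prompt in the same batch; the pixel data is synthetic, and the checkpoint id is the one used in the PR's docstring at this commit (it is renamed later in the PR):

```python
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor

processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B")
processor.tokenizer.padding_side = "left"  # recommended for batched generation

prompts = [
    "USER: <image>\nWhat is shown in the image? ASSISTANT:",
    "USER: <video>\nWhy is this video funny? ASSISTANT:",
]
image = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
clip = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(8)]  # 8 synthetic frames

inputs = processor(text=prompts, images=[image], videos=[clip], padding=True, return_tensors="pt")
print(sorted(inputs.keys()))  # expect attention_mask, input_ids, pixel_values_images, pixel_values_videos
```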
6 changes: 3 additions & 3 deletions tests/models/video_llava/test_modeling_video_llava.py
@@ -359,7 +359,7 @@ def test_small_model_integration_test_mixed_inputs(self):
)

prompts = [
"USER: <image>How many cats are there in the image? ASSISTANT:",
"USER: <image>What are the cats in the image doing? ASSISTANT:",
"USER: <video>Why is this video funny? ASSISTANT:",
]
video_file = hf_hub_download(
@@ -373,7 +373,7 @@ def test_small_model_integration_test_mixed_inputs(self):
output = model.generate(**inputs, do_sample=False, max_new_tokens=20)

EXPECTED_DECODED_TEXT = [
'USER: How many cats are there in the image? ASSISTANT: There are two cats in the image. hopefully, they are both sleeping.',
'USER: What are the cats in the image doing? ASSISTANT: The cats in the image are lying down on a red couch, possibly sleeping or rest',
'USER: Why is this video funny? ASSISTANT: The video is funny because the baby is playing with a Wii remote while sitting on a bed'
] # fmt: skip

@@ -519,7 +519,7 @@ def test_video_llava_merge_inputs_error_bug(self):
dtype=torch.float,
device=torch_device,
)
# fmt: off
# fmt: off
input_ids = torch.tensor(
[
[