
Fix: videos in LLaVa-OV #195

Merged: 1 commit merged into LLaVA-VL:main on Aug 31, 2024
Conversation

zucchini-nlp (Contributor)

Currently, running the demo notebook for LLaVA OneVision in video modality doesn't apply pooling to all video patches/frames, because the modalities list holds one value per prompt while a video can contain several frames. This PR fixes the demo notebook by replicating the modalities list so that it has one entry per video frame.

I tried to see whether we could expand the modalities inside the modeling code, but it seems hard to infer which visual input is an image and which is a video, so I decided to delegate the expansion to users (see the sketch below).
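
A minimal sketch of the idea, assuming frames are passed one tensor per frame as the demo notebook does; the variable names are illustrative and follow the notebook, not the exact diff:

# Hypothetical sketch: `video_frames` is a list of PIL frames and `image_processor`
# comes from load_pretrained_model, as in the demo notebook.
frame_tensors = [
    image_processor.preprocess([f], return_tensors="pt")["pixel_values"].half().cuda()
    for f in video_frames
]

# Before the fix, modalities held one value per prompt, e.g. ["video"].
# The fix replicates it so every frame tensor has a matching entry.
modalities = ["video"] * len(frame_tensors)

cont = model.generate(
    input_ids,
    images=frame_tensors,
    do_sample=False,
    max_new_tokens=4096,
    modalities=modalities,
)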

Luodian requested a review from kcz358 on Aug 31, 2024 at 06:44
Luodian merged commit 44c862e into LLaVA-VL:main on Aug 31, 2024

kcz358 (Collaborator) commented on Aug 31, 2024

Hi @zucchini-nlp, may I ask what token size you got when you printed it out in your notebook? In #144, I printed out the token size and it seems that all frames had been pooled.

zucchini-nlp (Contributor, PR author) commented on Sep 2, 2024

@kcz358 It is 197 tokens per frame if I don't preprocess the frames with anyres. I guess it should be exactly 196, right? And in that case we shouldn't be appending the newline token to videos?

I am now trying to make sense of how videos work, since I am working on adding the model to transformers. Thanks!

kcz358 (Collaborator) commented on Sep 2, 2024

@zucchini-nlp I think there should be one newline token at the end of all video frames, rather than one image_newline token at the end of each frame. The place where I print in #144 is after the pooling but before the image_newline is concatenated. May I ask how you processed your video frames? If you provide the model with the frames one by one, you likely get 197 tokens per frame, and multiple video modalities need to be provided. But if you provide the video as one batch, you should get 196 tokens per frame and a single newline token at the end of all tokens.
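
A rough sanity check of the two layouts described above, assuming 196 pooled tokens per frame; the frame count of 8 is only illustrative:

# Illustrative token counting for the two input layouts (not actual model code).
num_frames = 8
pooled_per_frame = 196

# Frame by frame: each frame is treated as its own video, so each one
# gets its own trailing newline token -> 197 tokens per frame.
frame_by_frame_total = num_frames * (pooled_per_frame + 1)  # 1576

# One tensor per video: a single newline token is appended after all frames.
one_batch_total = num_frames * pooled_per_frame + 1         # 1569

print(frame_by_frame_total, one_batch_total)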

zucchini-nlp (Contributor, PR author)

@kcz358 Right, I was providing the frames one by one as in the demo notebook, but if I change it to one tensor per video as follows, it works as you described. Thanks!

from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import cv2
import numpy as np
import copy
import torch

import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # pass any other llava_model_args here if needed

model.eval()

# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load the video and preprocess all frames into a single tensor
video_path = "/raid/raushan/karate.mp4"
video_frames = extract_frames(video_path)
# One (num_frames, channels, height, width) tensor covering the whole video
image_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()

conv_template = "qwen_1_5"  # Make sure to use the correct chat template for your model
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
print(image_tensor.shape)

cont = model.generate(
    input_ids,
    images=[image_tensor],   # one tensor containing all frames of the video
    do_sample=False,
    temperature=0,
    top_p=1.0,
    max_new_tokens=4096,
    modalities=["video"]     # one modality entry per video, not per frame
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
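
The key difference from the frame-by-frame version of the notebook is that all frames are preprocessed into a single tensor and passed as one entry in images with a single "video" entry in modalities, so pooling is applied to every frame and only one newline token is appended at the end, matching the behaviour described above.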
