Fix: videos in LLaVa-OV #195
Conversation
Hi @zucchini-nlp, may I ask what token size you got when you printed it out in your notebook? In #144, I printed the token size and it seems that all frames have been pooled.
@kcz358 It is 197 tokens per frame if I don't preprocess the frames with anyres. I guess it should be exactly 196, right? And in that case we shouldn't be appending the image_newline token per frame. I am now trying to make sense of how videos work since I am working on adding the model to transformers, thanks!
@zucchini-nlp, I think there will be one newline token at the end of all video frames instead of one image_newline token at the end of each frame. The place I print in #144 is after the pooling but before the concat of image_newline. May I ask how you processed your video frames? If you provide the model frames one by one, it is likely you get 197 tokens per frame and multiple image_newline tokens.
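For reference, a back-of-the-envelope sketch of the token arithmetic being discussed (a rough illustration based on the numbers in this thread, not taken from the modeling code; the 14×14 pooled grid per frame is an assumption consistent with the 196 figure above):

```python
# Rough token-count illustration (assumptions: 196 pooled tokens per frame,
# i.e. a 14 x 14 grid, and a single newline token per video, as discussed above).
num_frames = 8
pooled_tokens_per_frame = 14 * 14  # 196

# Whole video passed as one tensor: one newline token appended after all frames.
video_tokens = num_frames * pooled_tokens_per_frame + 1          # 1569

# Frames passed one by one: one image_newline appended per frame -> 197 each.
frame_by_frame_tokens = num_frames * (pooled_tokens_per_frame + 1)  # 1576

print(video_tokens, frame_by_frame_tokens)
```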
@kcz358 Right, I was providing it frame by frame using the demo notebook, but if I change it to one tensor per video as follows, it works as you described. Thanks!

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import cv2
import numpy as np
import requests
import copy
import torch
import sys
import warnings

warnings.filterwarnings("ignore")

pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames

# Load and process video
video_path = "/raid/raushan/karate.mp4"
video_frames = extract_frames(video_path)
image_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
print(image_tensor.shape)

cont = model.generate(
    input_ids,
    images=[image_tensor],
    do_sample=False,
    temperature=0,
    top_p=1.0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
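With this setup the printed `image_tensor.shape` should be a single 4-D tensor of the form `(num_frames, 3, 384, 384)` (an expectation based on the SigLIP-384 image processor, not a value reported in this thread), i.e. the whole video goes to `generate` as one tensor rather than a list of per-frame tensors.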
Currently, running the demo notebook for LLaVA OneVision on the video modality doesn't apply pooling to all video patches/frames, because the modality list holds one value per prompt while a video can contain several frames. This PR replicates the modality list by copying it for all video frames in the demo notebook. I tried to see if we can expand the modalities inside the modeling code, but it seems hard to infer which visual in the input is an image and which is a video, so I decided to delegate the expansion to users.
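A minimal sketch of the change described above (illustrative only, reusing names from the snippet earlier in this thread rather than the exact notebook diff): when frames are passed to the model individually, the "video" modality entry is replicated so that every frame is handled through the video (pooled) path.

```python
# Sketch of the workaround: one modality entry per video frame, not per prompt.
frame_tensors = [t.half().cuda() for t in image_tensor]   # one tensor per frame
modalities = ["video"] * len(frame_tensors)               # replicated per frame

cont = model.generate(
    input_ids,
    images=frame_tensors,
    do_sample=False,
    max_new_tokens=4096,
    modalities=modalities,
)
```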