The llava-onevision model video inference code has an error #144
Comments
@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?
I think there is a small error in the Jupyter notebook. Passing ...
Sorry, we found that we wrongly added some video-specific logic in our code. We have now reverted it; please try again with the updated code, thanks!
@Luodian @ZhangYuanhan-AI @kcz358 The reason the first dimension of the `process_images` output for a single image is 16 is that `image_aspect_ratio="anyres_max_9"`. The `anyres_max_9` setting is intended for single-image inference, not for video inference. I tested this with the latest code you modified, and the result is the same: GPU memory usage is still very high (about 57 GB for 24 frames), and the generated tensor still does not have a token dimension of 196.
LLaVA-NeXT/llava/model/llava_arch.py, lines 254 to 258 at commit 16dbbb3
Yes, I agree with you; there is an error in the tutorial again. You should not use `process_images` for the video frames. Use the image processor to handle the frames instead: `image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()`. Thank you for pointing it out; I will check it later and revise the notebook.
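For context, here is a minimal sketch of that frame-level preprocessing, assuming frames are sampled with decord; the `sample_frames` helper and the video path are illustrative, and `image_processor` is assumed to come from the usual `load_pretrained_model` call:

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 24) -> np.ndarray:
    """Uniformly sample num_frames RGB frames as a (T, H, W, C) uint8 array."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    return vr.get_batch(idx).asnumpy()

frames = sample_frames("demo.mp4")

# Plain per-frame preprocessing (resize + normalize) with no anyres tiling,
# so each frame stays a single crop instead of being split into sub-patches:
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()
```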
The code you referenced is located in the `encode_multimodals` method. However, in the main branch of `llava_arch.py`, `encode_multimodals` is commented out. @kcz358
These lines contain the processing logic, not `encode_multimodals`: LLaVA-NeXT/llava/model/llava_arch.py, lines 232 to 236 at commit 3fbf54b
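To make the 196-tokens-per-frame figure concrete, here is a rough sketch of what that pooling step amounts to, assuming bilinear resizing with stride 2; the function name and details are illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def pool_frame_features(image_feat: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Downsample per-frame vision features on a 2D grid.

    image_feat: (num_frames, num_tokens, hidden), where num_tokens is a perfect
    square, e.g. 729 = 27 x 27 for SigLIP-384 with patch size 14.
    """
    num_frames, num_tokens, hidden = image_feat.shape
    side = int(num_tokens ** 0.5)                     # 27 for 729 tokens
    grid = image_feat.view(num_frames, side, side, hidden).permute(0, 3, 1, 2)
    pooled_side = (side + stride - 1) // stride       # ceil(27 / 2) = 14
    grid = F.interpolate(grid, size=(pooled_side, pooled_side), mode="bilinear")
    # 14 x 14 = 196 tokens per frame, matching the number quoted in the paper.
    return grid.permute(0, 2, 3, 1).reshape(num_frames, pooled_side * pooled_side, hidden)
```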
I did as you said and replaced `process_images` with `image_processor`. I printed the shape after the statement `image_features.append(self.get_2dPool(image_feat))`, but 196 still did not appear. I am using the llava-onevision-qwen2-7b-ov checkpoint and ran both a local test and an online test (https://llava-onevision.lmms-lab.com/) on the same video. The results were "yes" and "no", respectively, with the prompt "Is the model changing clothes in the video? Answer the question using a single word or phrase." Clearly the online result was correct and the local result was wrong.
The problem is actually that you are still processing the video with incorrect logic, even though you are using `image_processor`. With the corrected logic, all the video frames are pooled correctly. Hope it helps.
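For readers hitting the same issue, this is roughly what the corrected video path looks like, as far as I can tell: preprocess all frames into one tensor, wrap it in a list, and tell `generate` it is a video via `modalities`. The sketch below reuses the hypothetical `sample_frames` helper from above and assumes `model`, `image_processor`, and `input_ids` were built the usual way; prompt construction and `image_sizes` handling are omitted for brevity.

```python
frames = sample_frames("demo.mp4", num_frames=24)   # (T, H, W, C) uint8
video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()

output_ids = model.generate(
    input_ids,
    images=[video],           # one tensor holding all frames, not one tensor per frame
    modalities=["video"],     # route every frame through the pooled video path
    do_sample=False,
    max_new_tokens=64,
)
```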
Thank you Kaichen, it's great to see the problem has been addressed; I also tested it on my side and it works.
I understand now. In my original approach, I only passed in `[video]`, so the video path only read a single frame; the subsequent frames were all processed as images.
Many thanks for your question. In the tutorial it works normally, but in the video inference code used for evaluation benchmarks, would it still incur huge memory costs?
Yes, the ...
Thanks, the ...
Thanks for your reply. I will try it.
For the llava-onevision model, the official video inference code does not modify the `image_aspect_ratio` parameter, resulting in the use of the default `anyres_max_9`. This causes the `image_features` to occupy a huge amount of GPU memory during inference. Is this problematic? After all, the paper states that each frame consists of 196 tokens, but using `anyres_max_9` results in a number of tokens per frame far exceeding 196. Relevant links:
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
#142
Additionally, why can't I see the logic for each frame corresponding to 196 tokens in the code?
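As a back-of-the-envelope comparison (assuming the SigLIP-384 vision tower with 14-pixel patches, i.e. 27 x 27 = 729 tokens per crop; the exact anyres crop count depends on the image):

```python
tokens_per_crop = 27 * 27         # 729 visual tokens per 384x384 crop
video_tokens_per_frame = 14 * 14  # 196 tokens per frame after 2x pooling (the paper's figure)
num_frames = 24

# Feeding each frame through the single-image anyres path keeps all 729 tokens
# per crop (often several crops per frame), while the video path pools each
# frame down to 196 tokens:
print(num_frames * tokens_per_crop)         # 17496 tokens, even with one crop per frame
print(num_frames * video_tokens_per_frame)  # 4704 tokens
```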