Add Video Llava #29733
Conversation
@LinB203 hey! As we talked about before, here is a draft PR of Video Llava. I checked that the modeling part runs without errors and generates output similar to the original repo. To update model files on the hub, you can use
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@LinB203 pinging in case the first one got lost in notifications :)
FYI @NielsRogge and @amyeroberts
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
I believe we can start reviewing this now. I converted the weights and added them to my hub account temporarily so that we can run and test the model. In the meanwhile, I will be waiting for @LinB203 to update the weights in the organization hub, which I think is required before we merge the PR.
@zucchini-nlp Awesome work! The first thing to do before a full review is resolving the failing tests. Some of these are unrelated, and rebasing on main should resolve them. Looking at the CI, some of the failures are Video-LLaVA specific - let me know if you need any help addressing them!
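For readers following along, the rebase mentioned above is the standard fork workflow; a minimal sketch, assuming `upstream` points at huggingface/transformers and the feature branch is named `video_llava`:

```bash
# Rebase the feature branch on upstream main (remote and branch names assumed)
git fetch upstream
git rebase upstream/main
git push --force-with-lease origin video_llava
```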
Rebased with main and resolved conflicts. The only failing doctest is one that cannot load and run the 7B model in 120 seconds, but I think we will leave it anyway to show how Video-LLaVA works.
@zucchini-nlp You can exclude the model from running in the doc tests by adding it to slow_documentation_tests.txt. Then, once the PR is reviewed, in a steady state, and ready for merging, we can run the slow tests and the documentation tests to make sure everything is correct before merging.
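Concretely, that exclusion might look like the following; the `utils/` location of the file and the doc-path entry format are assumptions about the repo layout, not taken from this thread:

```bash
# Append the model's doc page to the slow-doctest list (path and format assumed)
echo "docs/source/en/model_doc/video_llava.md" >> utils/slow_documentation_tests.txt
```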
Thanks for all the work adding this model - looks great!
```bash
"USER: <video>\n<prompt> ASSISTANT:"
```

For a multi-turn conversation:

```bash
"USER: <video>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
Could you expand this to show a full example using the model? Users typically want to just copy-paste
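A full copy-paste example along those lines might look like the sketch below. This is a hedged reconstruction: it assumes the `LanguageBind/Video-LLaVA-7B-hf` checkpoint linked later in this thread, the `VideoLlavaProcessor` / `VideoLlavaForConditionalGeneration` classes added by this PR, and `av` (PyAV) for decoding 8 frames from a local file:

```python
# Hedged sketch of a full usage example (checkpoint and class names
# assumed from this PR and thread, not from the merged docs).
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

# Decode 8 evenly spaced frames from a local video with PyAV.
container = av.open("sample_video.mp4")
total_frames = container.streams.video[0].frames
indices = set(np.linspace(0, total_frames - 1, num=8).astype(int))
frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in indices
]
video = np.stack(frames)  # shape: (8, height, width, 3)

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```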
If the image processor takes both `images` and `videos` as input, and only one of them is required, then setting `image = None` seems reasonable.
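As a minimal sketch of that suggestion (keyword names assumed from the comment above rather than verified against the final signature):

```python
# Video-only input: pass None for the unused image argument (argument names assumed).
inputs = processor(text=prompt, images=None, videos=video, return_tensors="pt")
```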
```python
input_ids,
attention_mask,
labels,
num_frames=8,
```
Has this been added? Skimming, I didn't spot it, but I might have just missed it.
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Hey @LinB203, can you let us know if you can upload HF weights of VideoLlava to the organization? The model seems ready to be added to the library.
Thanks for your great attention and for bringing the model into transformers. I wonder which organization you mean?
@LinB203 I mean the weights that can be loaded into the
I see. I will make
Finished. https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf/tree/main
@amyeroberts done! The last commit should trigger the slow tests after your approval. A note about the failing code-check: it's failing because of the
@zucchini-nlp Great! :D Could you rebase to include the upstream changes like the
The PR passed all the tests, and the slow tests are passing for me locally. Should be good to go.
@zucchini-nlp Great - let's merge! Do you have permission to do so? I can merge if not.
Just fixed one slow test, will merge when I get all green.
* add model draft
* update docstring
* add tests
* support image and video as input
* update for better handling of mixed input and clean-up a bit
* bug when mixed inputs & add tests
* Update README.md (Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>)
* Merge remote-tracking branch 'upstream/main' into video_llava
* link to abstract of paper in README
* fix test
* fix-copies
* make tests happy
* skip docstest for now
* do not run doctest for now
* Update src/transformers/models/video_llava/processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/video_llava/test_modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* address review comments
* failing tests
* Fix vocab_size in common tests for VLMs
* codestyle
* Update src/transformers/models/video_llava/configuration_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/configuration_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update docs/source/en/model_doc/video_llava.md (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update docs/source/en/model_doc/video_llava.md (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/image_processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update docs/source/en/model_doc/video_llava.md (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/processing_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/video_llava/test_modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/video_llava/test_modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update tests/models/video_llava/test_modeling_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* PR suggestions
* fix-copies
* Update src/transformers/models/video_llava/configuration_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/video_llava/configuration_video_llava.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* add full example in docs
* clean-up with new model-id
* [run-slow] video_llava
* update docstring
* [run-slow] video_llava
* remove all archive maps
* fix some tests
* test was supposed to be skipped for llava :)

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Fixes #29640. Adds a new model, Video-LLaVA, to the library.
This is a draft PR; more will be added here later.
The model now accepts both video and image as input in the same batch. Each visual has its own special token, so we do not need to repeat the `<image>` token 8 times for 8 frames. Here is a short code snippet:
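The snippet itself did not survive extraction; below is a hedged reconstruction of what a mixed image + video batch looks like with the classes this PR adds, assuming the `LanguageBind/Video-LLaVA-7B-hf` checkpoint and using a random array as a stand-in for 8 decoded frames:

```python
# Hedged reconstruction of the missing snippet: one image prompt and one
# video prompt processed together in the same batch (names assumed from the PR).
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("cat.png")  # any RGB image
video = np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8)  # stand-in for 8 decoded frames

prompts = [
    "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "USER: <video>\nWhy is this video funny? ASSISTANT:",
]
inputs = processor(text=prompts, images=image, videos=video, padding=True, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(output_ids, skip_special_tokens=True))
```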