This is the repo for the Video-LLaMA project, which aims to empower large language models with video and audio understanding capabilities.
- [2024.06.03] 🚀🚀 We officially launch VideoLLaMA2 with stronger performance and an easier-to-use codebase. Come try it out!
- [11.14] ⭐️ The current README file is for Video-LLaMA-2 (LLaMA-2-Chat as language decoder) only. Instructions for using the previous version of Video-LLaMA (Vicuna as language decoder) can be found here.
- [08.03] 🚀🚀 Release Video-LLaMA-2 with Llama-2-7B/13B-Chat as language decoder
  - No more delta weights or separate Q-Former weights: the full weights to run Video-LLaMA are all here 👉 [7B][13B]
  - Allow further customization starting from our pre-trained checkpoints [7B-Pretrained] [13B-Pretrained]
- [06.14] NOTE: The current online interactive demo is primarily for English chatting and it may NOT be a good option to ask Chinese questions since Vicuna/LLaMA does not represent Chinese texts very well.
- [06.13] NOTE: Audio support is ONLY available for Vicuna-7B for now, although we have several VL checkpoints available for other decoders.
- [06.10] NOTE: We have NOT updated the HF demo yet because the whole framework (with the audio branch) cannot run normally on A10-24G. The current running demo is still the previous version of Video-LLaMA. We will fix this issue soon.
- [06.08] 🚀🚀 Release the checkpoints of the audio-supported Video-LLaMA. Documentation and example outputs are also updated.
- [05.22] 🚀🚀 Interactive demo online, try our Video-LLaMA (with Vicuna-7B as language decoder) at Hugging Face and ModelScope!!
- [05.22] ⭐️ Release Video-LLaMA v2 built with Vicuna-7B
- [05.18] 🚀🚀 Support video-grounded chat in Chinese
  - Video-LLaMA-BiLLA: we introduce BiLLa-7B-SFT as the language decoder and fine-tune the video-language aligned model (i.e., the stage 1 model) with machine-translated VideoChat instructions.
  - Video-LLaMA-Ziya: same as Video-LLaMA-BiLLA, but the language decoder is changed to Ziya-13B.
- [05.18] ⭐️ Create a Hugging Face repo to store the model weights of all the variants of our Video-LLaMA.
- [05.15] ⭐️ Release Video-LLaMA v2: we use the training data provided by VideoChat to further enhance the instruction-following capability of Video-LLaMA.
- [05.07] Release the initial version of Video-LLaMA, including its pre-trained and instruction-tuned checkpoints.
- Video-LLaMA is built on top of BLIP-2 and MiniGPT-4. It is composed of two core components: (1) Vision-Language (VL) Branch and (2) Audio-Language (AL) Branch.
  - VL Branch (Visual encoder: ViT-G/14 + BLIP-2 Q-Former)
    - A two-layer video Q-Former and a frame embedding layer (applied to the embeddings of each frame) are introduced to compute video representations.
    - We train the VL Branch on the WebVid-2M video caption dataset with a video-to-text generation task. We also add image-text pairs (~595K image captions from LLaVA) to the pre-training dataset to enhance the understanding of static visual concepts.
    - After pre-training, we further fine-tune the VL Branch using the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat.
  - AL Branch (Audio encoder: ImageBind-Huge)
    - A two-layer audio Q-Former and an audio segment embedding layer (applied to the embedding of each audio segment) are introduced to compute audio representations.
    - Because the audio encoder (i.e., ImageBind) is already aligned across multiple modalities, we train the AL Branch on video/image instruction data only, just to connect the output of ImageBind to the language decoder.
- Only the Video/Audio Q-Former, positional embedding layers, and linear layers are trainable during cross-modal training.
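
As a rough illustration of how a frame embedding layer can feed a video-level Q-Former, the sketch below adds one position vector per frame to every token of that frame and then flattens all frames into a single token sequence. It uses plain Python lists instead of tensors; the function name `add_frame_embeddings` and the toy dimensions are assumptions made for illustration, not the repository's actual code.

```python
from typing import List

Vector = List[float]

def add_frame_embeddings(
    frame_tokens: List[List[Vector]],  # [num_frames][tokens_per_frame][dim]
    frame_pos: List[Vector],           # [num_frames][dim]; learnable in a real model
) -> List[Vector]:
    """Add the frame-level position vector to every token of that frame,
    then flatten the frames into one sequence for a video-level Q-Former."""
    if len(frame_tokens) != len(frame_pos):
        raise ValueError("need exactly one position vector per frame")
    sequence: List[Vector] = []
    for tokens, pos in zip(frame_tokens, frame_pos):
        for tok in tokens:
            sequence.append([t + p for t, p in zip(tok, pos)])
    return sequence

# Toy input: 2 frames, 2 tokens per frame, embedding dimension 3.
frames = [
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    [[3.0, 3.0, 3.0], [4.0, 4.0, 4.0]],
]
positions = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]

video_tokens = add_frame_embeddings(frames, positions)
print(len(video_tokens))  # 4 tokens = 2 frames x 2 tokens
print(video_tokens[2])    # [13.0, 13.0, 13.0]
```

The audio branch follows the same pattern, with audio segments taking the place of frames.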
- Video with background sound
- Video without sound effects
- Static image
Earlier releases distributed checkpoints that stored only the learnable parameters (positional embedding layers, Video/Audio Q-Former, and linear projection layers).
The following checkpoints are the full weights (visual encoder + audio encoder + Q-Formers + language decoder) to launch Video-LLaMA:
Checkpoint | Link | Note |
---|---|---|
Video-LLaMA-2-7B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
Video-LLaMA-2-7B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
Video-LLaMA-2-13B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
Video-LLaMA-2-13B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
First, install ffmpeg.

    apt update
    apt install ffmpeg

Then, create a conda environment:

    conda env create -f environment.yml
    conda activate videollama
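
A quick way to confirm that ffmpeg ended up on the `PATH` is a small check like the one below. This is only a sketch using the standard library; the helper name `tool_available` is made up for this example and is not part of the repository.

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    if tool_available("ffmpeg"):
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it before running the demo")
```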
No prerequisite checkpoints need to be obtained separately anymore: the full-weight checkpoints listed above are all you need.
First, set `llama_model` (the path to the language decoder), `imagebind_ckpt_path` (the path to the audio encoder), `ckpt` (the path to the VL branch) and `ckpt_2` (the path to the AL branch) in `eval_configs/video_llama_eval_withaudio.yaml` accordingly.
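
Before launching the demo, it can help to verify that all four entries have been filled in. The sketch below assumes the relevant part of the YAML file has already been loaded into a plain dictionary (for example with a YAML parser of your choice); where exactly the keys sit inside the file is not specified here, and the helper `unset_keys` and the example paths are hypothetical.

```python
from typing import Dict, List

# The four keys named in the text above.
REQUIRED_KEYS = ("llama_model", "imagebind_ckpt_path", "ckpt", "ckpt_2")

def unset_keys(model_cfg: Dict[str, object]) -> List[str]:
    """Return the required keys that are missing or empty."""
    missing = []
    for key in REQUIRED_KEYS:
        value = model_cfg.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing

# Hypothetical values; replace them with your own local paths.
example_cfg = {
    "llama_model": "/path/to/language_decoder",
    "imagebind_ckpt_path": "/path/to/audio_encoder",
    "ckpt": "/path/to/vl_branch.pth",
    "ckpt_2": "",  # deliberately left empty to show the check
}
print(unset_keys(example_cfg))  # ['ckpt_2']
```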
Then run the script (`--model_type` accepts `llama_v2` or `vicuna`):

    python demo_audiovideo.py \
        --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
        --model_type llama_v2 \
        --gpu-id 0
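
If you prefer to launch the demo from Python (for example, to switch between the two model types programmatically), the command above can be assembled as an argument list. This is a minimal sketch: the flags are the ones shown above, while the wrapper function `demo_command` is an assumption for illustration.

```python
import shlex
from typing import List

def demo_command(model_type: str = "llama_v2", gpu_id: int = 0) -> List[str]:
    """Build the argument list for the demo script shown above."""
    if model_type not in ("llama_v2", "vicuna"):
        raise ValueError("model_type must be 'llama_v2' or 'vicuna'")
    return [
        "python", "demo_audiovideo.py",
        "--cfg-path", "eval_configs/video_llama_eval_withaudio.yaml",
        "--model_type", model_type,
        "--gpu-id", str(gpu_id),
    ]

cmd = demo_command("vicuna")
print(shlex.join(cmd))
# python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --model_type vicuna --gpu-id 0

# To actually launch it from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell line-continuation pitfalls entirely.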
The training of each cross-modal branch (i.e., VL branch or AL branch) in Video-LLaMA consists of two stages:
- Pre-training on the WebVid-2.5M video caption dataset and the LLaVA-CC3M image caption dataset.
- Fine-tuning using the image-based instruction-tuning data from MiniGPT-4/LLaVA and the video-based instruction-tuning data from VideoChat.
Download the metadata and videos following the instructions in the official GitHub repo of WebVid. The folder structure of the dataset is shown below:

    |webvid_train_data
    |──filter_annotation
    |────0.tsv
    |──videos
    |────000001_000050
    |──────1066674784.mp4
    |cc3m
    |──filter_cap.json
    |──image
    |────GCC_train_000000000.jpg
    |────...
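
A small script can confirm that the downloaded data matches this layout before a long training run starts. The sketch below is an assumption-laden example: it supposes that `webvid_train_data` and `cc3m` sit side by side under one root directory, and the function name `check_layout` and the example root path are hypothetical.

```python
from pathlib import Path
from typing import List

def check_layout(root: Path) -> List[str]:
    """Return a list of problems found under `root`; an empty list means
    the folders match the structure shown above."""
    problems = []
    webvid = root / "webvid_train_data"
    cc3m = root / "cc3m"
    if not any((webvid / "filter_annotation").glob("*.tsv")):
        problems.append("no .tsv files in webvid_train_data/filter_annotation")
    if not any((webvid / "videos").glob("*/*.mp4")):
        problems.append("no .mp4 files in webvid_train_data/videos/<subfolder>")
    if not (cc3m / "filter_cap.json").is_file():
        problems.append("cc3m/filter_cap.json is missing")
    if not any((cc3m / "image").glob("*.jpg")):
        problems.append("no .jpg files in cc3m/image")
    return problems

if __name__ == "__main__":
    # Hypothetical location; point this at your own dataset root.
    for problem in check_layout(Path("/path/to/datasets")):
        print("WARNING:", problem)
```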
Configure the checkpoint and dataset paths in `visionbranch_stage1_pretrain.yaml` and `audiobranch_stage1_pretrain.yaml` respectively. Then, run the script:

    conda activate videollama
    # for pre-training VL branch
    torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/visionbranch_stage1_pretrain.yaml
    # for pre-training AL branch
    torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/audiobranch_stage1_pretrain.yaml
For now, the fine-tuning dataset consists of:
- 150K image-based instructions from LLaVA [link]
- 3K image-based instructions from MiniGPT-4 [link]
- 11K video-based instructions from VideoChat [link]
Configure the checkpoint and dataset paths in `visionbranch_stage2_finetune.yaml` and `audiobranch_stage2_finetune.yaml` respectively. Then, run the following script:

    conda activate videollama
    # for fine-tuning VL branch
    torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/visionbranch_stage2_finetune.yaml
    # for fine-tuning AL branch
    torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/audiobranch_stage2_finetune.yaml
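
The training commands for both stages differ only in the config file they point to, so they can be generated from the branch and stage. The sketch below shows one way to do that; the helper `train_command` and the `VL`/`AL` keys are illustrative assumptions, while the config file names follow the ones mentioned in this README.

```python
from typing import List

BRANCH_PREFIX = {"VL": "visionbranch", "AL": "audiobranch"}
STAGE_SUFFIX = {1: "stage1_pretrain", 2: "stage2_finetune"}

def train_command(branch: str, stage: int, gpus: int = 8) -> List[str]:
    """Build the torchrun argument list for one branch ("VL" or "AL")
    and one stage (1 = pre-training, 2 = fine-tuning)."""
    cfg = f"./train_configs/{BRANCH_PREFIX[branch]}_{STAGE_SUFFIX[stage]}.yaml"
    return ["torchrun", f"--nproc_per_node={gpus}", "train.py", "--cfg-path", cfg]

for branch in ("VL", "AL"):
    for stage in (1, 2):
        print(" ".join(train_command(branch, stage)))
```

Lowering `gpus` changes only the number of processes per node; whether training still fits in memory on fewer or smaller GPUs is not covered here.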
- Pre-training: 8xA100 (80G)
- Instruction-tuning: 8xA100 (80G)
- Inference: 1xA100 (40G/80G) or 1xA6000
We are grateful to the following awesome projects, from which Video-LLaMA arose:
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- ImageBind: One Embedding Space To Bind Them All
- LLaMA: Open and Efficient Foundation Language Models
- VideoChat: Chat-Centric Video Understanding
- LLaVA: Large Language and Vision Assistant
- WebVid: A Large-scale Video-Text dataset
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
The logo of Video-LLaMA is generated by Midjourney.
Our Video-LLaMA is just a research preview intended for non-commercial use only. You must NOT use our Video-LLaMA for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.
If you find our project useful, please star our repo and cite our paper as follows:

    @article{damonlpsg2023videollama,
      author = {Zhang, Hang and Li, Xin and Bing, Lidong},
      title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
      year = 2023,
      journal = {arXiv preprint arXiv:2306.02858},
      url = {https://arxiv.org/abs/2306.02858}
    }