| Checkpoint | Link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
- AskVideos-VideoCLIP is a language-grounded video embedding model.
- 16 frames are sampled from each video clip to generate a video embedding.
- The model is trained on 2M clips from WebVid and 1M clips from the AskYoutube dataset.
- The model is trained with contrastive and captioning loss to ground the video embeddings to text.
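The 16-frame sampling step above can be sketched as follows. This is a minimal illustration of uniform frame sampling, not the repository's actual implementation; the function name and the midpoint strategy are assumptions.

```python
# Sketch: uniformly sample 16 frame indices from a clip, as described above.
# The helper name and sampling strategy are illustrative assumptions,
# not the repository's actual API.

def sample_frame_indices(num_frames: int, n_samples: int = 16) -> list[int]:
    """Return n_samples indices spread evenly across [0, num_frames)."""
    if num_frames < n_samples:
        # Short clips: repeat the last frame so the model still sees 16 inputs.
        return [min(i, num_frames - 1) for i in range(n_samples)]
    step = num_frames / n_samples
    # Take the midpoint of each of the n_samples equal segments.
    return [int(step * i + step / 2) for i in range(n_samples)]
```

The sampled frames would then be encoded and pooled into a single video embedding.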
First, install ffmpeg:

```shell
apt update
apt install ffmpeg
```
Then, create a conda environment:

```shell
conda create -n askvideosclip python=3.9
conda activate askvideosclip
```
Then, install the requirements:

```shell
pip3 install -U pip
pip3 install -r requirements.txt
```

To run the demo:

```shell
python video_clip.py
```
The demo is also available on Colab.
AskVideos code and models are distributed under the Apache 2.0 license.
This model is built on top of the Video-LLaMA Video-Qformer model.