AskVideos-VideoCLIP

Pre-trained & Fine-tuned Checkpoints

Checkpoint	Link
AskVideos-VideoCLIP-v0.1	link

AskVideos-VideoCLIP is a language-grounded video embedding model.
16 frames are sampled from each video clip to generate a video embedding.
The model is trained on 2M clips from WebVid and 1M clips from the AskYoutube dataset.
The model is trained with contrastive and captioning loss to ground the video embeddings to text.

First, install ffmpeg.

apt update
apt install ffmpeg

Then, create a conda environment:

conda create -n askvideosclip python=3.9 
conda activate askvideosclip

Then, install the requiremnts:

pip3 install -U pip
pip3 install -r requirements.txt

python video_clip.py

The demo is also available to run on colab.

AskVideos code and models are distributed under the Apache 2.0 license.

This model is built off of the Video-LLaMA Video-Qformer model.