Joint Video-Text embeddings for search, classification and more.
- AskVideos-VideoCLIP is a language-grounded video embedding model.
- This model produces a single context-aware embedding for each video clip.
- 16 frames are sampled from each video clip to generate a video embedding (see the sampling sketch after this list).
- The model is trained with contrastive and captioning loss to ground the video embeddings to text.
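
As a rough illustration of the frame-sampling step described above (not the repository's own preprocessing code), the sketch below uniformly samples 16 frames from a clip with OpenCV. The file path and the helper name `sample_frames` are illustrative assumptions.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # (num_frames, H, W, 3)

# frames = sample_frames("clip.mp4")  # path is illustrative
```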
| Checkpoint | Link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
| AskVideos-VideoCLIP-v0.3 | link |
The demo is also available to run on Colab.
| Model | Colab link |
|---|---|
| AskVideos-VideoCLIP-v0.1 | link |
| AskVideos-VideoCLIP-v0.2 | link |
First, install ffmpeg.
```bash
apt update
apt install ffmpeg
```
Then, create a conda environment:
```bash
conda create -n askvideosclip python=3.9
conda activate askvideosclip
```
Then, install the requirements:

```bash
pip3 install -U pip
pip3 install -r requirements.txt
```
Finally, run the demo:

```bash
python video_clip.py
```
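
Once clip and text embeddings have been computed, search reduces to ranking clips by similarity between the text embedding and each video embedding. The sketch below shows that ranking step with cosine similarity; the random tensors and the embedding dimension (768) are placeholders standing in for real AskVideos-VideoCLIP outputs, not the repository's API.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for real model outputs:
# one embedding per video clip and one for the text query.
video_embs = F.normalize(torch.randn(100, 768), dim=-1)  # (num_clips, dim)
text_emb = F.normalize(torch.randn(1, 768), dim=-1)      # (1, dim)

# Cosine similarity between the query and every clip.
scores = (text_emb @ video_embs.T).squeeze(0)  # (num_clips,)

# Top-5 clips for the query (search); taking the argmax instead gives
# zero-shot classification when each embedding represents a class prompt.
top_scores, top_idx = scores.topk(5)
print(top_idx.tolist(), top_scores.tolist())
```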
AskVideos code and models are distributed under the Apache 2.0 license.
This model is inspired by the Video Q-Former from Video-LLaMA.
```bibtex
@misc{askvideos2024videoclip,
  title = {AskVideos-VideoCLIP: Language-grounded video embeddings},
  author = {AskVideos},
  year = {2024},
  howpublished = {GitHub},
  url = {https://github.com/AskYoutubeAI/AskVideos-VideoCLIP}
}
```