In this repo, we provide two things:
- Pre-extracted feature vectors obtained using Twelve Labs' video foundation model
- PyTorch evaluation code to evaluate and utilize the embeddings

We hope that (1) the published embeddings will help achieve high performance on various downstream tasks and prove valuable for research, and (2) the evaluation source code will serve as a solid baseline for researchers and developers studying video foundation models.

Please refer to our technical report for further details of the evaluation pipeline.
All results will be saved in the `./results` directory.
- Linear Probing
- Kinetics-400
- Something-Something-v2
- Moments-in-Time
- Diving 48
- K-Nearest-Neighbor
- Kinetics-400
- Something-Something-v2
- Moments-in-Time
- Diving 48
- Temporal Action Localization
- ActivityNet v1.3
- THUMOS14
- Temporal Action Segmentation
- 50Salads
- Breakfast
- GTEA
- Embedding Visualization
- Kinetics-400
- Something-Something-v2
- Moments-in-Time
- Diving 48
- Some of the benchmark folders are organized according to how they sample frames (`uniform` or `multi-clip`). If you enter the top-level folder of the dataset, or the directory corresponding to each sampling scheme, you will arrive at the location of the `train` or `train`/`val` folder. This directory is the `--embeddings_dir` for each downstream task (see the layout sketch below).
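  One possible layout, for illustration only (the dataset and sampling names below are assumptions; check the actual folders in this repo):

  ```
  kinetics-400/                  # dataset folder
  └── multi-clip/                # sampling scheme: `uniform` or `multi-clip`
      ├── train/
      │   ├── <video_id>.json
      │   ├── <video_id>_c.npy
      │   └── <video_id>_v.npy
      └── val/
  ```

  In this sketch, `kinetics-400/multi-clip/` would be the `--embeddings_dir`.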
- There are three files corresponding to each video:
  - `[video_id].json`: this JSON file contains the label corresponding to the video, as well as metadata about the duration, the number of frames, and the start and end times of each subclip. Exceptionally, the labels for temporal action segmentation come from external files rather than from this JSON file.
  - `[video_id]_c.npy`: this file contains embedding vectors for each subclip of the video, in the form (number of subclips) × (dimension).
  - `[video_id]_v.npy`: this file contains one embedding vector that represents the entire video. It is the same as `[video_id]_c.npy` for uniform sampling, or when only one clip is defined for the entire video.
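As a quick sanity check, the three files for a video can be loaded with the standard `json` module and NumPy. This is a minimal sketch, not part of the evaluation code: the video ID is a placeholder, and the JSON key names depend on the actual schema, so inspect a real file rather than relying on the names printed here.

```python
import json

import numpy as np

video_id = "abc123"  # placeholder; use an actual video ID from an embeddings folder

# Label and metadata (duration, number of frames, subclip start/end times).
# The exact key names are not assumed here; we just list whatever is present.
with open(f"{video_id}.json") as f:
    meta = json.load(f)
print("metadata keys:", sorted(meta.keys()))

# Per-subclip embeddings: shape (number of subclips, embedding dimension).
clip_embeddings = np.load(f"{video_id}_c.npy")
print("clip embeddings:", clip_embeddings.shape)

# One embedding vector for the entire video. For uniform sampling, or when
# only one clip is defined for the whole video, this matches the clip file.
video_embedding = np.load(f"{video_id}_v.npy")
print("video embedding:", video_embedding.shape)
```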
If you find this project helpful, please feel free to leave a star and cite our paper:
```bibtex
@inproceedings{twelvelabs2024twlv,
  title={TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models},
  author={Twelve Labs},
  year={2024}
}
```