If our project helps you, please give us a star ⭐ and cite our paper
```bash
git clone https://github.com/AntResearchNLP/ViLaMP.git
cd ViLaMP
# python 3.8
pip install -r requirements.txt
```
Please download 🤗 ViLAMP-llava-qwen and place it in `models/`.
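For instance, the checkpoint can be fetched with the `huggingface_hub` CLI; the repo id below is a placeholder assumption, so substitute the actual 🤗 repo id of ViLAMP-llava-qwen:

```bash
# <org>/ViLAMP-llava-qwen is a placeholder -- use the model's actual 🤗 repo id.
huggingface-cli download <org>/ViLAMP-llava-qwen --local-dir models/ViLAMP-llava-qwen
```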
We provide evaluation scripts for five benchmarks: Video-MME, MLVU, LongVideoBench, ActivityNetQA, and EgoSchema. For more details, please refer to scripts/eval/.
Before running these scripts, please download the evaluation datasets and place them in the `dataset/` directory. The input arguments are described below:
- `--dataset_path`: Path to the test dataset.
- `--video_dir`: Path to the folder containing the videos required for testing.
- `--output_dir`: Path to the folder where the results will be saved.
- `--version`: Path to the model.
- `--split`: Split the dataset in the format `i_N`, where `N` is the total number of splits and `i` is the index of the current split (starting from 1). Default: `1_1`.
- `--max_frame_num`: Maximum number of frames to sample per video. Default: `600`.

Here is an example of evaluating Video-MME:
```bash
python exp_vMME.py \
    --dataset_path dataset/Video-MME/videomme/test-00000-of-00001.parquet \
    --video_dir dataset/Video-MME/data \
    --output_dir dataset/Video-MME/output \
    --version models/ViLAMP-llava-qwen \
    --split 1_1 \
    --max_frame_num 600
```
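Since each split is evaluated independently, `--split` can be used to shard a benchmark across processes. A minimal sketch, reusing the arguments above and assuming each shard writes its own results under `--output_dir`:

```bash
# Shard Video-MME into two halves and evaluate them concurrently.
# Each process handles split i of 2; adjust the loop bound for more shards.
for i in 1 2; do
    python exp_vMME.py \
        --dataset_path dataset/Video-MME/videomme/test-00000-of-00001.parquet \
        --video_dir dataset/Video-MME/data \
        --output_dir dataset/Video-MME/output \
        --version models/ViLAMP-llava-qwen \
        --split ${i}_2 \
        --max_frame_num 600 &
done
wait
```

In practice you would typically pin each shard to its own GPU (e.g. via `CUDA_VISIBLE_DEVICES`).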
We offer training scripts for both single-node and multi-node environments. For more detailed instructions, please check the `scripts/train/` directory.

To streamline dataset organization, we use the `training_data.yaml` file to consolidate training data. Before starting training, make sure your dataset is registered in this file. We include a simple example, `example-10.json`, to demonstrate the expected dataset format.
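For illustration only, a registration entry in `training_data.yaml` might look like the sketch below; the field names follow common LLaVA-style data configs and are assumptions, not ViLaMP's confirmed schema:

```yaml
# Hypothetical entry -- check training_data.yaml in the repo for the real schema.
datasets:
  - json_path: dataset/example-10.json   # your dataset in the example-10.json format
    sampling_strategy: all               # e.g. use every sample
```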
Note that if your training dataset requires specific processing, you will need to modify `llava/train/train.py`: insert your custom video processing function just before line 1225 to accommodate your dataset's needs.
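As a rough sketch of what such a hook might look like (the function name, signature, and return format here are assumptions rather than the interface `train.py` actually expects, and `decord` is just one possible decoder):

```python
import numpy as np
from decord import VideoReader, cpu  # example decoder; any frame reader works


def custom_process_video(video_path: str, max_frame_num: int = 600) -> np.ndarray:
    """Hypothetical preprocessing hook: decode a video and uniformly sample
    at most max_frame_num frames. Adapt the signature and return format to
    whatever llava/train/train.py expects at the insertion point."""
    vr = VideoReader(video_path, ctx=cpu(0))
    num_frames = min(len(vr), max_frame_num)
    # Evenly spaced frame indices covering the whole clip.
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8
```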
```bibtex
@article{cheng2025vilamp,
  title={Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation},
  author={Cheng, Chuanqi and Guan, Jian and Wu, Wei and Yan, Rui},
  journal={arXiv preprint arXiv:2504.02438},
  year={2025}
}
```