[EMNLP 2025 🔥] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
This is the official implementation for D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
by Yiyang Huang, Yizhou Wang, and Yun Fu.
D-CoDe is a training-free framework for adapting image-pretrained vision-language models (VLMs) to video understanding. It achieves strong performance across multiple benchmarks, especially on long-video tasks, demonstrating its potential for complex video-language understanding.
The core implementation is in Dcode.py, which provides three main functions:
| Function | Description | Paper Method |
|---|---|---|
| generate_subquestions() | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| supp_frame_selection() | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| token_select_and_merge() | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |
Example usage:

```python
from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model
# 1. Question Decomposition (requires OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
question="What did the person do after picking up the cup?",
prompt_variant="original" # Options: "original", "no_background", "no_temporal_focus", "re"
)
# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
video_frames, # List of PIL Images
N=15, # Number of frames to select
uniform_ratio=0.85, # Ratio for uniform sampling
clip_model=clip_model,
clip_processor=clip_processor
)
# 3. Token Selection and Merge
merged_features = token_select_and_merge(
image_features, # Tensor (T, N, D)
top_k=288, # Tokens to keep per frame
merge_strategy="mean", # Options: "mean", "max", "weighted_mean"
similarity_threshold=0.8 # Similarity threshold for merging
)
```
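For reference, the three functions above can be chained into a single pre-processing pass before the VLM is queried. The sketch below only illustrates that flow under stated assumptions: encode_frames and run_vlm are hypothetical stand-ins for the LLaVA-NeXT vision encoder and language-model call (not provided by Dcode.py), and generate_subquestions is assumed to return a list of strings.

```python
from Dcode import (
    generate_subquestions,
    supp_frame_selection,
    token_select_and_merge,
    load_clip_model,
)

def answer_video_question(video_frames, question, encode_frames, run_vlm):
    """Hypothetical end-to-end pass: decompose the question, compress the video,
    then query an image-pretrained VLM with the compressed representation."""
    # 1. Question decomposition (requires OPENAI_API_KEY).
    subquestions = generate_subquestions(question=question, prompt_variant="original")

    # 2. Dynamic compression (frame): keep a compact, semantically diverse frame set.
    clip_processor, clip_model = load_clip_model()
    selected_frames, _ = supp_frame_selection(
        video_frames,
        N=15,
        uniform_ratio=0.85,
        clip_model=clip_model,
        clip_processor=clip_processor,
    )

    # 3. Dynamic compression (token): prune and merge redundant visual tokens.
    image_features = encode_frames(selected_frames)  # placeholder; expected shape (T, N, D)
    merged_features = token_select_and_merge(
        image_features,
        top_k=288,
        merge_strategy="mean",
        similarity_threshold=0.8,
    )

    # 4. Fold the sub-questions into the prompt and query the VLM (placeholder call).
    prompt = question + "\nConsider these sub-questions:\n" + "\n".join(subquestions)
    return run_vlm(merged_features, prompt)
```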
- The code is developed with CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0. (A quick environment sanity-check sketch follows this setup list.)
- [Optional but recommended] Create a new conda environment.

  conda create -n d_code python=3.10.12

  And activate the environment.

  conda activate d_code

- Install the requirements.

  bash setup_env.sh

- Add your OpenAI key and organization to the system environment to use GPT-3.5-turbo for question decomposition and model evaluation.

  export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
  export OPENAI_ORG=$YOUR_OPENAI_ORG  # optional
- Download the pre-trained LLaVA-NeXT weights from HuggingFace and put them under the Dcode folder.

  git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
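After these steps, a quick sanity check along the following lines (not part of the repository) can confirm that the environment matches the versions above, that the OpenAI variables are visible, and that the LLaVA-NeXT weights are in place:

```python
import os

import torch

# Reference versions from above: Python >= 3.10.12, PyTorch >= 2.1.0, CUDA 11.7.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| CUDA build:", torch.version.cuda)

# GPT-3.5-turbo calls need OPENAI_API_KEY; OPENAI_ORG is optional.
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
print("OPENAI_ORG set:", bool(os.environ.get("OPENAI_ORG")))

# The LLaVA-NeXT weights are expected under the Dcode folder.
print("LLaVA-NeXT weights found:", os.path.isdir("liuhaotian/llava-v1.6-vicuna-7b"))
```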
- Ground-truth QA Files: The ground-truth question and answer CSV files are already included in playground/gt_qa_files. These files are prepared based on IG-VLM. Available datasets:

  - MSVD-QA (MSVDQA.csv)
  - MSRVTT-QA (MSRVTTQA.csv)
  - TGIF-QA (TGIFFrameQA.csv)
  - ActivityNet-QA (ActivityNetQA.csv)
  - NExT-QA (Next_QA.csv)
  - EgoSchema (EgoSchema.csv)
  - IntentQA (IntentQA.csv)
- Download Raw Videos: Download the raw videos from the official websites.
  - Open-ended VideoQA
    - [Recommended] Option 1: Follow the instructions in Video-LLaVA to download the raw videos.
    - Option 2: Download the videos from the data owners.
  - Multiple Choice VideoQA: Download the videos from the official dataset websites.
- Organize Videos: Organize the raw videos under playground/data.
- To directly use our data loaders without changing paths, please organize your datasets as follows (a layout-check sketch follows this list):

  ```
  $ Dcode/playground/data
  ├── video_qa
  │   ├── MSVD_Zero_Shot_QA
  │   │   ├── videos
  │   │   └── ...
  │   ├── MSRVTT_Zero_Shot_QA
  │   │   ├── videos
  │   │   │   └── all
  │   │   └── ...
  │   ├── TGIF_Zero_Shot_QA
  │   │   ├── mp4
  │   │   └── ...
  │   └── Activitynet_Zero_Shot_QA
  │       ├── all_test
  │       └── ...
  └── multiple_choice_qa
      ├── NExTQA
      │   ├── video
      │   └── ...
      ├── EgoSchema
      │   ├── video
      │   └── ...
      └── IntentQA
          ├── video
          └── ...
  ```
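Before running inference, a small layout check such as the one below (not part of the repository) can confirm that the video folders and ground-truth CSV files are where the loaders expect them; the paths simply follow the layout and file names listed above.

```python
import os

import pandas as pd

DATA_ROOT = "playground/data"
GT_ROOT = "playground/gt_qa_files"

# Video directories from the layout above.
video_dirs = [
    "video_qa/MSVD_Zero_Shot_QA/videos",
    "video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "video_qa/TGIF_Zero_Shot_QA/mp4",
    "video_qa/Activitynet_Zero_Shot_QA/all_test",
    "multiple_choice_qa/NExTQA/video",
    "multiple_choice_qa/EgoSchema/video",
    "multiple_choice_qa/IntentQA/video",
]
for rel_dir in video_dirs:
    path = os.path.join(DATA_ROOT, rel_dir)
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")

# Ground-truth QA CSVs shipped in playground/gt_qa_files.
gt_csvs = ["MSVDQA.csv", "MSRVTTQA.csv", "TGIFFrameQA.csv",
           "ActivityNetQA.csv", "Next_QA.csv", "EgoSchema.csv", "IntentQA.csv"]
for csv_name in gt_csvs:
    csv_path = os.path.join(GT_ROOT, csv_name)
    if os.path.isfile(csv_path):
        print(f"{csv_path}: {len(pd.read_csv(csv_path))} QA rows")
    else:
        print(f"{csv_path}: MISSING")
```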
D-CoDe is a training-free method, so inference and evaluation can be run directly without any model training.
By default, we use 4 GPUs for model inference. You can modify CUDA_VISIBLE_DEVICES in the config file to accommodate your own setup.
```bash
cd Dcode
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
```
- This is optional, but use export PYTHONWARNINGS="ignore" if you want to suppress warnings.
- The inference outputs will be stored under outputs/artifacts.
- The intermediate outputs of GPT-3.5-turbo will be stored under outputs/eval_save_dir.
- The evaluation results will be stored under outputs/logs.
- All of these paths can be changed in the config file.
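To run several benchmarks back to back, a small driver like the sketch below can be used; the config file names are hypothetical placeholders for your own experiment configs.

```python
import subprocess

# Hypothetical config paths; replace them with your own experiment configs.
config_files = [
    "configs/msvd_qa.yaml",
    "configs/activitynet_qa.yaml",
    "configs/egoschema.yaml",
]

for cfg in config_files:
    print(f"=== Running inference with {cfg} ===")
    # Equivalent to: python run_inference.py --exp_config <cfg>
    subprocess.run(["python", "run_inference.py", "--exp_config", cfg], check=True)
```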
We extend our gratitude to the following awesome projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.
If you find this work useful, please cite our paper:
@inproceedings{huang-etal-2025-code,
title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
author = "Huang, Yiyang and
Wang, Yizhou and
Fu, Yun",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
year = "2025",
pages = "11798--11811",
}

arXiv version:
@article{huang2025d,
title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
journal={arXiv preprint arXiv:2510.08818},
year={2025}
}