D-CoDe

[EMNLP 2025πŸ”₯] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

This is the official implementation of D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition by Yiyang Huang, Yizhou Wang, and Yun Fu.

D-CoDe is a training-free framework for adapting image-pretrained vision-language models (VLMs) to video understanding. It achieves strong performance across multiple benchmarks, especially on long-video tasks, demonstrating its potential for complex video-language understanding.

Core Components

The core implementation is in Dcode.py, which provides three main functions:

| Function | Description | Paper Method |
| --- | --- | --- |
| generate_subquestions() | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| supp_frame_selection() | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| token_select_and_merge() | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |

Quick Start

from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model

# 1. Question Decomposition (requires OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original"  # Options: "original", "no_background", "no_temporal_focus", "re"
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,           # List of PIL Images
    N=15,                   # Number of frames to select
    uniform_ratio=0.85,     # Ratio for uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,                  # Tensor (T, N, D)
    top_k=288,                       # Tokens to keep per frame
    merge_strategy="mean",           # Options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8         # Similarity threshold for merging
)
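
As a quick smoke test of the token-compression step in isolation, the snippet below runs token_select_and_merge on random features shaped like the documented (T, N, D) input. This is a minimal sketch: the shapes are illustrative, and it assumes the function accepts CPU tensors and returns a tensor-like result; in the real pipeline the features come from the VLM's vision encoder.

import torch
from Dcode import token_select_and_merge

# Dummy visual features with the documented (T, N, D) layout:
# T = 16 frames, N = 576 tokens per frame, D = 1024 feature dims (illustrative values only)
dummy_features = torch.randn(16, 576, 1024)

merged = token_select_and_merge(
    dummy_features,
    top_k=288,                  # keep 288 tokens per frame before merging
    merge_strategy="mean",
    similarity_threshold=0.8
)

# How much the token count shrinks depends on the similarity threshold
out_shape = merged.shape if hasattr(merged, "shape") else (len(merged),)
print("compressed features:", out_shape)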

Getting Started

Installation

  • The code is developed with CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0.

    1. [Optional but recommended] Create a new conda environment.

      conda create -n d_code python=3.10.12
      

      And activate the environment.

      conda activate d_code
      
    2. Install the requirements.

      bash setup_env.sh
      
    3. Add your OpenAI API key (and, optionally, organization) to the system environment; GPT-3.5-turbo is used for question decomposition and model evaluation.

      export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
      export OPENAI_ORG=$YOUR_OPENAI_ORG  # optional
      
    4. Download pre-trained LLaVA-NeXT weights from HuggingFace, and put them under the Dcode folder.

      git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
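
After completing the steps above, a quick environment check can confirm that CUDA, the OpenAI key, and the downloaded weights are all visible. The snippet below is a hypothetical helper (not part of the repo) and assumes it is run from the Dcode folder.

import os
import torch

# Hypothetical post-install sanity check (run from the Dcode folder)
print("CUDA available :", torch.cuda.is_available())
print("GPU count      :", torch.cuda.device_count())
print("OPENAI_API_KEY :", "set" if os.environ.get("OPENAI_API_KEY") else "missing")
print("OPENAI_ORG     :", "set" if os.environ.get("OPENAI_ORG") else "not set (optional)")
print("LLaVA weights  :", os.path.isdir("liuhaotian/llava-v1.6-vicuna-7b"))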
      

Data Preparation

  1. Ground-truth QA Files: The ground-truth question and answer CSV files are already included in playground/gt_qa_files. These files are prepared based on IG-VLM.

    Available datasets:

    • MSVD-QA (MSVDQA.csv)
    • MSRVTT-QA (MSRVTTQA.csv)
    • TGIF-QA (TGIFFrameQA.csv)
    • ActivityNet-QA (ActivityNetQA.csv)
    • NExT-QA (Next_QA.csv)
    • EgoSchema (EgoSchema.csv)
    • IntentQA (IntentQA.csv)
  2. Download Raw Videos: Download the raw videos from the official websites.

  3. Organize Videos: Organize the raw videos under playground/data.

    • To directly use our data loaders without changing paths, please organize your datasets as follows:

      $ Dcode/playground/data
          β”œβ”€β”€ video_qa
              β”œβ”€β”€ MSVD_Zero_Shot_QA
                  β”œβ”€β”€ videos
                      β”œβ”€β”€ ...
              β”œβ”€β”€ MSRVTT_Zero_Shot_QA
                  β”œβ”€β”€ videos
                      β”œβ”€β”€ all
                          β”œβ”€β”€ ...
              β”œβ”€β”€ TGIF_Zero_Shot_QA
                 β”œβ”€β”€ mp4
                     β”œβ”€β”€ ...
              β”œβ”€β”€ Activitynet_Zero_Shot_QA
                 β”œβ”€β”€ all_test
                     β”œβ”€β”€ ...
          β”œβ”€β”€ multiple_choice_qa
              β”œβ”€β”€ NExTQA
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
              β”œβ”€β”€ EgoSchema
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
              β”œβ”€β”€ IntentQA
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
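
To verify that the raw videos are organized as expected, a small path check such as the one below can be run from the Dcode folder. It is a hypothetical helper (not part of the repo); the directory names are taken directly from the layout above.

import os

# Hypothetical check that the dataset folders match the layout above
expected_dirs = [
    "playground/data/video_qa/MSVD_Zero_Shot_QA/videos",
    "playground/data/video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "playground/data/video_qa/TGIF_Zero_Shot_QA/mp4",
    "playground/data/video_qa/Activitynet_Zero_Shot_QA/all_test",
    "playground/data/multiple_choice_qa/NExTQA/video",
    "playground/data/multiple_choice_qa/EgoSchema/video",
    "playground/data/multiple_choice_qa/IntentQA/video",
]

for d in expected_dirs:
    status = "ok" if os.path.isdir(d) else "missing"
    print(f"{status:>7}  {d}")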
      

Inference and Evaluation

D-CoDe is training-free, so inference and evaluation can be run directly without any model training.

By default, 4 GPUs are used for model inference. You can modify CUDA_VISIBLE_DEVICES in the config file to match your own setup.

cd Dcode
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
  • Optionally, run export PYTHONWARNINGS="ignore" to suppress warnings.

Output Structures

  • The inference outputs will be stored under outputs/artifacts.
  • The intermediate outputs of GPT-3.5-turbo will be stored under outputs/eval_save_dir.
  • The evaluation results will be stored under outputs/logs.
  • All of these can be changed in the config file.

Acknowledgement

We extend our gratitude to the following awesome projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.

Citations

If you find this work useful, please cite our paper:

@inproceedings{huang-etal-2025-code,
    title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
    author = "Huang, Yiyang  and
      Wang, Yizhou  and
      Fu, Yun",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "11798--11811",
}

arXiv version:

@article{huang2025d,
    title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
    author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
    journal={arXiv preprint arXiv:2510.08818},
    year={2025}
}
