D-CoDe

[EMNLP 2025πŸ”₯] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

This is the official implementation of D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition by Yiyang Huang, Yizhou Wang, and Yun Fu.

D-CoDe is a training-free framework for adapting image-pretrained vision-language models (VLMs) to video understanding. It achieves strong performance across multiple benchmarks, especially on long-video tasks, demonstrating its potential for complex video-language understanding.

Core Components

The core implementation is in Dcode.py, which provides three main functions:

| Function | Description | Paper Method |
| --- | --- | --- |
| generate_subquestions() | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| supp_frame_selection() | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| token_select_and_merge() | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |

Quick Start

from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model

# 1. Question Decomposition (requires OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original"  # Options: "original", "no_background", "no_temporal_focus", "re"
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,           # List of PIL Images
    N=15,                   # Number of frames to select
    uniform_ratio=0.85,     # Ratio for uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,                  # Tensor (T, N, D)
    top_k=288,                       # Tokens to keep per frame
    merge_strategy="mean",           # Options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8         # Similarity threshold for merging
)
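
As a quick smoke test of the token-compression step in isolation, the snippet below runs token_select_and_merge on random features shaped like the documented (T, N, D) input. This is a minimal sketch: the shapes are illustrative, and it assumes the function accepts CPU tensors and returns a tensor-like result; in the real pipeline the features come from the VLM's vision encoder.

import torch
from Dcode import token_select_and_merge

# Dummy visual features with the documented (T, N, D) layout:
# T = 16 frames, N = 576 tokens per frame, D = 1024 feature dims (illustrative values only)
dummy_features = torch.randn(16, 576, 1024)

merged = token_select_and_merge(
    dummy_features,
    top_k=288,                  # keep 288 tokens per frame before merging
    merge_strategy="mean",
    similarity_threshold=0.8
)

# How much the token count shrinks depends on the similarity threshold
out_shape = merged.shape if hasattr(merged, "shape") else (len(merged),)
print("compressed features:", out_shape)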

Getting Started

Installation

  • The code is developed with CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0.

    1. [Optional but recommended] Create a new conda environment.

      conda create -n d_code python=3.10.12
      

      And activate the environment.

      conda activate d_code
      
    2. Install the requirements.

      bash setup_env.sh
      
    3. Add your OpenAI API key (and, optionally, organization) to the system environment; GPT-3.5-turbo is used for question decomposition and model evaluation.

      export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
      export OPENAI_ORG=$YOUR_OPENAI_ORG  # optional
      
    4. Download pre-trained LLaVA-NeXT weights from HuggingFace, and put them under the Dcode folder.

      git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
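
After completing the steps above, a quick environment check can confirm that CUDA, the OpenAI key, and the downloaded weights are all visible. The snippet below is a hypothetical helper (not part of the repo) and assumes it is run from the Dcode folder.

import os
import torch

# Hypothetical post-install sanity check (run from the Dcode folder)
print("CUDA available :", torch.cuda.is_available())
print("GPU count      :", torch.cuda.device_count())
print("OPENAI_API_KEY :", "set" if os.environ.get("OPENAI_API_KEY") else "missing")
print("OPENAI_ORG     :", "set" if os.environ.get("OPENAI_ORG") else "not set (optional)")
print("LLaVA weights  :", os.path.isdir("liuhaotian/llava-v1.6-vicuna-7b"))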
      

Data Preparation

  1. Ground-truth QA Files: The ground-truth question and answer CSV files are already included in playground/gt_qa_files. These files are prepared based on IG-VLM.

    Available datasets:

    • MSVD-QA (MSVDQA.csv)
    • MSRVTT-QA (MSRVTTQA.csv)
    • TGIF-QA (TGIFFrameQA.csv)
    • ActivityNet-QA (ActivityNetQA.csv)
    • NExT-QA (Next_QA.csv)
    • EgoSchema (EgoSchema.csv)
    • IntentQA (IntentQA.csv)
  2. Download Raw Videos: Download the raw videos from the official websites.

  3. Organize Videos: Organize the raw videos under playground/data.

    • To directly use our data loaders without changing paths, please organize your datasets as follows:

      $ Dcode/playground/data
          β”œβ”€β”€ video_qa
              β”œβ”€β”€ MSVD_Zero_Shot_QA
                  β”œβ”€β”€ videos
                      β”œβ”€β”€ ...
              β”œβ”€β”€ MSRVTT_Zero_Shot_QA
                  β”œβ”€β”€ videos
                      β”œβ”€β”€ all
                          β”œβ”€β”€ ...
              β”œβ”€β”€ TGIF_Zero_Shot_QA
                 β”œβ”€β”€ mp4
                     β”œβ”€β”€ ...
              β”œβ”€β”€ Activitynet_Zero_Shot_QA
                 β”œβ”€β”€ all_test
                     β”œβ”€β”€ ...
          β”œβ”€β”€ multiple_choice_qa
              β”œβ”€β”€ NExTQA
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
              β”œβ”€β”€ EgoSchema
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
              β”œβ”€β”€ IntentQA
                  β”œβ”€β”€ video
                     β”œβ”€β”€ ...
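
To verify that the raw videos are organized as expected, a small path check such as the one below can be run from the Dcode folder. It is a hypothetical helper (not part of the repo); the directory names are taken directly from the layout above.

import os

# Hypothetical check that the dataset folders match the layout above
expected_dirs = [
    "playground/data/video_qa/MSVD_Zero_Shot_QA/videos",
    "playground/data/video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "playground/data/video_qa/TGIF_Zero_Shot_QA/mp4",
    "playground/data/video_qa/Activitynet_Zero_Shot_QA/all_test",
    "playground/data/multiple_choice_qa/NExTQA/video",
    "playground/data/multiple_choice_qa/EgoSchema/video",
    "playground/data/multiple_choice_qa/IntentQA/video",
]

for d in expected_dirs:
    status = "ok" if os.path.isdir(d) else "missing"
    print(f"{status:>7}  {d}")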
      

Inference and Evaluation

D-CoDe is training-free, so inference and evaluation can be run directly without any model training.

By default, 4 GPUs are used for model inference. You can modify CUDA_VISIBLE_DEVICES in the config file to match your own setup.

cd Dcode
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
  • Optionally, run export PYTHONWARNINGS="ignore" to suppress warnings.

Output Structures

  • The inference outputs will be stored under outputs/artifacts.
  • The intermediate outputs of GPT-3.5-turbo will be stored under outputs/eval_save_dir.
  • The evaluation results will be stored under outputs/logs.
  • All of these can be changed in the config file.

Acknowledgement

We extend our gratitude to the following awesome projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.

Citations

If you find this work useful, please cite our paper:

@inproceedings{huang-etal-2025-code,
    title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
    author = "Huang, Yiyang  and
      Wang, Yizhou  and
      Fu, Yun",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "11798--11811",
}

arXiv version:

@article{huang2025d,
    title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
    author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
    journal={arXiv preprint arXiv:2510.08818},
    year={2025}
}
