# [NeurIPS 2024] On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
This repository is the official implementation of MM-Det [NeurIPS 2024 Poster].
## Installation

- Install basic packages

```shell
conda create -n MM_Det python=3.10
conda activate MM_Det
pip install -r requirements.txt
cd LLaVA
pip install -e .
```

- For training, install additional packages
```shell
cd LLaVA
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn==2.5.8 --no-build-isolation
```

## Diffusion Video Forensics (DVF)

We release Diffusion Video Forensics (DVF) as the benchmark for forgery video detection.
The full version of DVF can be downloaded via BaiduNetDisk (Code: 296c) or Google Drive.
We also release a tiny version of DVF for a quick start, in which each dataset contains 10 videos, each with no more than 100 frames. This tiny version can be downloaded via BaiduNetDisk (Code: 77x3) or Google Drive. We also provide the corresponding reconstruction dataset and MM representations for evaluation at the same links. More information on evaluation can be found here.
## Pretrained Weights

We provide the weights for our fine-tuned large multi-modal model, which is based on llava-v1.5-Vicuna-7b from LLaVA. The overall weights for MM-Det without the LMM can be obtained via weights at MM-Det/current_model.pth. Please download the weights and put them in ./weights/.
For the full version of DVF, we provide a ready-made reconstruction dataset at huggingface or BaiduNetDisk (Code: l8h4).
For the full version of DVF, we provide a ready-made dataset of cached MMFR at huggingface or BaiduNetDisk (Code: m6uy). Since the representation is fixed during training and inference, it is recommended to cache it before the overall training to reduce time cost.
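The compute-once, reuse-everywhere idea behind caching can be illustrated with a generic memoize-to-disk sketch (the helper name and on-disk format here are hypothetical; the repository's own caching format may differ):

```python
import pickle
from pathlib import Path

def cached_representation(video_id, compute_fn, cache_dir="./mm_cache"):
    """Load a representation from disk if cached, otherwise compute and save it.

    Because the MM representation is fixed across training and inference,
    the expensive LMM forward pass (compute_fn) only ever runs once per video.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{video_id}.pkl"
    if path.exists():
        with open(path, "rb") as f:
            return pickle.load(f)
    rep = compute_fn(video_id)  # stand-in for the LMM branch
    with open(path, "wb") as f:
        pickle.dump(rep, f)
    return rep
```

With representations cached this way, later training and evaluation runs only perform cheap disk reads instead of loading and running the LMM branch.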
For evaluation on the tiny version of DVF, put all files of the tiny version into ./data and unzip them with the following commands:

```shell
zip -FF DVF_recons_tiny.zip --out DVF_recons_full.zip
for file in DVF_recons_full.zip DVF_tiny.zip mm_representations_tiny.zip; do unzip "$file"; done
```
Then, the data structure will be as follows:

```
-- data
|   -- DVF_tiny
|   -- DVF_recons_tiny          # $RECONSTRUCTION_DATASET_ROOT
|   -- mm_representations_tiny  # $MM_REPRESENTATION_ROOT
```
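As a quick sanity check before launching evaluation, you can verify that the expected subdirectories are in place (this helper is a hypothetical convenience, not part of the repository):

```python
from pathlib import Path

# Directory names follow the tiny-DVF layout shown above.
EXPECTED_DIRS = ["DVF_tiny", "DVF_recons_tiny", "mm_representations_tiny"]

def check_tiny_dvf_layout(data_root):
    """Return the list of expected subdirectories missing under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

missing = check_tiny_dvf_layout("./data")
if missing:
    print(f"Missing under ./data: {missing}")
else:
    print("Tiny DVF layout looks complete.")
```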
For evaluation on the full version of DVF, download the data from Reconstruction Dataset and Multi-Modal Forgery Representation, then put them into ./data. The data structure is organized as follows:

```
-- data
|   -- DVF
|   -- DVF_recons          # $RECONSTRUCTION_DATASET_ROOT
|   -- mm_representations  # $MM_REPRESENTATION_ROOT
```
For evaluation on a customized dataset, details of data preparation can be found in dataset/readme.md.
Make sure the pre-trained weights are organized under ./weights.
Please set $RECONSTRUCTION_DATASET_ROOT and $MM_REPRESENTATION_ROOT in launch-test.sh to the data roots described in Data Structure.
--cache-mm is recommended to save the computational and memory cost of the LMM branch. Then run launch-test.sh to test on the 7 datasets respectively.
```shell
python test.py \
    --classes videocrafter1 zeroscope opensora sora pika stablediffusion stablevideo \
    --ckpt ./weights/MM-Det/current_model.pth \
    --data-root $RECONSTRUCTION_DATASET_ROOT \
    --cache-mm \
    --mm-root $MM_REPRESENTATION_ROOT \
    --sample-size -1
    # --lmm-ckpt llava-7b-1.5-rfrd  # load the LMM from a local path if it cannot be downloaded via the huggingface interface
```

Since the entire evaluation is time-consuming, --sample-size can be specified (e.g., 1000) to reduce time by running inference on only that limited number of videos. To run the complete evaluation, set --sample-size to -1.
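The --sample-size behavior described above can be sketched as follows (a simplified illustration with a hypothetical helper; the actual logic in test.py may differ, e.g., in how videos are selected):

```python
import random

def subsample_videos(videos, sample_size, seed=0):
    """Pick sample_size videos for a faster evaluation pass.

    sample_size == -1 means evaluate every video (the full run);
    otherwise draw that many videos at random without replacement.
    """
    if sample_size == -1 or sample_size >= len(videos):
        return list(videos)
    rng = random.Random(seed)  # fixed seed keeps the quick pass reproducible
    return rng.sample(list(videos), sample_size)

all_videos = [f"video_{i:04d}.mp4" for i in range(5000)]
print(len(subsample_videos(all_videos, -1)))    # full evaluation: 5000
print(len(subsample_videos(all_videos, 1000)))  # quick pass: 1000
```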
Make sure the data preparation described in dataset/readme.md is done.
Please set $YOUR_CUSTOMIZED_DATA_ROOT, $YOUR_CUSTOMIZED_CLASS_NAME, and $MM_REPRESENTATION_ROOT in launch-test-customized.sh.
Then run launch-test-customized.sh.
Our LMM branch is built upon LLaVA, with llava-v1.5-Vicuna-7b as the base model. Our fine-tuned LMM weights can be obtained [here](#pretrained-weights). It is recommended to start the overall training directly from our pretrained LMM weights; otherwise, the fine-tuning result may not be stable.
We directly conduct the visual instruction tuning stage of LLaVA on RFRD, a Gemini-generated instruction dataset. For more information on customized LMM fine-tuning, please refer to LLaVA.
Our ST backbone is based on the Hybrid ViT from pytorch-image-models, specifically vit_base_resnet50_224_in21k. To start training from a pretrained model, you can obtain the pretrained weights from pytorch-image-models; we also provide a direct link at ViT/vit_base_r50_s16_224.orig_in21k. Please put the downloaded weights in ./weights/.
Run the script launch-train.bash for an overall training of our model. It is recommended to first cache the Multi-Modal Forgery Representations at $MM_REPRESENTATION_ROOT. In this case, --cache-mm is specified, and the LMM branch will not be loaded, which saves substantial computation and memory.
```shell
python train.py \
    --data-root $RECONSTRUCTION_DATASET_ROOT \
    --classes youtube stablevideodiffusion \
    --cache-mm \
    --mm-root $MM_REPRESENTATION_ROOT \
    --expt $EXPT_NAME
```

## Acknowledgements

We express our sincere appreciation to the following projects.
- LLaVA
- pytorch-image-models
- pytorch-vqvae
- Stable Diffusion
- VideoCrafter1
- Zeroscope
- OpenSora
- Stable Video Diffusion
## Citation

```bibtex
@inproceedings{on-learning-multi-modal-forgery-representation-for-diffusion-generated-video-detection,
    author    = {Xiufeng Song and Xiao Guo and Jiache Zhang and Qirui Li and Lei Bai and Xiaoming Liu and Guangtao Zhai and Xiaohong Liu},
    title     = {On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection},
    booktitle = {Proceedings of the Thirty-eighth Conference on Neural Information Processing Systems},
    address   = {Vancouver, Canada},
    month     = {December},
    year      = {2024},
}
```

