[ICCV2025] [FakeSTormer] Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
This is an official implementation of FakeSTormer! [📜Paper]
- Coming soon: Release of code and pretrained weights ⏳.
- 08/07/2025: First version of this open-source code pre-released 🌱.
- 26/06/2025: FakeSTormer has been accepted to ICCV2025 🎉.
Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained on real and fake image sequences, which hinders their generalization to unseen generation methods. Moreover, with the constant progress of generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hands-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods.
Results on six datasets (CDF2, DFW, DFD, DFDC, DFDCP, and DiffSwap) under the cross-dataset evaluation setting, reported as video-level AUC (%).
| CDF2 | DFW | DFD | DFDC | DFDCP | DiffSwap |
|---|---|---|---|---|---|
For experimental purposes, we recommend installing the following libraries; either Conda or a Python virtual environment should work. A minimal installation sketch is given after the list.
- CUDA: 11.4
- Python: >= 3.8.x
- PyTorch: 1.8.0
- TensorboardX: 2.5.1
- ImgAug: 0.4.0
- Scikit-image: 0.17.2
- Torchvision: 0.9.0
- Albumentations: 1.1.0
- mmcv: 1.6.1
- natsort: 8.4.0
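
A minimal setup sketch, assuming a Conda environment (named `fakestormer` here only for illustration) and the pinned versions listed above; pick the PyTorch build matching your CUDA installation (e.g. a `+cu111` wheel from the PyTorch index) if needed:

```bash
# Hypothetical environment setup based on the version list above.
conda create -n fakestormer python=3.8 -y
conda activate fakestormer
# PyTorch 1.8.0 / torchvision 0.9.0; choose the wheel matching your CUDA version.
pip install torch==1.8.0 torchvision==0.9.0
pip install tensorboardX==2.5.1 imgaug==0.4.0 scikit-image==0.17.2 \
    albumentations==1.1.0 natsort==8.4.0
# mmcv 1.6.1 is best built from source (see "Prepare environment" below).
```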
- 📌 The pre-trained weights will be released soon!
We further provide an optional Dockerfile that can be used to build a working environment with Docker. More detailed steps can be found here.
- Install Docker on the system (skip this step if Docker has already been installed):

  ```bash
  sudo apt install docker
  ```
- To set up your Docker environment, please go to the folder `dockerfiles`:

  ```bash
  cd dockerfiles
  ```

- Create a Docker image (you can use any name you want):

  ```bash
  docker build --tag 'fakestormer' .
  ```
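
  Once the image is built, a container can be started along the following lines. This is only a sketch: `--gpus all` assumes the NVIDIA Container Toolkit is installed, and the mounted paths are illustrative (the dataset root follows the structure used later in this README).

  ```bash
  # Hypothetical run command; adjust the mounted paths to your setup.
  docker run --gpus all -it --shm-size 8g \
      -v /path/to/FakeSTormer:/workspace/FakeSTormer \
      -v /data/deepfake_cluster/datasets_df:/data/deepfake_cluster/datasets_df \
      fakestormer /bin/bash
  ```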
- Preparation

  - Prepare environment

    Install the main packages listed in the recommended environment above. Note that we recommend building mmcv from source, as below (a quick sanity check is sketched after the commands):
    ```bash
    git clone https://github.com/open-mmlab/mmcv.git
    cd mmcv
    git checkout v1.6.1
    MMCV_WITH_OPS=1 pip install -e .
    ```
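
    As a minimal check (it only assumes the editable install above completed), you can verify the installed version:

    ```bash
    python -c "import mmcv; print(mmcv.__version__)"  # expected to print 1.6.1
    ```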
  - Prepare dataset

    - Download the FF++ Original dataset for training data preparation. Following the original split convention, it is first used to randomly extract frames and facial crops (example invocations are sketched after the directory tree below):
      ```bash
      python package_utils/images_crop.py -d {dataset} \
          -c {compression} \
          -n {num_frames} \
          -t {task}
      ```

      (This script can also be used for cropping faces in other datasets such as CDF2, DFD, DFDCP, and DFDC for the cross-dataset evaluation. You do not need to run cropping for DFW, as that data is already preprocessed.)
      | Parameter | Value | Definition |
      |---|---|---|
      | -d | ['Face2Face', 'Deepfakes', 'FaceSwap', 'NeuralTextures', ...] | Subfolder in each dataset; you can use one of those datasets. |
      | -c | ['raw', 'c23', 'c40'] | You can use one of those compressions. |
      | -n | 256 | Number of frames (default 32 for val/test and 256 for train). |
      | -t | ['train', 'val', 'test'] | Split to process; default train. |

      The cropped faces are saved for online pseudo-fake generation in the training process, following the data structure below:
      ```
      ROOT = '/data/deepfake_cluster/datasets_df'
      ├── Celeb-DFv2
      │   └── ...
      └── FF++
          ├── c0
          ├── c23
          │   ├── test
          │   │   ├── videos
          │   │   │   ├── Deepfakes
          │   │   │   │   ├── 000_003
          │   │   │   │   ├── 044_945
          │   │   │   │   ├── 138_142
          │   │   │   │   └── ...
          │   │   │   ├── Face2Face
          │   │   │   ├── FaceSwap
          │   │   │   ├── NeuralTextures
          │   │   │   └── original
          │   │   └── frames
          │   ├── train
          │   │   ├── videos
          │   │   │   ├── aligned
          │   │   │   │   ├── 001
          │   │   │   │   ├── 002
          │   │   │   │   └── ...
          │   │   │   └── original
          │   │   │       ├── 001
          │   │   │       ├── 002
          │   │   │       └── ...
          │   │   └── frames
          │   └── val
          │       ├── videos
          │       │   ├── aligned
          │       │   └── original
          │       └── frames
          └── c40
      ```
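
      For example (a hedged illustration; the parameter values are taken from the table above, and 'original' is assumed to be a valid -d value since it appears in the directory tree):

      ```bash
      # Crop 256 frames per video from the FF++ original c23 training split.
      python package_utils/images_crop.py -d original -c c23 -n 256 -t train
      # Crop 32 frames per video from the FF++ Deepfakes c23 test split.
      python package_utils/images_crop.py -d Deepfakes -c c23 -n 32 -t test
      ```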
    - Download the pretrained Dlib [81] facial landmark detector and place it into `/pretrained/` for SBI synthesis.
    - Landmark detection. After running the following script, a file storing the metadata of the data is saved at `processed_data/c23/{SPLIT}_FaceForensics_videos_<n_landmarks>.json`:

      ```bash
      python package_utils/geo_landmarks_extraction.py \
          --config configs/data_preprocessing_c23.yaml \
          --extract_landmarks
      ```
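
      To confirm the metadata file was written (a trivial check; the wildcards stand in for the split name and landmark count):

      ```bash
      ls processed_data/c23/*_FaceForensics_videos_*.json
      ```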
- Training script
  We offer a number of config files for different compression levels of the training data. For c23, open `configs/temporal/FakeSTormer_base_c23.yaml`, make sure you set `TRAIN: True` and `FROM_FILE: True`, and run:

  ```bash
  ./scripts/fakestormer_sbi.sh
  ```

  Otherwise, for c0 or c40, the config file is `configs/temporal/FakeSTormer_base_[c0, c40].yaml`. You can also find configs for other network architectures in the `configs/` folder. A short launch sketch is given below.
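
  A minimal launch sketch (assuming the `TRAIN` and `FROM_FILE` flags appear verbatim in the YAML; all paths are taken from the instructions above):

  ```bash
  # Confirm the training flags before launching.
  grep -nE "TRAIN|FROM_FILE" configs/temporal/FakeSTormer_base_c23.yaml
  # Launch training with the provided script.
  ./scripts/fakestormer_sbi.sh
  ```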
- Testing script

  Open `configs/temporal/FakeSTormer_base_c23.yaml` and set `subtask: eval` in the test section to enable evaluation mode. Please also set `TRAIN: False` and `FROM_FILE: False`, then run:

  ```bash
  ./scripts/test_fakestormer.sh
  ```

  For other settings (e.g., data compression levels, network architectures), please change the path to the corresponding config file.
  ⚠️ Please make sure you set the correct path to your downloaded pre-trained weights in the config files.

  ℹ️ Flip test can be enabled by setting `flip_test: True`.

  ℹ️ A single-video inference mode is also provided: set `sub_task: test_vid` and pass a video path as an argument to `test.py` (an illustrative call is sketched below).
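
  A purely hypothetical invocation (the argument names are not documented here, so please check `test.py` for its actual interface; the config path and mode come from the notes above):

  ```bash
  # Illustrative only: the --cfg flag and the positional video argument are assumptions.
  python test.py --cfg configs/temporal/FakeSTormer_base_c23.yaml /path/to/video.mp4
  ```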
Please contact dat.nguyen@uni.lu. Any questions or discussions are welcome!
This software is © University of Luxembourg and is licensed under the SnT academic license. See LICENSE.
We acknowledge the excellent implementation from OpenMMLab (mmengine, mmcv), SBI, and LAA-Net.
Please kindly consider citing our paper in your publications.
```bibtex
@article{nguyen2025vulnerability,
  title={Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection},
  author={Nguyen, Dat and Astrid, Marcella and Kacem, Anis and Ghorbel, Enjie and Aouada, Djamila},
  journal={arXiv preprint arXiv:2501.01184},
  year={2025}
}
```