Dynamic Facial Expression Recognition (DFER) is facing a supervised dilemma. On the one hand, current efforts in DFER focus on developing various deep supervised models but achieve only incremental progress, which is mainly attributed to the longstanding lack of large-scale, high-quality datasets. On the other hand, due to the ambiguity and subjectivity in facial expression perception, acquiring large-scale, high-quality DFER samples is highly time-consuming and labor-intensive. Considering that there are massive unlabeled facial videos on the Internet, this work explores a new way (i.e., self-supervised learning) to fully exploit large-scale unlabeled data and largely advance the development of DFER.
Inspired by the recent success of VideoMAE, MAE-DFER makes an early attempt to devise a novel masked-autoencoder-based self-supervised framework for DFER. It improves VideoMAE by developing an efficient LGI-Former as the encoder and introducing joint masked appearance and motion modeling. With these two core designs, MAE-DFER largely reduces the computational cost (about 38% FLOPs) during fine-tuning while achieving comparable or even better performance.
Figure: The architecture of LGI-Former.
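The figure covers the encoder; the second core design, joint masked appearance and motion modeling, can be illustrated with a minimal sketch. The details below are assumptions for illustration only: appearance is taken as the raw frames, motion as temporal frame differences, and `lambda_motion` is a hypothetical weighting hyperparameter. The actual MAE-DFER objective may differ in its target construction and normalization.

```python
import torch.nn.functional as F

def joint_masked_loss(pred_app, pred_mot, video, mask_app, mask_mot, lambda_motion=0.5):
    """Masked reconstruction loss over appearance and motion targets (illustrative sketch).

    video:    (B, T, C, H, W) input clip
    pred_app: predicted raw frames, same shape as video
    pred_mot: predicted frame differences, shape (B, T-1, C, H, W)
    mask_app / mask_mot: boolean masks selecting the masked (reconstructed) positions
    lambda_motion: assumed name for the appearance/motion trade-off weight
    """
    target_app = video                            # appearance target: the frames themselves
    target_mot = video[:, 1:] - video[:, :-1]     # motion target: temporal frame differences
    loss_app = F.mse_loss(pred_app[mask_app], target_app[mask_app])
    loss_mot = F.mse_loss(pred_mot[mask_mot], target_mot[mask_mot])
    return (1.0 - lambda_motion) * loss_app + lambda_motion * loss_mot
```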
Extensive experiments on six DFER datasets show that our MAE-DFER consistently outperforms the previous best supervised methods by significant margins (+5∼8% UAR on three in-the-wild datasets and +7∼12% WAR on three lab-controlled datasets), which demonstrates that it can learn powerful dynamic facial representations for DFER via large-scale self-supervised pre-training. We believe MAE-DFER has paved a new way for the advancement of DFER and can inspire more relevant research in this field and even other related tasks (e.g., dynamic micro-expression recognition and facial action unit detection).
- UAR (Unweighted Average Recall):

  $\text{UAR} = \frac{1}{C} \sum_{i=1}^{C} \text{Accuracy}_i$

- WAR (Weighted Average Recall):

  $\text{WAR} = \sum_{i=1}^{C} \frac{n_i}{N} \, \text{Accuracy}_i$

where:

- $C$ is the number of classes
- $n_i$ is the number of samples in class $i$
- $N$ is the total number of samples
- $\text{Accuracy}_i$ is the accuracy (per-class recall) for class $i$

WAR is the more commonly used metric; with these definitions it equals the overall accuracy.
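For reference, here is a small sketch of how the two metrics can be computed from predictions and ground-truth labels (function and variable names are illustrative, not from this repo):

```python
import numpy as np

def compute_uar_war(y_true, y_pred, num_classes):
    """UAR: mean of per-class accuracies; WAR: overall (sample-weighted) accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class_acc = []
    for c in range(num_classes):
        idx = (y_true == c)
        if idx.sum() > 0:                       # skip classes absent from this split
            per_class_acc.append((y_pred[idx] == c).mean())
    uar = float(np.mean(per_class_acc))
    war = float((y_pred == y_true).mean())      # equivalent to sum_i (n_i / N) * Accuracy_i
    return uar, war

# Example: uar, war = compute_uar_war([0, 0, 1, 2], [0, 1, 1, 2], num_classes=3)
```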
The environment has been tested with both Python 3.8 and Python 3.10.
```bash
conda create -n <your_env_name> python=3.10
pip install -r requirement.txt
```
- Clone the VideoMamba repo:

```bash
git clone https://github.com/OpenGVLab/VideoMamba.git
```

- Install its dependencies:

```bash
cd VideoMamba
pip install -e causal-conv1d
pip install -e mamba
```
Please follow the files (e.g., dfew.py) in preprocess for data preparation.
Specifically, you need to generate annotations for the dataloader ("<path_to_video> <video_class>" in annotations); a hedged sketch of generating such a file is given after the examples below.
The annotation usually includes train.csv, val.csv and test.csv. The format of a *.csv file is like:

```
dataset_root/video_1 label_1
dataset_root/video_2 label_2
dataset_root/video_3 label_3
...
dataset_root/video_N label_N
```
An example of train.csv of DFEW fold 1 (fd1) is shown as follows:

```
/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02522 5
/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02536 5
/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02578 6
```
Note that the label for the pre-training dataset (i.e., VoxCeleb2) is a dummy label; you can simply use 0 (see voxceleb2.py).
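The sketch below illustrates how such an annotation file could be generated by scanning a dataset root; the directory layout and label mapping are placeholders, so adapt it to your own data (the provided preprocess scripts, e.g. dfew.py, remain the reference):

```python
import os

def write_annotation(dataset_root, label_map, out_csv):
    """Write one "<path_to_video> <video_class>" line per clip, space-separated.

    dataset_root: folder containing one sub-folder (or video file) per clip
    label_map:    dict mapping clip name -> integer class (use 0 for pre-training data)
    """
    with open(out_csv, "w") as f:
        for name in sorted(os.listdir(dataset_root)):
            label = label_map.get(name, 0)               # dummy label 0 (e.g. for VoxCeleb2)
            f.write(f"{os.path.join(dataset_root, name)} {label}\n")

# Example (hypothetical paths/labels):
# write_annotation("dataset_root", {"video_1": 5, "video_2": 6}, "train.csv")
```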
- VoxCeleb2

```bash
python run_pretraining_with_yacs.py \
    --config configs/voxceleb2_pretrain.yaml \
    --output_dir output/voxceleb2_pretrain/
```
You can download our pre-trained model on VoxCeleb2 from here and put it into this folder. Put the pre-trained model at saved/model/pretraining/voxceleb2/videomae_pretrain_xxx for fine-tuning.
- DFEW

```bash
python run_finetuning_with_yacs.py \
    --config configs/dfew_finetune.yaml \
    --output_dir output/dfew_finetune/
```
- FERV39k
Dataset not available yet.
- MAFW

```bash
python run_finetuning_with_yacs.py \
    --config configs/mafw_finetune.yaml \
    --output_dir output/mafw_finetune/
```
Not available yet.
- Download the Gaze 360 dataset from Gaze 360 to the current folder.
- Run the preprocess/data_prepocessing_gaze360.py script to normalize the dataset and labels.
- Run preprocess/preprocess_gaze360.py to align the dataset labels; the aligned labels will be generated under saved/data/gaze360/.
For an emotion dataset, we need the following steps.

Step one: convert the csvs from video-level representation to frame-level representation (a hedged sketch is given after the examples below). The original csv is like this (an example is dfew_224):
```
dataset_root/video_1 label_1
dataset_root/video_2 label_2
...
```

After step one, it should be:

```
dataset_root/video_1/00001 label_1
dataset_root/video_1/00002 label_1
...
dataset_root/video_2/00001 label_2
...
```
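A minimal sketch of step one, assuming each video has already been extracted into a folder of frames named 00001.jpg, 00002.jpg, ... (the frame-naming pattern and column layout are inferred from the examples above):

```python
import os

def video_csv_to_frame_csv(video_csv, frame_csv):
    """Expand each "video_dir label" line into one "video_dir/frame_id label" line per frame."""
    with open(video_csv) as fin, open(frame_csv, "w") as fout:
        for line in fin:
            video_dir, label = line.strip().split()
            for frame in sorted(os.listdir(video_dir)):        # e.g. 00001.jpg, 00002.jpg, ...
                frame_id = os.path.splitext(frame)[0]          # keep the id, drop the extension
                fout.write(f"{os.path.join(video_dir, frame_id)} {label}\n")

# Example (hypothetical paths):
# video_csv_to_frame_csv("train.csv", "train_frames.csv")
```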
Step two: split the whole csv into several csvs by video (you can check this script; a hedged sketch is also given after the directory tree below). After step two, it should look like gaze360T:
```
- test
    - test_00000_0.csv
        dataset_root/video_1/00001 label_1
        dataset_root/video_1/00002 label_1
        ...
    - test_00000_1.csv
        dataset_root/video_1/00101 label_1
        dataset_root/video_1/00102 label_1
        ...
    ...
- train
    ...
```
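A minimal sketch of step two, assuming the per-video csvs are additionally chunked into pieces of 100 frames and named `<split>_<video-index>_<chunk-index>.csv` (both assumptions are inferred from the example tree above and may differ from the provided script):

```python
import os
from collections import defaultdict

def split_frame_csv(frame_csv, out_dir, split="test", chunk_size=100):
    """Split a frame-level csv into one csv per (video, chunk of `chunk_size` frames)."""
    os.makedirs(out_dir, exist_ok=True)
    per_video = defaultdict(list)
    with open(frame_csv) as f:
        for line in f:
            path = line.split()[0]
            per_video[os.path.dirname(path)].append(line)      # group lines by video folder
    for vid_idx, (_, lines) in enumerate(sorted(per_video.items())):
        for start in range(0, len(lines), chunk_size):
            name = f"{split}_{vid_idx:05d}_{start // chunk_size}.csv"
            with open(os.path.join(out_dir, name), "w") as out:
                out.writelines(lines[start:start + chunk_size])

# Example (hypothetical paths):
# split_frame_csv("test_frames.csv", "saved/data/dfew/test", split="test")
```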
Step three: run relbl_with_gaze.py to relabel the emotion datasets with gaze information (check the paths in the file first); you will then get a csv like this (a hedged sketch follows the example):
```
dataset_root/video_1 label_1 pitch yaw
dataset_root/video_2 label_2 pitch yaw
...
```
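For illustration, a minimal sketch of appending gaze angles to an emotion csv, assuming the pitch/yaw values come from a per-video dictionary of gaze predictions (the dictionary and its source are hypothetical; the provided relbl_with_gaze.py remains the reference):

```python
def append_gaze_labels(emotion_csv, gaze_by_video, out_csv):
    """Append "pitch yaw" to each "video label" line using per-video gaze predictions."""
    with open(emotion_csv) as fin, open(out_csv, "w") as fout:
        for line in fin:
            video, label = line.strip().split()
            pitch, yaw = gaze_by_video[video]        # hypothetical: {video_path: (pitch, yaw)}
            fout.write(f"{video} {label} {pitch} {yaw}\n")

# Example (hypothetical values):
# append_gaze_labels("train.csv", {"dataset_root/video_1": (0.1, -0.2)}, "train_gaze.csv")
```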
Repeat step two to split this csv into several ones by video. A final example is dfew_combine.
```bash
python run_finetuning_with_yacs.py --config configs/dfew_combine.yaml --output_dir output/dfew_combine/
```
You should note these configs:
```yaml
data:
  num_classes_cls: 7        # Number of classes for classification
  num_dim_reg: 2            # Number of dimensions for regression
training:
  combine_loss_alpha: 0.0   # Weight for classification loss in combined loss
```
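For intuition, here is a minimal sketch of how such a combined objective could look, assuming combine_loss_alpha weights the classification term and the remainder weights the gaze (pitch/yaw) regression term; the exact formulation in this repo may differ:

```python
import torch.nn.functional as F

def combined_loss(cls_logits, cls_target, reg_pred, reg_target, alpha=0.0):
    """alpha ~ combine_loss_alpha: weight on the classification term (assumed interpretation)."""
    loss_cls = F.cross_entropy(cls_logits, cls_target)    # 7-way expression classification
    loss_reg = F.mse_loss(reg_pred, reg_target)           # 2-dim (pitch, yaw) regression
    return alpha * loss_cls + (1.0 - alpha) * loss_reg

# Example shapes (hypothetical): cls_logits (B, 7), cls_target (B,), reg_pred/reg_target (B, 2)
```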
```bash
python run_finetuning_with_yacs.py \
    --config configs/gaze360_finetune.yaml \
    --output_dir output/gaze360_finetune/
```