This repository contains the implementation of Social-MAE proposed in our FG 2024 paper.
Social-MAE is a pre-trained audiovisual Masked Autoencoder that extends the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE). Specifically, we modify CAV-MAE to accept a larger number of input frames and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results on apparent personality estimation.
Please cite the following paper:

```bibtex
@inproceedings{bohy2024social,
  title={Social-MAE: A Transformer-based Multimodal Autoencoder for Face and Voice},
  author={Bohy, Hugo and Tran, Minh and El Haddad, Kevin and Dutoit, Thierry and Soleymani, Mohammad},
  booktitle={2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)},
  pages={1--5},
  year={2024},
  organization={IEEE}
}
```
The Social-MAE models are defined in `src/models/social_mae.py` and include two variants:

- `SOCIALMAE`: designed for pretraining and includes a decoder component. It learns robust representations from paired audio and image data, which can be used to initialize models for downstream tasks.
- `SOCIALMAEFT`: optimized for finetuning on specific tasks and does not include a decoder. It builds on the pretrained representations to perform task-specific adjustments.

Both models require inputs comprising pairs of audio and image data:
- Audio input: shape `[batch_size, length_in_frame, mel_bin_num]`, where `batch_size` is the number of audio samples in a batch, `length_in_frame` is the duration of the audio in frames, and `mel_bin_num` is the number of Mel frequency bins. Example shape: `[batch_size, 1024, 128]`.
- Image input: shape `[batch_size, channel_num, height, width, duration]`, where `batch_size` is the number of videos in a batch, `channel_num` is the number of color channels, `height` and `width` are the spatial dimensions of each frame (commonly 224 x 224), and `duration` is the number of frames in the sequence (8 here, for temporal data). Example shape: `[batch_size, 3, 224, 224, 8]`. A minimal forward-pass sketch follows this list.
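As a quick sanity check of these shapes, the snippet below instantiates the finetuning variant and runs a forward pass on random tensors. This is a minimal sketch: the import path, the constructor argument (`label_dim`), and the forward signature are assumptions based on this README and on CAV-MAE's interface, so check `src/models/social_mae.py` for the actual API.

```python
import torch

# Hypothetical import and constructor; see src/models/social_mae.py
# for the actual class name and arguments.
from src.models.social_mae import SOCIALMAEFT

model = SOCIALMAEFT(label_dim=6)  # e.g., 6 emotion classes (assumption)
model.eval()

audio = torch.randn(2, 1024, 128)       # [batch_size, length_in_frame, mel_bin_num]
video = torch.randn(2, 3, 224, 224, 8)  # [batch_size, channel_num, height, width, duration]

with torch.no_grad():
    # The real forward may take extra arguments (e.g., a mode flag, as in CAV-MAE).
    out = model(audio, video)
print(out.shape)  # expected: [2, label_dim]
```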
Before running any experiments, ensure that you have installed all the required packages for this project (preferably in a separate virtual environment):

```bash
pip install -r requirements.txt
```
To pretrain the model on your audio and video data, run:

```bash
python src/main_pretrain.py --data-video path/to/video/data --data-audio path/to/audio/data --dataset MyDataset
```
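For context, Social-MAE follows CAV-MAE's pretraining objective: a masked-reconstruction loss combined with an audio-visual contrastive loss weighted by `Lambda_c` (0.01 in the checkpoint table below). The sketch below illustrates how such a combined objective is typically assembled; the function and tensor names are illustrative, not the repository's actual code.

```python
import torch
import torch.nn.functional as F

def combined_pretrain_loss(recon_loss, audio_emb, video_emb,
                           lambda_c=0.01, temperature=0.05):
    """Illustrative CAV-MAE-style objective: masked reconstruction plus a
    symmetric audio-visual contrastive term weighted by lambda_c."""
    a = F.normalize(audio_emb, dim=-1)  # [batch, dim] pooled audio embeddings
    v = F.normalize(video_emb, dim=-1)  # [batch, dim] pooled video embeddings
    logits = a @ v.t() / temperature    # pairwise audio-video similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    return recon_loss + lambda_c * contrastive
```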
To finetune the model on your audio and video data, you will need a custom dataloader (a sketch is given below) and a pretrained checkpoint. Run:

```bash
python src/main_ft.py --data-video path/to/video/data --data-audio path/to/audio/data --dataset MyDataset --pretrained-checkpoint path/to/checkpoint.pth
```
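The finetuning script expects the dataloader to yield audio/video pairs in the shapes described above. Below is a minimal sketch of such a dataset; the file layout, the fbank extraction (Kaldi-style, as in CAV-MAE), and the label handling are assumptions you will need to adapt to your data.

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class MyDataset(Dataset):
    """Illustrative dataset returning (audio, video, label) triples;
    paths and preprocessing are placeholders."""

    def __init__(self, samples):
        self.samples = samples  # list of (wav_path, frames_path, label) tuples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        wav_path, frames_path, label = self.samples[idx]
        waveform, sr = torchaudio.load(wav_path)
        # 128 Mel bins to match the audio input spec above.
        fbank = torchaudio.compliance.kaldi.fbank(
            waveform, sample_frequency=sr, num_mel_bins=128)
        fbank = fbank[:1024]  # crop (pad if shorter) to 1024 frames
        # Assumed pre-extracted frame tensor of shape [3, 224, 224, 8].
        video = torch.load(frames_path)
        return fbank, video, label
```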
To run inference only:

```bash
python src/main_ft.py --data-video path/to/video/data --data-audio path/to/audio/data --dataset MyDataset --pretrained-checkpoint path/to/checkpoint.pth --no-training
```
Pretrained checkpoints are available for the following configurations:
| Model Name | Pretrain Data | Finetuned Data | Batch Size | Lambda_c | Masking Ratio | FPS | Duration (s) |
|---|---|---|---|---|---|---|---|
| Social-MAE | VoxCeleb2 | n.a. | 36 | 0.01 | 75% Unstructured | 2 | 4 |
| Social-MAE-FT | VoxCeleb2 | CREMA-D | 12 | 0.01 | 75% Unstructured | 2 | 4 |
| Social-MAE-FT | VoxCeleb2 | ChaLearn FI | 8 | 0.01 | 75% Unstructured | 2 | 4 |
| Social-MAE-FT | VoxCeleb2 | NDC-ME | 8 | 0.01 | 75% Unstructured | 2 | 4 |
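To initialize a model from one of these checkpoints outside the provided scripts, a loading pattern like the one below usually works. The key layout of the checkpoint (e.g., a `module.` prefix left by `nn.DataParallel`) is an assumption; inspect the file if loading fails.

```python
import torch
from src.models.social_mae import SOCIALMAEFT  # hypothetical import path

model = SOCIALMAEFT(label_dim=6)  # label_dim is task-specific (assumption)
state = torch.load("path/to/checkpoint.pth", map_location="cpu")
# Checkpoints saved through nn.DataParallel often prefix keys with "module.".
state = {k.replace("module.", "", 1): v for k, v in state.items()}
missing, unexpected = model.load_state_dict(state, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```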
If you have a question, please open an issue (preferred) or send me an email at hugo.bohy@umons.ac.be.