Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environments - Submitted to EUSIPCO 2025
🔥 Official PyTorch Lightning code repository for the paper:
"Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environments"
This work is an extension of our previous paper "Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer
Architecture", incorporating multi-modal learning to further improve robustness.
📄 Read the Paper
Human emotions are conveyed through speech and facial expressions, making multi-modal emotion recognition (MER) crucial for robust emotion classification. This work introduces a multi-microphone and multi-modal system combining:
- HTS-AT Transformer for multi-channel audio processing
- R(2+1)D ResNet CNN for video-based emotion recognition
- Late fusion concatenation to combine audio & video features
✔️ Multi-Microphone Audio Processing: Robust against reverberation
✔️ Multi-Modal Learning: Combines speech and facial cues
✔️ Tested on RAVDESS convolved with Real-World RIRs (ACE Database)
✔️ Pretrained Models Available for Fine-Tuning and testing
Our approach consists of three main components:
- HTS-AT Transformer (Audio Modality):
  - Processes multi-channel mel-spectrograms
  - Uses Patch-Embed Summation & Averaging strategies
  - Extracts deep features for robust emotion classification
- R(2+1)D CNN (Video Modality):
  - Extracts spatiotemporal features from facial expressions
  - Pretrained on the Kinetics dataset, fine-tuned for MER
- Feature Fusion & Classification:
  - Late fusion via concatenation of the extracted embeddings
  - Fully connected layers for final emotion classification
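The late-fusion step above can be sketched as follows. This is a minimal NumPy illustration only: the embedding sizes, the random classifier weights, and the single fully connected layer are illustrative assumptions, not the paper's exact dimensions or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embedding sizes (assumptions); RAVDESS has 8 emotion classes
audio_dim, video_dim, n_classes = 768, 512, 8

audio_emb = rng.standard_normal(audio_dim)  # HTS-AT audio embedding (placeholder)
video_emb = rng.standard_normal(video_dim)  # R(2+1)D video embedding (placeholder)

# Late fusion: concatenate the two modality embeddings into one vector
fused = np.concatenate([audio_emb, video_emb])  # shape (1280,)

# Fully connected classification head (random weights, for illustration only)
W = rng.standard_normal((n_classes, fused.size)) * 0.01
b = np.zeros(n_classes)
logits = W @ fused + b

# Softmax over the emotion classes
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(fused.shape, probs.shape)
```

In the actual system the fused vector feeds trained fully connected layers; the point here is only the concatenation-then-classify structure of late fusion.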
| Multi-channel Multi-modal Architecture | Multi-Microphone Audio | Video-Based Recognition |
|---|---|---|
| ![]() | ![]() | ![]() |
The simplest way to run the code is inside a Docker container. Pull the image:

```shell
docker pull ohadico97/mer:v1
```

Once you have the image, start a container:

```shell
docker run -it --gpus all --shm-size 20G -v $HOME:$HOME --name <container name> ohadico97/mer:v1
```

Then clone the repository:

```shell
git clone https://github.com/OhadCohen97/Multi-Microphone-Multi-Modal-Emotion-Recognition-in-Reverberant-Environments.git
cd Multi-Microphone-Multi-Modal-Emotion-Recognition-in-Reverberant-Environments
```
Python version: 3.8.13 ('base')
The previous paper evaluated the models on three datasets: RAVDESS, IEMOCAP, and CREMA-D. In this work, we focus exclusively on the RAVDESS dataset. The training and validation splits are reverberated synthetically using the `gpuRIR` Python library, while the test sets are reverberated with real-world ACE RIRs recorded in various acoustic environments. You can choose which modality to fine-tune using the `multimodal` flag in `config.py`. Note that the earlier work fine-tuned on different sample counts and splits.
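The reverberation step itself amounts to convolving clean speech with a room impulse response. A minimal NumPy sketch of that operation is below; the sample rate, the toy exponentially decaying RIR, and the random "speech" signal are placeholders — in practice the RIRs come from `gpuRIR` (train/val) or the ACE corpus (test).

```python
import numpy as np

fs = 16000  # assumed sample rate

# Placeholder clean "speech" (1 s of noise) and a toy 0.3 s decaying RIR
speech = np.random.default_rng(0).standard_normal(fs)
t = np.arange(int(0.3 * fs)) / fs
rir = np.exp(-t / 0.1) * np.random.default_rng(1).standard_normal(t.size)
rir /= np.abs(rir).max()  # normalize the RIR peak

# Reverberant signal = clean speech convolved with the RIR
reverberant = np.convolve(speech, rir)
print(speech.size, rir.size, reverberant.size)
```

The full convolution is `len(speech) + len(rir) - 1` samples long; pipelines typically trim it back to the original signal length before feature extraction.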
Place the datasets inside the `data/` folder:

```
MER/
└── data/
    ├── <dataset name & type>/
    │   ├── train.npy
    │   ├── val.npy
    │   └── test.npy
    └── ACE/
        └── <dataset name>/
            └── MP1/
                ├── Lecture_room_1_508/
                │   ├── train.npy
                │   ├── val.npy
                │   └── test.npy
                ├── Lecture_room_2_403a/
                │   ├── train.npy
                │   ├── val.npy
                │   └── test.npy
                ├── lobby/
                │   └── ...
                ├── Meeting_room_2_611/
                ├── Office_1_502/
                └── Office_2_803/
```
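Each split is stored as a `.npy` file. A minimal round-trip sketch of saving and loading one split is shown below; the array shape (samples × channels × audio samples) and dtype are illustrative assumptions — inspect the preprocessing notebooks for the actual layout.

```python
import tempfile
from pathlib import Path

import numpy as np

# Write a dummy split to show the round trip (shape is illustrative:
# 4 samples, 5 microphone channels, 16000 audio samples each)
root = Path(tempfile.mkdtemp())
split_path = root / "train.npy"
np.save(split_path, np.zeros((4, 5, 16000), dtype=np.float32))

# Load it back the way a data loader would
train = np.load(split_path)
print(train.shape, train.dtype)
```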
To preprocess the data, see the provided notebooks.
First, in `config.py`, make sure you have the HTS-AT AudioSet pre-trained model and set the path to it. Then set the root path of the dataset you wish to use.
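A hypothetical `config.py` fragment illustrating these settings is below. The `multimodal` flag is mentioned in this README; the other variable names and paths are assumptions for illustration — check the actual `config.py` for the real field names.

```python
# Hypothetical config values -- names other than `multimodal` are illustrative
htsat_pretrain_path = "pretrained/htsat_audioset.ckpt"  # HTS-AT AudioSet checkpoint
dataset_root = "data/RAVDESS_reverb"                    # root of the dataset to use
multimodal = True                                       # fine-tune audio + video jointly
```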
To train:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 main.py train
```
To evaluate the models, set the paths to the npy files in `main.py` under `test_ace` (lines 219, 236, 253):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 main.py test_ace
```
For a single test run on one npy file:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 main.py test
```
If you use this work, please cite:

```
@article{cohen2024multi,
  title={Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment},
  author={Cohen, Ohad and Hazan, Gershon and Gannot, Sharon},
  journal={arXiv preprint arXiv:2409.09545},
  year={2024}
}
```
This research was supported by the European Union’s Horizon 2020 Program and the Audition Project, Data Science Program, Israel.
For questions or collaborations, feel free to reach out:
📧 ohad.cohen@biu.ac.il