This project aims to predict the interlocutor's emotions through the combined analysis of audio and video signals. Using the RAVDESS dataset, our system recognizes 8 emotions:
- Happiness
- Surprise
- Fear
- Anger
- Sadness
- Neutrality
- Calmness
- Disgust
This technology offers potential applications in fields such as psychological therapy (continuous emotion monitoring), support for individuals with autism in recognizing others' emotions, and enhancing user experience in emotionally aware chatbots.
Detecting emotions from audio and video signals opens new avenues for interaction and monitoring in sensitive contexts. Our goal is to develop an ensemble system that integrates the following models:
- Audio Model (CNN): extracts and analyzes MFCC audio features to recognize emotions from voice characteristics.
- Video Models (LSTM and GRU): analyze action units extracted from facial landmarks to capture the dynamics of facial expressions. The LSTM and GRU models complement each other by capturing long-term and short-term expression changes, respectively.
- Dataset Contents (RAVDESS):
- Videos, separate audio tracks, and combined audio-video files.
- 1440 audio-video files with emotions expressed at strong or moderate intensity.
- Facial Landmark Tracking:
- CSV files containing the coordinates of action units for each video, used to analyze facial expressions (a loading sketch follows this list).
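A minimal sketch of how such per-video CSVs could be loaded, assuming one row per frame and action-unit columns whose names start with "AU"; the column naming and the use of pandas are assumptions, not taken from the project:

```python
import numpy as np
import pandas as pd

def load_action_units(csv_path):
    """Load the per-frame action-unit values tracked for one video.

    Assumes one row per frame and column names starting with "AU"
    (e.g. OpenFace-style AU01_r, AU02_r, ...); adapt to the real CSV layout.
    """
    df = pd.read_csv(csv_path)
    au_columns = [c for c in df.columns if c.strip().startswith("AU")]
    # Result shape: (n_frames, n_action_units), ready to feed the video models.
    return df[au_columns].to_numpy(dtype=np.float32)
```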
- Audio:
- Extraction of MFCCs with normalization by subtracting the peak frequency (see the sketch after this list).
- Video:
- Extraction and normalization of action units.
- Data Augmentation:
- Video: flipping along the y-axis (mirroring the x-coordinates) to double the sample count.
- Audio: semitone shifting, added noise, and volume variation were applied, initially increasing CNN accuracy by 14%; however, this approach was discarded from the final ensemble.
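A minimal preprocessing and augmentation sketch with Librosa and NumPy. The sampling rate, number of MFCCs, augmentation parameters, and the reading of "subtracting the peak" as subtracting the maximum coefficient are all assumptions, not values taken from the project:

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40):
    """MFCC extraction; sr and n_mfcc are assumed values."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # "Normalization by subtracting the peak", read here as removing the maximum value.
    return mfcc - mfcc.max()

def augment_audio(y, sr, n_steps=2, noise_level=0.005, gain=1.2):
    """Audio augmentation (discarded in the final ensemble): pitch shift, noise, volume."""
    shifted = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)  # semitone shift
    noisy = y + noise_level * np.random.randn(len(y))                   # additive noise
    louder = np.clip(gain * y, -1.0, 1.0)                               # volume variation
    return shifted, noisy, louder

def flip_landmarks_x(landmarks, frame_width=1.0):
    """Video augmentation: mirror the x-coordinates to double the sample count.

    landmarks: array of shape (..., 2) holding (x, y) pairs; frame_width is 1.0
    for normalized coordinates (an assumption).
    """
    flipped = landmarks.copy()
    flipped[..., 0] = frame_width - flipped[..., 0]
    return flipped
```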
- Audio Model (CNN) Architecture:
- 4 2D convolutional layers (e.g., 32, 64, 128, 256 filters with 3×3 kernels, ReLU activation)
- 4 fully connected layers
- Dropout and batch normalization applied to combat overfitting
- Input feature: MFCC (a Keras sketch of this architecture follows)
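A minimal Keras sketch of such a CNN, assuming the TensorFlow/Keras option listed below. Only the layer counts, filter sizes, and kernel size come from the list above; the input shape, pooling layers, dropout rate, and dense-layer widths are assumptions:

```python
from tensorflow.keras import layers, models

def build_audio_cnn(input_shape=(40, 174, 1), n_classes=8):
    """Audio CNN on MFCC "images"; input_shape = (n_mfcc, time_frames, 1) is hypothetical."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128, 256):                        # 4 convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    for units in (512, 256, 128):                             # first 3 fully connected layers
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))  # 4th dense layer: 8 emotions
    return model
```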
- LSTM:
- 2 LSTM layers with 128 units each
- 2 fully connected layers
- Capable of learning long-term dynamics of facial expressions
- GRU:
- 1D convolutional layer for initial feature extraction
- 1 GRU layer with 64 units
- 2 fully connected layers
- Effective at avoiding overfitting and at capturing short-term expression variations (both video models are sketched below)
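Minimal Keras sketches of the two video models, taking per-frame action-unit sequences as input. The recurrent-layer counts and unit sizes follow the lists above; the feature dimension, dense-layer widths, and the Conv1D filter count and kernel size are assumptions:

```python
from tensorflow.keras import layers, models

def build_video_lstm(n_features=35, n_classes=8):
    """2 LSTM layers with 128 units, then 2 dense layers; n_features is hypothetical."""
    return models.Sequential([
        layers.Input(shape=(None, n_features)),   # variable-length action-unit sequences
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

def build_video_gru(n_features=35, n_classes=8):
    """Conv1D front end, 1 GRU layer with 64 units, then 2 dense layers."""
    return models.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GRU(64),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```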
- Stacking:
- The outputs of each model are taken prior to the softmax layer, yielding 8 pre-softmax values per model (24 values in total); see the sketch after this list.
- The stacking model comprises 2 fully connected layers with ReLU activation, followed by a softmax that produces the final prediction.
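A minimal Keras sketch of the stacking head. Concatenating the three 8-value outputs into a 24-value feature vector follows the description above; the hidden-layer widths, and the exact layout of two ReLU layers plus a softmax output, are one possible reading and therefore assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_stacker(n_models=3, n_classes=8):
    """2 dense ReLU layers + softmax over the concatenated pre-softmax outputs."""
    return models.Sequential([
        layers.Input(shape=(n_models * n_classes,)),   # 3 models x 8 values = 24
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

# Usage example (hypothetical variable names): cnn_out, lstm_out, gru_out are the
# pre-softmax outputs of the three base models, each of shape (n_samples, 8).
# stacked = np.concatenate([cnn_out, lstm_out, gru_out], axis=1)   # shape (n_samples, 24)
# final_probs = build_stacker().predict(stacked)
```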
- Experiments:
- A two-stage ensemble (first combining LSTM and GRU, then adding the CNN) was tested but decreased performance, due to the reduced sample size available for the third model. The final configuration integrates all 3 models into a single ensemble.
- Ensemble accuracy: 92.2% on the test set.
- Precision, recall, and F1-score were also evaluated to assess the contribution of each model, highlighting a beneficial synergy in the ensemble phase (a metrics sketch follows this list).
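A minimal evaluation sketch with scikit-learn; the choice of macro averaging over the 8 classes is an assumption, not stated in the project:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 over the 8 emotion classes."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```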
- Programming Language: Python
- Main Libraries:
- TensorFlow/Keras (or PyTorch, depending on the implementation)
- scikit-learn for Random Forest and feature engineering
- Librosa for MFCC extraction
- NumPy
Developers:
Francesco D'Aprile, Sara Lazzaroni
The proposed system demonstrates how a well-designed ensemble can improve performance over individual models, achieving an accuracy of 92.2% in emotion prediction. The careful preprocessing, feature selection, and innovative stacking approach form the backbone of this system, paving the way for applications in therapy, cognitive support, and human-machine interaction.
For further details, please refer to the project notebook.