This project aims to predict the interlocutor's emotions through the combined analysis of audio and video signals. Using the RAVDESS dataset, our system recognizes 8 emotions:
- Happiness
- Surprise
- Fear
- Anger
- Sadness
- Neutrality
- Calmness
- Disgust
This technology offers potential applications in fields such as psychological therapy (continuous emotion monitoring), support for individuals with autism in recognizing others' emotions, and enhancing user experience in emotionally aware chatbots.
Detecting emotions from audio and video signals opens new avenues for interaction and monitoring in sensitive contexts. Our goal is to develop an ensemble system that integrates the following models:
- Audio Model (CNN): extracts and analyzes MFCC audio features to recognize emotions from voice characteristics.
- Video Models (LSTM and GRU): analyze action units extracted from facial landmarks to capture the dynamics of facial expressions. The LSTM and GRU models complement each other by capturing long-term and short-term expression changes, respectively.
- Dataset Contents (RAVDESS):
- Videos, separate audio tracks, and combined audio-video files.
- 1440 audio-video files with emotions expressed at strong or moderate intensity.
- Facial Landmark Tracking:
- CSV files containing the coordinates of action units for each video, used to analyze facial expressions (a loading sketch follows this list).
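A minimal sketch of how such per-video CSVs could be loaded, assuming one row per frame and action-unit columns whose names start with "AU"; the column naming and the use of pandas are assumptions, not taken from the project:

```python
import numpy as np
import pandas as pd

def load_action_units(csv_path):
    """Load the per-frame action-unit values tracked for one video.

    Assumes one row per frame and column names starting with "AU"
    (e.g. OpenFace-style AU01_r, AU02_r, ...); adapt to the real CSV layout.
    """
    df = pd.read_csv(csv_path)
    au_columns = [c for c in df.columns if c.strip().startswith("AU")]
    # Result shape: (n_frames, n_action_units), ready to feed the video models.
    return df[au_columns].to_numpy(dtype=np.float32)
```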
- Audio:
- Extraction of MFCCs with normalization by subtracting the peak frequency (see the sketch after this list).
- Video:
- Extraction and normalization of action units.
- Data Augmentation:
- Video: flipping along the y-axis (mirroring the x-coordinates) to double the sample count.
- Audio: semitone shifting, added noise, and volume variation were applied, initially increasing CNN accuracy by 14%; however, this approach was discarded from the final ensemble.
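A minimal preprocessing and augmentation sketch with Librosa and NumPy. The sampling rate, number of MFCCs, augmentation parameters, and the reading of "subtracting the peak" as subtracting the maximum coefficient are all assumptions, not values taken from the project:

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40):
    """MFCC extraction; sr and n_mfcc are assumed values."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # "Normalization by subtracting the peak", read here as removing the maximum value.
    return mfcc - mfcc.max()

def augment_audio(y, sr, n_steps=2, noise_level=0.005, gain=1.2):
    """Audio augmentation (discarded in the final ensemble): pitch shift, noise, volume."""
    shifted = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)  # semitone shift
    noisy = y + noise_level * np.random.randn(len(y))                   # additive noise
    louder = np.clip(gain * y, -1.0, 1.0)                               # volume variation
    return shifted, noisy, louder

def flip_landmarks_x(landmarks, frame_width=1.0):
    """Video augmentation: mirror the x-coordinates to double the sample count.

    landmarks: array of shape (..., 2) holding (x, y) pairs; frame_width is 1.0
    for normalized coordinates (an assumption).
    """
    flipped = landmarks.copy()
    flipped[..., 0] = frame_width - flipped[..., 0]
    return flipped
```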
- Audio Model (CNN) Architecture:
- 4 2D convolutional layers (e.g., 32, 64, 128, 256 filters with 3×3 kernels, ReLU activation)
- 4 fully connected layers
- Dropout and batch normalization applied to combat overfitting
- Input feature: MFCC (a Keras sketch of this architecture follows)
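A minimal Keras sketch of such a CNN, assuming the TensorFlow/Keras option listed below. Only the layer counts, filter sizes, and kernel size come from the list above; the input shape, pooling layers, dropout rate, and dense-layer widths are assumptions:

```python
from tensorflow.keras import layers, models

def build_audio_cnn(input_shape=(40, 174, 1), n_classes=8):
    """Audio CNN on MFCC "images"; input_shape = (n_mfcc, time_frames, 1) is hypothetical."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128, 256):                        # 4 convolutional blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    for units in (512, 256, 128):                             # first 3 fully connected layers
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))  # 4th dense layer: 8 emotions
    return model
```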
- LSTM:
- 2 LSTM layers with 128 units each
- 2 fully connected layers
- Capable of learning long-term dynamics of facial expressions
- GRU:
- 1D convolutional layer for initial feature extraction
- 1 GRU layer with 64 units
- 2 fully connected layers
- Effective at avoiding overfitting and at capturing short-term expression variations (both video models are sketched below)
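Minimal Keras sketches of the two video models, taking per-frame action-unit sequences as input. The recurrent-layer counts and unit sizes follow the lists above; the feature dimension, dense-layer widths, and the Conv1D filter count and kernel size are assumptions:

```python
from tensorflow.keras import layers, models

def build_video_lstm(n_features=35, n_classes=8):
    """2 LSTM layers with 128 units, then 2 dense layers; n_features is hypothetical."""
    return models.Sequential([
        layers.Input(shape=(None, n_features)),   # variable-length action-unit sequences
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

def build_video_gru(n_features=35, n_classes=8):
    """Conv1D front end, 1 GRU layer with 64 units, then 2 dense layers."""
    return models.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GRU(64),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```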
- Stacking:
- The outputs of each model are taken prior to the softmax layer, yielding 8 pre-softmax values per model (24 values in total); see the sketch after this list.
- The stacking model comprises 2 fully connected layers with ReLU activation, followed by a softmax that produces the final prediction.
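A minimal Keras sketch of the stacking head. Concatenating the three 8-value outputs into a 24-value feature vector follows the description above; the hidden-layer widths, and the exact layout of two ReLU layers plus a softmax output, are one possible reading and therefore assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_stacker(n_models=3, n_classes=8):
    """2 dense ReLU layers + softmax over the concatenated pre-softmax outputs."""
    return models.Sequential([
        layers.Input(shape=(n_models * n_classes,)),   # 3 models x 8 values = 24
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

# Usage example (hypothetical variable names): cnn_out, lstm_out, gru_out are the
# pre-softmax outputs of the three base models, each of shape (n_samples, 8).
# stacked = np.concatenate([cnn_out, lstm_out, gru_out], axis=1)   # shape (n_samples, 24)
# final_probs = build_stacker().predict(stacked)
```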
- Experiments:
- A two-stage ensemble (first combining LSTM and GRU, then adding the CNN) was tested but decreased performance, due to the reduced sample size available for the third model. The final configuration integrates all 3 models into a single ensemble.
- Ensemble accuracy: 92.2% on the test set.
- Precision, recall, and F1-score were also evaluated to assess the contribution of each model, highlighting a beneficial synergy in the ensemble phase (a metrics sketch follows this list).
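A minimal evaluation sketch with scikit-learn; the choice of macro averaging over the 8 classes is an assumption, not stated in the project:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 over the 8 emotion classes."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```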
- Programming Language: Python
- Main Libraries:
- TensorFlow/Keras (or PyTorch, depending on the implementation)
- scikit-learn for Random Forest and feature engineering
- Librosa for MFCC extraction
- NumPy
Developers:
Francesco D'Aprile, Sara Lazzaroni
The proposed system demonstrates how a well-designed ensemble can improve performance over individual models, achieving an accuracy of 92.2% in emotion prediction. The careful preprocessing, feature selection, and innovative stacking approach form the backbone of this system, paving the way for applications in therapy, cognitive support, and human-machine interaction.
For further details, please refer to the project notebook.