Exam project for the Computer Vision course (Master's degree). The goal is emotion recognition from audio and video using Deep Learning models.

🎭 Emotion Recognition through Audio & Video Analysis

📄 Abstract

This project aims to predict the interlocutor's emotions through the combined analysis of audio and video signals. Using the RAVDESS dataset, our system recognizes 8 emotions:

  • Happiness
  • Surprise
  • Fear
  • Anger
  • Sadness
  • Neutrality
  • Calmness
  • Disgust

This technology offers potential applications in fields such as psychological therapy (continuous emotion monitoring), support for individuals with autism in recognizing others' emotions, and enhancing user experience in emotionally aware chatbots.


📚 Introduction

Detecting emotions from audio and video signals opens new avenues for interaction and monitoring in sensitive contexts. Our goal is to develop an ensemble system that integrates the following models:

  • Audio Model (CNN):
    Extracts and analyzes audio features using MFCCs to recognize emotions based on voice characteristics.

  • Video Models (LSTM and GRU):
    Analyze action units extracted from facial landmarks to capture the dynamics of facial expressions. The LSTM and GRU models complement each other, capturing long-term and short-term expression changes, respectively.


🎥 Dataset and Preprocessing

RAVDESS Dataset

  • Contents:
    • Videos, separate audio tracks, and combined audio-video files.
    • 1440 audio-video files representing emotions expressed strongly or moderately.
  • Facial Landmark Tracking:
    • CSV files contain coordinates of action units for each video, which are used to analyze facial expressions.

Preprocessing

  • Audio:
    • Extraction of MFCCs with normalization by subtracting the peak frequency.
  • Video:
    • Extraction and normalization of action units.
  • Data Augmentation:
    • Video: Mirroring about the y-axis (flipping the x-coordinates) to double the sample count.
    • Audio: Techniques such as semitone shifting, adding noise, and volume variation were applied; they initially increased CNN accuracy by 14%, but this augmentation was discarded in the final ensemble (a minimal preprocessing sketch follows this list).
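As a concrete illustration of this preprocessing step, here is a minimal sketch of MFCC extraction with peak-subtraction normalization, the landmark mirroring used for video augmentation, and the audio augmentations that were ultimately discarded. It assumes `librosa` and `numpy`; the sampling rate, number of MFCC coefficients, noise level, and coordinate convention for mirroring are illustrative assumptions, not values taken from the project.

```python
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40):
    """Compute MFCCs for an audio clip and normalize by subtracting the peak value."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc - mfcc.max()                                # peak-subtraction normalization

def flip_landmarks_x(x_coords, frame_width=1.0):
    """Mirror landmark x-coordinates about the vertical axis (assumes coordinates in [0, frame_width])."""
    return frame_width - np.asarray(x_coords)

def augment_audio(y, sr=22050):
    """Audio augmentations tried during development: semitone shift, additive noise, volume change."""
    shifted = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=1)  # shift up by one semitone
    noisy = y + 0.005 * np.random.randn(len(y))                   # light additive Gaussian noise
    louder = 1.2 * y                                              # simple volume variation
    return shifted, noisy, louder
```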

🧠 Models and Architectures

Audio Model – CNN

  • Architecture:
    • 4 2D convolutional layers (e.g., 32, 64, 128, 256 filters with 3×3 kernels, ReLU activation)
    • 4 fully connected layers
    • Dropout and batch normalization applied to combat overfitting
  • Input features: MFCCs
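
A minimal sketch of the CNN described above, written in PyTorch. The framework choice, pooling layout, dropout rate, and fully connected layer sizes are assumptions for illustration rather than the project's exact configuration.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Sketch of the audio CNN: 4 conv blocks (32 → 64 → 128 → 256 filters, 3×3 kernels, ReLU),
    batch normalization and dropout for regularization, followed by 4 fully connected layers."""
    def __init__(self, n_classes=8, dropout=0.3):
        super().__init__()
        chans = [1, 32, 64, 128, 256]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Dropout(dropout),
            ]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),          # logits; softmax is applied later (ensemble stage)
        )

    def forward(self, x):                      # x: (batch, 1, n_mfcc, n_frames)
        return self.classifier(self.features(x))
```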

Video Models – LSTM & GRU

  • LSTM:
    • 2 LSTM layers with 128 units each
    • 2 fully connected layers
    • Capable of learning long-term dynamics of facial expressions
  • GRU:
    • 1D convolutional layer for initial feature extraction
    • 1 GRU layer with 64 units
    • 2 fully connected layers
    • Effective in avoiding overfitting and capturing short-term expression variations
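
Minimal sketches of the two video branches, under the same PyTorch assumption. The number of action-unit features per frame and the fully connected layer sizes are placeholders, not values taken from the project.

```python
import torch
import torch.nn as nn

class VideoLSTM(nn.Module):
    """Sketch of the LSTM branch: 2 stacked LSTM layers (128 units) over per-frame
    action-unit vectors, followed by 2 fully connected layers."""
    def __init__(self, n_features=35, n_classes=8):      # n_features: assumed AU count per frame
        super().__init__()
        self.lstm = nn.LSTM(n_features, 128, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, x):                                 # x: (batch, frames, n_features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                        # last time step → logits

class VideoGRU(nn.Module):
    """Sketch of the GRU branch: a 1D convolution for initial feature extraction,
    one GRU layer (64 units), then 2 fully connected layers."""
    def __init__(self, n_features=35, n_classes=8):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_features, 64, kernel_size=3, padding=1), nn.ReLU())
        self.gru = nn.GRU(64, 64, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, x):                                 # x: (batch, frames, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over time, back to (B, T, 64)
        out, _ = self.gru(h)
        return self.fc(out[:, -1])
```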

⚙️ Ensemble Method

  • Stacking:

    • The outputs of each model are taken before the softmax (logits), giving 8 values per model, 24 values in total.
    • The stacking model comprises 2 fully connected layers with ReLU activation and outputs the final prediction via softmax (see the sketch after this list).
  • Experiments:

    • A double ensemble approach (first combining LSTM and GRU, then adding the CNN) was tested but led to a decrease in performance due to reduced sample sizes for the third model. The final configuration integrates all 3 models into a single ensemble, achieving 92.2% accuracy on the test set.
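
A minimal sketch of the stacking head, again assuming PyTorch; the hidden-layer width is illustrative.

```python
import torch
import torch.nn as nn

class StackingHead(nn.Module):
    """Sketch of the stacking meta-model: concatenates the 8 pre-softmax outputs of the
    three base models (24 values) and maps them through 2 fully connected layers."""
    def __init__(self, n_models=3, n_classes=8, hidden=32):   # hidden size is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_models * n_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, cnn_logits, lstm_logits, gru_logits):
        stacked = torch.cat([cnn_logits, lstm_logits, gru_logits], dim=1)  # (batch, 24)
        return torch.softmax(self.net(stacked), dim=1)                     # final class probabilities
```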

📊 Results and Metrics

  • Ensemble Accuracy: 92.2%
  • Additional metrics such as precision, recall, and F1-score were also evaluated to assess the contribution of each model, highlighting a beneficial synergy in the ensemble phase.
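
For reference, per-class precision, recall, and F1-score can be computed with scikit-learn as sketched below; the random labels are only there to make the snippet runnable, and the project's actual evaluation code may differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

EMOTIONS = ["happiness", "surprise", "fear", "anger",
            "sadness", "neutrality", "calmness", "disgust"]

def report(y_true, y_pred):
    """Print accuracy plus per-class precision, recall, and F1 for the 8 RAVDESS emotions."""
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, labels=list(range(8)), target_names=EMOTIONS))

# toy example with random labels, just to show the call signature
report(np.random.randint(0, 8, size=100), np.random.randint(0, 8, size=100))
```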

🛠️ Environment and Dependencies


👥 Team

Developers:

Francesco D'Aprile, Sara Lazzaroni

📝 Conclusions

The proposed system demonstrates how a well-designed ensemble can improve performance over individual models, achieving an accuracy of 92.2% in emotion prediction. The careful preprocessing, feature selection, and innovative stacking approach form the backbone of this system, paving the way for applications in therapy, cognitive support, and human-machine interaction.

For further details, please refer to the project notebook.
