This paper proposes AERSUT (Automatic Emotion Recognition System Using TinyML), a TinyML-based system that detects and analyzes emotions from speech signals. The system classifies emotions into eight categories: surprise, neutral, disgust, fear, sad, calm, happy, and anger. The implementation focuses on running efficiently on resource-constrained devices while maintaining high accuracy.
- Multi-emotion Classification: Detects 8 distinct emotional states from speech
- Advanced Feature Extraction: Utilizes MFCCs, Mel-spectrograms, Zero Crossing Rate, and RMS Energy
- Data Augmentation: Implements noise addition, time stretching, and signal shifting for robust training
- TinyML Integration: Optimized for deployment on resource-constrained devices
- Dual-Model Approach: Implements both CNN and CNN-LSTM architectures for performance comparison
- High Accuracy: Achieves up to 72% test accuracy on the combined dataset
The system uses a combination of two benchmark datasets, totaling 8,882 audio samples (expanded to roughly 20,000 training examples through augmentation):
RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song):
- Size: 1,440 speech samples
- Actors: 24 professional actors (12 male, 12 female)
- Emotions: 8 emotional states (neutral, calm, happy, sad, angry, fear, disgust, surprise)
- Format: 16-bit, 48kHz .wav files
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset):
- Size: 7,442 audio clips
- Actors: 91 actors (48 male, 43 female)
- Diversity: Ages 20-74, various races and ethnicities
- Emotions: 6 emotional states (angry, disgust, fear, happy, neutral, sad)
To enhance model robustness, the following augmentation techniques were applied (a code sketch follows the list):
- Noise Injection: Adding Gaussian noise to audio signals
- Time Stretching: ±20% variation in speech rate
- Pitch Shifting: Modulating pitch by ±3 semitones
- Time Shifting: Randomly shifting audio in time
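A minimal sketch of these augmentations, assuming librosa and NumPy and a mono signal `data` sampled at `sr` (the function names and default factors are illustrative, not taken from the paper):

```python
import numpy as np
import librosa

def add_noise(data, noise_factor=0.005):
    # Inject Gaussian noise scaled by the signal's peak amplitude
    noise = np.random.normal(0.0, 1.0, len(data))
    return data + noise_factor * np.amax(np.abs(data)) * noise

def stretch_time(data, rate=1.2):
    # Speed up (rate > 1) or slow down (rate < 1) without changing pitch
    return librosa.effects.time_stretch(y=data, rate=rate)

def shift_pitch(data, sr, n_steps=3):
    # Shift the pitch by n_steps semitones while preserving duration
    return librosa.effects.pitch_shift(y=data, sr=sr, n_steps=n_steps)

def shift_time(data, max_shift=1600):
    # Roll the waveform by a random number of samples
    return np.roll(data, np.random.randint(-max_shift, max_shift))
```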
Mel-Spectrograms:
- Visual representation of the temporal and spectral content of speech
- Captures rich emotional features not present in text/transcripts
- Provides 2D feature maps for CNN processing
MFCCs:
- 13 coefficients extracted per frame
- Pre-emphasis to enhance high frequencies
- Mel-scale transformation for human-like frequency perception
- DCT for decorrelation of filter bank energies
Additional Features (a combined extraction sketch follows this list):
- Zero Crossing Rate: Measures signal noisiness and periodicity
- Root Mean Squared Energy: Represents signal power over time
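The extraction of all four feature types can be sketched with librosa as follows; the `extract_features` helper and the default sample rate are assumptions for illustration:

```python
import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    # Load the clip as a mono signal at a fixed sample rate
    y, sr = librosa.load(path, sr=sr)

    # 13 MFCCs per frame (log mel filter bank energies followed by a DCT)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Log-scaled mel-spectrogram as a 2D time-frequency map
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

    # Frame-wise zero crossing rate and RMS energy
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)

    return mfcc, mel, zcr, rms
```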
CNN Model (a Keras sketch follows the list):
- Input: MFCC features (13 coefficients × 130 frames)
- Convolutional Blocks:
- Conv1D (256 filters, kernel=5) → BatchNorm → ReLU → MaxPool
- Conv1D (128 filters, kernel=3) → BatchNorm → ReLU → MaxPool
- Conv1D (64 filters, kernel=3) → BatchNorm → ReLU → MaxPool
- Classification Head:
- Flatten → Dense(1024, ReLU) → Dropout(0.5)
- Dense(8, softmax)
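A Keras reconstruction of this architecture might look as follows; the pooling sizes, padding, and time-major input layout are assumptions not specified above:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(130, 13), num_classes=8):
    # Three Conv1D blocks followed by the dense classification head
    model = models.Sequential([
        layers.Input(shape=input_shape),

        layers.Conv1D(256, 5, padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling1D(2),

        layers.Conv1D(128, 3, padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling1D(2),

        layers.Conv1D(64, 3, padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling1D(2),

        layers.Flatten(),
        layers.Dense(1024, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model
```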
CNN-LSTM Model (sketched after the list):
- Feature Extraction: CNN layers similar to the CNN model above
- Temporal Modeling: LSTM layers to capture sequential dependencies
- Training: 120 epochs with learning rate 0.00001
- Performance: 96% training accuracy, 72% test accuracy
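A sketch of the CNN-LSTM in the same style; the number and width of the LSTM layers are assumptions, since only the overall structure is stated above:

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(130, 13), num_classes=8):
    model = models.Sequential([
        layers.Input(shape=input_shape),

        # Convolutional feature extractor, as in the CNN above
        layers.Conv1D(256, 5, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),

        # LSTM layers capture the remaining sequential dependencies
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),

        # Classification head
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model
```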
Requirements:
- Python 3.8+
- TensorFlow 2.x
- Librosa
- NumPy
- Pandas
- Matplotlib
- Scikit-learn
- Seaborn
- IPython
- SoundFile
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/Automatic-Emotion-Recognition-using-TinyML.git
  cd Automatic-Emotion-Recognition-using-TinyML
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Data Preparation:
  - Download the RAVDESS and CREMA-D datasets
  - Place them in the `input/` directory with the following structure:

    ```
    input/
    ├── ravdess-emotional-speech-audio/
    │   └── audio_speech_actors_01-24/
    └── cremad/
        └── AudioWAV/
    ```

- Run the Emotion Recognition Pipeline:

  ```bash
  python emotion_recognition.py
  ```
The system includes several data augmentation techniques to improve model generalization:
- Noise Addition: Adds random Gaussian noise to audio signals
- Time Stretching: Slows down or speeds up the audio
- Pitch Shifting: Modifies the pitch of the audio
- Time Shifting: Shifts the audio in time
- MFCCs (Mel-frequency cepstral coefficients): 13 coefficients
- Delta and Delta-Delta: First and second derivatives of MFCCs
- Feature Normalization: Standard scaling of features (see the sketch after this list)
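A possible implementation of the delta features and the standard scaling, using librosa and scikit-learn (the exact feature layout is an assumption):

```python
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler

def mfcc_with_deltas(y, sr, n_mfcc=13):
    # Stack the MFCCs with their first and second derivatives -> (39, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])

def scale_features(X_train, X_test):
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test)
```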
- Framework: TensorFlow/Keras (the full training setup is sketched after this list)
- Environment: Google Colab with GPU acceleration
- Epochs: 120
- Batch Size: 32
- Optimizer: Adam (learning rate=0.00001)
- Loss Function: Categorical Cross-Entropy
- Callbacks:
- Learning Rate Reduction on Plateau
- Model Checkpointing
- Early Stopping
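Putting these settings together, the training step can be sketched as below; the callback thresholds, patience values, and checkpoint path are assumptions:

```python
import tensorflow as tf

def train(model, X_train, y_train, X_val, y_val):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss='categorical_crossentropy',
        metrics=['accuracy'],
    )
    callbacks = [
        # Reduce the learning rate when the validation loss plateaus
        tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
        # Keep the weights of the best epoch seen so far
        tf.keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True),
        # Stop training early if the validation loss stops improving
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15,
                                         restore_best_weights=True),
    ]
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=120,
        batch_size=32,
        callbacks=callbacks,
    )
```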
| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| CNN | 99% | 67% |
| CNN-LSTM | 96% | 72% |
- CNN-LSTM showed better generalization (higher test accuracy)
- CNN showed signs of overfitting (99% training vs 67% test)
- The combined use of MFCCs and mel-spectrograms provided robust features
- Data augmentation significantly improved model robustness
Model performance is evaluated using the following metrics (see the sketch after the list):
- Accuracy
- Confusion Matrix
- Classification Report (Precision, Recall, F1-Score)
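These metrics can be computed with scikit-learn; a minimal sketch (the `evaluate` helper and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def evaluate(model, X_test, y_test_onehot, class_names):
    # Convert one-hot labels and softmax outputs to class indices
    y_true = np.argmax(y_test_onehot, axis=1)
    y_pred = np.argmax(model.predict(X_test), axis=1)

    print("Accuracy:", accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=class_names))
```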
The trained models were optimized for edge deployment using TensorFlow Lite:
```python
import tensorflow as tf

# Convert the model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model
with open('emotion_recognition.tflite', 'wb') as f:
    f.write(tflite_model)
```
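The converter can optionally apply TensorFlow Lite's default post-training optimization, and the converted model can be exercised with the TFLite interpreter. A hedged sketch (the `predict` helper is illustrative):

```python
import numpy as np
import tensorflow as tf

# Optional: enable default post-training optimization before convert()
# converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Load the converted model and run inference
interpreter = tf.lite.Interpreter(model_path='emotion_recognition.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict(features):
    # features must match the model's expected input shape and dtype
    interpreter.set_tensor(input_details[0]['index'],
                           features.astype(np.float32)[np.newaxis, ...])
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]['index'])[0]
```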
- Hardware: Compatible with various TinyML boards
- Inference: Real-time emotion classification
- Input: Audio from onboard microphone
- Output: Predicted emotion class with confidence score
- Waveform Visualization: Time-domain representation of audio signals (a plotting sketch follows this list)
- Spectrograms: Frequency content over time for different emotions
- MFCC Plots: Visualizing cepstral coefficients
- Learning Curves: Training/validation accuracy and loss over epochs
- Confusion Matrices: Per-class performance analysis
- Feature Importance: Analysis of which features contribute most to classification
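A small plotting sketch for the waveform and mel-spectrogram views, assuming a recent librosa and Matplotlib; the function name is illustrative:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def plot_waveform_and_melspec(path, title='audio clip'):
    y, sr = librosa.load(path)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

    # Time-domain waveform
    librosa.display.waveshow(y, sr=sr, ax=ax1)
    ax1.set_title(f'Waveform: {title}')

    # Log-scaled mel-spectrogram
    mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    img = librosa.display.specshow(mel_db, sr=sr, x_axis='time', y_axis='mel', ax=ax2)
    fig.colorbar(img, ax=ax2, format='%+2.0f dB')
    ax2.set_title(f'Mel-spectrogram: {title}')

    plt.tight_layout()
    plt.show()
```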
Real-Time Demo:
- Live visualization of input audio features
- Real-time emotion classification feedback
This project is licensed under the MIT License - see the LICENSE file for details.
- RAVDESS Dataset: https://zenodo.org/record/1188976
- CREMA-D Dataset: https://github.com/CheyneyComputerScience/CREMA-D