This repository presents different approaches to the Music Emotion Recognition task from a regression perspective, using Convolutional Neural Networks trained on MFCC audio features.
Considering the 2D representation of emotions defined by the valence and arousal dimensions [Circumplex Model of Affect], two types of models are created:
- 2D-output model: predicts values for both valence and arousal
- 1D-output models: separately predict values for valence and arousal, respectively
For the complete description of the methods and experiments, please refer to unimodal MER.
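A minimal sketch of how the two model types could be structured, assuming a small CNN regressor over an MFCC input; the layer sizes, kernel sizes, and input shape below are illustrative and not the repository's actual architecture.

```python
import torch
import torch.nn as nn

class MERConvNet(nn.Module):
    """Illustrative CNN regressor over an MFCC matrix (n_mfcc x n_frames).

    out_dim=2 corresponds to the 2D-output model (valence and arousal),
    out_dim=1 to a 1D-output model (valence *or* arousal).
    """
    def __init__(self, out_dim: int = 2, filter_size: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=filter_size, padding=filter_size // 2),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=filter_size, padding=filter_size // 2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> (batch, 32, 1, 1)
        )
        self.regressor = nn.Linear(32, out_dim)

    def forward(self, x):              # x: (batch, 1, n_mfcc, n_frames)
        h = self.features(x).flatten(1)
        return self.regressor(h)

# 2D-output model vs. a 1D-output (e.g. valence-only) model
model_both = MERConvNet(out_dim=2)
model_valence = MERConvNet(out_dim=1)
```

In this sketch the only difference between the two variants is the size of the output layer.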
The dataset used is The MediaEval Database for Emotional Analysis of Music (DEAM), consisting of 1,744 song excerpts of ~45 sec duration, with two types of valence and arousal annotations available: dynamic (measured per second) and static (measured over the whole 45 sec excerpt). In this project, the static annotations are used.
The data directory should have the following structure:
```
.
├── Data
    ├── DEAM_dataset
    │   ├── Audio
    │   │   ├── *.mp3
    │   ├── static_annotations.csv
```
Preprocessing consists of the following steps:
- Extract the annotations and audio file names from `static_annotations.csv`
- Augment the dataset
- Extract MFCC features from the waveforms (see the sketch below)
- Make the train and test sets
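A minimal sketch of the annotation-loading and MFCC-extraction steps, assuming librosa and pandas; the CSV column names (`song_id`, `valence_mean`, `arousal_mean`), the audio paths, and the MFCC settings are assumptions rather than the repository's exact values.

```python
import librosa
import numpy as np
import pandas as pd

# Column names and paths are assumptions; check the header of static_annotations.csv.
annotations = pd.read_csv("Data/DEAM_dataset/static_annotations.csv")

features, targets = [], []
for _, row in annotations.iterrows():
    audio_path = f"Data/DEAM_dataset/Audio/{int(row['song_id'])}.mp3"
    waveform, sr = librosa.load(audio_path, sr=22050)           # decode excerpt
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)   # (20, n_frames)
    features.append(mfcc)
    targets.append([row["valence_mean"], row["arousal_mean"]])

X = np.stack(features)   # requires equal-length excerpts (or padding/cropping)
y = np.array(targets)
```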
There are three options for training:
- 2D-output model: `--dimension=both` creates a model that predicts both valence and arousal, with the filter size defined in `--params_dict` (see the sketch below)
- valence model: `--dimension=valence` creates a model that predicts valence, with the filter size defined in `--valence_params_dict`
- arousal model: `--dimension=arousal` creates a model that predicts arousal, with the filter size defined in `--arousal_params_dict`
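The flag names below follow the options above; the script layout, the default dictionary values, and parsing the dicts with `ast.literal_eval` are assumptions. The sketch only shows how `--dimension` could select the output size and the matching parameter dictionary.

```python
import argparse
import ast

parser = argparse.ArgumentParser()
parser.add_argument("--dimension", choices=["both", "valence", "arousal"], default="both")
parser.add_argument("--params_dict", default="{'filter_size': 3}")          # illustrative default
parser.add_argument("--valence_params_dict", default="{'filter_size': 3}")  # illustrative default
parser.add_argument("--arousal_params_dict", default="{'filter_size': 3}")  # illustrative default
args = parser.parse_args()

if args.dimension == "both":
    out_dim, params = 2, ast.literal_eval(args.params_dict)          # one model, two outputs
elif args.dimension == "valence":
    out_dim, params = 1, ast.literal_eval(args.valence_params_dict)  # valence-only model
else:  # "arousal"
    out_dim, params = 1, ast.literal_eval(args.arousal_params_dict)  # arousal-only model
```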
Control the training by modifying the default values for the following parameters:
- `--device=cuda` (train on CUDA)
- `--log_interval=1` (print the train and validation loss every epoch)
- `--num_epochs=2000`
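A hedged sketch of how these three parameters would typically drive the training loop; the loss, optimizer, and data-loader interfaces are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda",
          num_epochs=2000, log_interval=1, lr=1e-3):
    """Illustrative loop: optimizer, loss, and loader interfaces are assumptions."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                      # regression on valence/arousal

    for epoch in range(1, num_epochs + 1):
        model.train()
        train_loss = 0.0
        for mfcc, target in train_loader:
            mfcc, target = mfcc.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mfcc), target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        if epoch % log_interval == 0:             # report every log_interval epochs
            model.eval()
            with torch.no_grad():
                val_loss = sum(criterion(model(m.to(device)), t.to(device)).item()
                               for m, t in val_loader)
            print(f"epoch {epoch}: train {train_loss / len(train_loader):.4f} "
                  f"val {val_loss / len(val_loader):.4f}")
```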
The model saved as `Models/model_<dimension>.pt` will be loaded:
- for the 2D-output model: `--dimension=both`
- for the valence model: `--dimension=valence`
- for the arousal model: `--dimension=arousal`
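A minimal inference sketch, assuming the checkpoint stores the full module (saved with `torch.save(model, path)`); the input shape and the CPU device below are placeholders.

```python
import torch

dimension = "valence"                 # or "both" / "arousal"
# Assumes the whole module was pickled; if only a state_dict was saved, instantiate
# the model first and call load_state_dict. Recent PyTorch versions may also need
# weights_only=False for full-module checkpoints.
model = torch.load(f"Models/model_{dimension}.pt", map_location="cpu")
model.eval()

mfcc = torch.randn(1, 1, 20, 1292)    # placeholder MFCC tensor; real shape depends on preprocessing
with torch.no_grad():
    prediction = model(mfcc)          # 1 value for valence/arousal, 2 for the "both" model
print(prediction)
```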
Dependencies:
- PyTorch
- librosa