Aref Farhadipour<sup>1,2</sup>, Hossein Ranjbar<sup>1</sup>, Masoumeh Chapariniya<sup>1,2</sup>, Teodora Vukovic<sup>1,2</sup>, Sarah Ebling<sup>1</sup>, Volker Dellwo<sup>1</sup>
<sup>1</sup> Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
<sup>2</sup> Digital Society Initiative, University of Zurich, Zurich, Switzerland
This paper presents a multimodal approach to emotion recognition and sentiment analysis on the MELD dataset. We propose a system that integrates four key modalities using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, InceptionResNet for facial expressions, and a MobileNet-V2 + Local Transformer architecture trained from scratch for video analysis. The architecture of the proposed system is depicted in the figure below.
This repository provides a PyTorch-based implementation of the video-based emotion recognition branch (MobileNet-V2 + Local Transformer).
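As a rough guide to the video branch, the sketch below combines a MobileNet-V2 backbone (1280-dimensional frame embeddings, matching `--hidden_size 1280` used in the commands further down) with a small transformer encoder over the frame sequence and a classification head sized like `--classifier_hidden_size 512`. The class name, layer choices, and pooling are illustrative assumptions, not the exact code in this repository.

```python
# Illustrative sketch only: class name, layer choices, and pooling are assumptions,
# not the exact code in this repository.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class VideoEmotionModel(nn.Module):
    def __init__(self, hidden_size=1280, num_layers=2, num_classes=7,
                 classifier_hidden_size=512, dropout=0.2):
        super().__init__()
        # MobileNet-V2 backbone: per-frame feature extractor (1280-d embeddings)
        self.backbone = mobilenet_v2(weights="DEFAULT").features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Local transformer: self-attention over the frame sequence
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8,
                                                   batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Classification head sized like --classifier_hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, classifier_hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(classifier_hidden_size, num_classes),
        )

    def forward(self, frames):                       # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))      # (batch*time, 1280, h, w)
        x = self.pool(x).flatten(1).view(b, t, -1)   # (batch, time, 1280)
        x = self.temporal(x)                         # temporal context across frames
        return self.classifier(x.mean(dim=1))        # average over time -> class logits
```

For example, calling `VideoEmotionModel()(torch.randn(2, 8, 3, 224, 224))` on a dummy clip of 8 frames would return a `(2, 7)` tensor of emotion logits.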
- Download and extract the MELD dataset
mkdir dataset
cd dataset
wget https://web.eecs.umich.edu/~mihalcea/downloads/MELD.Raw.tar.gz
tar -xvzf MELD.Raw.tar.gz
cd ..
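Depending on the release, the extracted archive contains the train/dev/test video splits (possibly as nested tarballs) together with CSV annotation files. A quick way to peek at the labels; the CSV filename and column names follow the standard MELD release, but the exact location under dataset/ is an assumption about how the archive extracted on your machine:

```python
# Peek at the MELD annotations; filename and columns follow the standard MELD release,
# but the path under dataset/ depends on how the archive extracted on your machine.
from pathlib import Path
import pandas as pd

csv_path = next(Path("dataset").rglob("train_sent_emo.csv"))   # locate the train annotations
df = pd.read_csv(csv_path)
print(df[["Utterance", "Emotion", "Sentiment"]].head())
print(df["Emotion"].value_counts())                            # the 7 MELD emotion classes
```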
- Clone this repository
git clone https://github.com/HoseinRanjbar/Emotion_Recognition.git
cd Emotion_Recognition
- Install dependencies
pip install -r requirements.txt
- Download weights
mkdir pretrained_model
sh scripts/download_weights.sh
- Test
Use the following command to test on the MELD dataset.
python3 -u test.py --data {PATH-TO-DATASET} --model_path ./pretrained_model/BEST.pt --num_classes 7 --recognition 'emotion' --save_dir ./output --lookup_table ./tools/data/emotion_lookup_table.json --classifier_hidden_size 512
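The `--lookup_table` JSON maps the seven MELD emotion labels (anger, disgust, fear, joy, neutral, sadness, surprise) to class indices. Below is a small sketch of how such a table can be used to turn classifier logits back into a label; the exact key/value orientation of the JSON file is an assumption here:

```python
# Sketch only: assumes emotion_lookup_table.json is a {label: index} mapping.
import json
import torch

with open("tools/data/emotion_lookup_table.json") as f:
    label_to_idx = json.load(f)
idx_to_label = {idx: label for label, idx in label_to_idx.items()}

logits = torch.randn(1, len(label_to_idx))          # stand-in for the classifier output
print(idx_to_label[int(logits.argmax(dim=-1))])     # predicted emotion label
```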
- Training
We also provide the training code. Use the following command to train on the MELD dataset:
python3 -u train.py --data {PATH-TO-DATASET} --batch_size 2 --num_classes 7 --recognition 'emotion' --emb_network 'mb2' --hidden_size 1280 --save_dir ./pretrained_model --lookup_table ./tools/data/emotion_lookup_table.json --initial_lr 0.001 --num_layers 2 --dp_keep_prob 0.8 --classifier_hidden_size 512
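For orientation, the sketch below shows where those flags would land in a plain PyTorch training step: `--initial_lr` feeds the optimizer, `--dp_keep_prob 0.8` corresponds to a dropout rate of 0.2, and the head goes from `--hidden_size 1280` through `--classifier_hidden_size 512` to `--num_classes 7`. The optimizer choice and the stand-in model are assumptions, not the repository's actual train.py.

```python
# Minimal sketch, not the repository's train.py; only the hyperparameters mirror the flags above.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for the video model sketched earlier
    nn.Linear(1280, 512),                   # --hidden_size 1280 -> --classifier_hidden_size 512
    nn.ReLU(),
    nn.Dropout(0.2),                        # --dp_keep_prob 0.8 == dropout rate 0.2
    nn.Linear(512, 7),                      # --num_classes 7
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # --initial_lr 0.001 (optimizer choice assumed)
criterion = nn.CrossEntropyLoss()

# Dummy batch standing in for pooled frame features and emotion labels (--batch_size 2)
features = torch.randn(2, 1280)
labels = torch.randint(0, 7, (2,))

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```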