This repository contains all the code and usage instructions related to models proposed and implemented by team JMI for the Multimodal Emotion Cause Pair Extraction task at SemEval 2024.
We experimented with two LLM-based approaches for subtask-2 (the multimodal task), both of which gave competitive results:
- GPT-4 Vision for Video Captioning and Few-Shot Prompting of GPT-3.5 using RAG for Emotion and Cause Prediction (Best Results)
- Fine-tuning LLaMA2 for Emotion and Cause Prediction.
├── data
│   ├── submission
│   │   └── Few-Shot GPT using RAG
│   │       ├── gpt-output
│   │       └── gpt-out-self-added
│   └── text
├── GPT-RAG
│   ├── video_captioning.ipynb
│   ├── index-creation.ipynb
│   ├── emotion_prediction.ipynb
│   ├── cause_prediction.ipynb
│   ├── cause_analysis.ipynb
│   ├── requirements.txt
│   └── eval_videos
├── Llama2
│   ├── cause_inference.ipynb
│   ├── cause_prediction_training.ipynb
│   ├── emotion_inference.ipynb
│   ├── emotion_recognition_training.ipynb
│   ├── generate_input.py
│   └── requirements.txt
├── old_experiments
└── README.md
The `data/submission` folder contains the submitted JSON files for subtask-2. `gpt-out-self-added` is the best submission.
Move into the GPT-RAG folder and install all the dependencies for the project:
cd GPT-RAG
pip install -r requirements.txt
Download the eval-set videos from the Google Drive Link and extract them into the `GPT-RAG/eval_videos/` folder.
Download the following zip file: Google Drive Link. Extract the contents of the zip file into the `GPT-RAG` folder.
It contains all the intermediate processed outputs such as:
- `frames`: Folder containing one image per utterance video from the eval set. For each utterance video, 9 equidistant frames were sampled and placed in row-major order in a 3x3 grid to make up a single image. `video_captioning.ipynb` contains the code to generate the images.
- `all_emotion_index`: FAISS index containing OpenAI `text-embedding-ada-002-v2` embeddings for conversations from the training set that contain emotional utterances for all 6 emotions. `index-creation.ipynb` contains the code to generate the index.
- `cause_windows`: FAISS indices containing OpenAI `text-embedding-ada-002-v2` embeddings for conversational windows from the training set, built for each of the 6 emotions at 3 different utterance positions: beginning, middle and end (see the windowing sketch after this list).
  - beginning: for an emotional utterance at the beginning of a conversation, a window of 3 utterances is created containing the current utterance and the next 2 utterances.
  - middle: for an emotional utterance in the middle of a conversation, a window of 8 utterances is created containing the current utterance, the previous 5 utterances and the next 2 utterances.
  - end: for an emotional utterance at the end of a conversation, a window of 6 utterances is created containing the current utterance and the previous 5 utterances.
  In total, 18 indices (6 emotions x 3 positions) are created. `index-creation.ipynb` contains the code to generate all the windows and indices.
- `eval_raw_out.json`: JSON file containing batched captions for each video conversation of the eval set. Due to rate limits on the OpenAI `gpt-4-vision-preview` API, only 10 images can be sent per request; these are the grid images from the `frames` folder for 10 consecutive utterances of a conversation. Thus a conversation with 14 utterances is processed as 2 requests: the first 10 utterances are captioned, then the remaining 4. `video_captioning.ipynb` contains the code to generate the captions.
- `eval_proc_out.json`: JSON file containing the final caption for each video conversation of the eval set. The batched outputs are combined by prompting the OpenAI `gpt-3.5-turbo-1106` API to generate a coherent caption for the whole sequence. `video_captioning.ipynb` contains the code to postprocess the generated captions.
- `emotion_explainations.json`: JSON file containing explanations generated by prompting the OpenAI `gpt-3.5-turbo-1106` API to explain each emotion annotation in training-set conversations that contain all emotions. `index-creation.ipynb` contains the code to generate the explanations.
- `cause_windows.json`: JSON file containing all the different windows created for each emotional utterance, along with a prompt asking for an explanation of each cause annotation. `index-creation.ipynb` contains the code to generate the file.
- `emotion_eval_labelled.json`: JSON file containing annotated emotions for each conversation in the eval set. `emotion_prediction.ipynb` contains the code to annotate emotions.
- `cur_anno.json`: JSON file containing annotated emotion-cause pairs for the eval set. `cause_prediction.ipynb` contains the code to annotate causes for each emotional utterance.
- `cur_anno_same_added.json`: a postprocessed `cur_anno.json` in which, for each emotional utterance in every conversation of the eval set, a self-cause is added if it is not already present among the emotion-cause pairs. `cause_prediction.ipynb` contains the code to perform this postprocessing step. This is the best prediction file.
- `video_captioning.ipynb`: generates the grid image for each utterance video, prompts GPT-4 Vision to generate captions for conversations, and stores them in a JSON file.
- `index-creation.ipynb`: generates the FAISS index for conversations containing all emotions, as well as the window indices for each emotional utterance in training-set conversations.
- `emotion_prediction.ipynb`: predicts emotions by prompting the OpenAI `gpt-3.5-turbo-1106` API to annotate the emotions in a conversation. For a conversation from the eval set, the closest conversation from `all_emotion_index` is retrieved along with its explanation from `emotion_explainations.json`, and both are provided as a few-shot example in the prompt. The video caption for the whole conversation is also provided as context (a sketch of the retrieval step follows this list).
- `cause_prediction.ipynb`: predicts emotion-cause pairs for a conversation whose emotions have already been annotated, by prompting the OpenAI `gpt-3.5-turbo-1106` API. For an annotated emotional utterance with emotion E, a window is created around the utterance based on its position P (beginning/middle/end, as described above), and the three closest windows from the training set are retrieved from the FAISS index for emotion E and position P. GPT-3.5 is prompted to explain the causes in each of the 3 retrieved windows, and these windows are then provided as few-shot examples in the cause-prediction prompt. The video caption for the whole conversation is also provided as context.
- `cause_analysis.ipynb`: contains exploratory data analysis of the distribution of causes and their positions relative to the emotional utterances.
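For orientation, here is a minimal sketch of the retrieval-augmented few-shot flow used in `emotion_prediction.ipynb`: embed conversations with OpenAI embeddings, index them with FAISS, retrieve the closest training conversation for an eval conversation, and prompt `gpt-3.5-turbo-1106` with it as a few-shot example. The variable names, placeholder texts, and prompt wording are illustrative only; the real prompts and data loading are in the notebooks.

```python
import numpy as np
import faiss                    # pip install faiss-cpu
from openai import OpenAI       # openai>=1.0 client; assumes OPENAI_API_KEY is set

client = OpenAI()

def embed(texts):
    """Return text-embedding-ada-002 embeddings as a float32 array."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Placeholder training conversations (in the repo these come from the training set).
train_convs = ["Chandler: ...\nJoey: ...", "Monica: ...\nRoss: ..."]
index = faiss.IndexFlatL2(1536)          # ada-002 embeddings have 1536 dimensions
index.add(embed(train_convs))

# Retrieve the closest training conversation for an eval conversation.
eval_conv = "Phoebe: ...\nRachel: ..."
_, ids = index.search(embed([eval_conv]), 1)
example = train_convs[ids[0][0]]

# Few-shot prompt: retrieved example (plus its explanation) + video caption + target conversation.
prompt = (
    "Annotate each utterance with one of the 6 emotions.\n\n"
    f"Example conversation:\n{example}\n"
    "Example annotation: <explanation from emotion_explainations.json>\n\n"
    "Video caption: <caption from eval_proc_out.json>\n\n"
    f"Conversation to annotate:\n{eval_conv}"
)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)
```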
Move into the Llama2 folder and install all the dependencies for the project:
cd Llama2
pip install -r requirements.txt
- `emotion_recognition_training.ipynb`: fine-tunes a `meta-llama/Llama-2-13b-chat-hf` LLM to predict the emotion label of a particular utterance in a given conversation. As context for predicting each utterance's emotion label, we provide the entire conversation along with the speaker information to guide the prediction.
- `emotion_inference.ipynb`: performs inference on test data using the fine-tuned Llama-2. The resulting emotion-labelled conversations are stored in the `results_train` or `results_test` folder in the file `emotion_labelled_data.json` (a sketch of the inference step follows this list).
- `cause_prediction_training.ipynb`: fine-tunes another `meta-llama/Llama-2-13b-chat-hf` model to predict the cause utterances for a particular utterance in a given conversation. As context, we provide the entire conversation together with all the predicted emotion labels, since these add useful information for guiding cause prediction; we essentially treat this as a two-step process. The model is trained to output a list of cause utterance ids.
- `cause_inference.ipynb`: performs inference on test data using the fine-tuned cause predictor. The final results are stored in the same `results_train` or `results_test` folder under the name `Subtask_2_pred.json`.
- `generate_input.py`: generates the train, test, and validation splits and adds the `video_name` to each utterance.
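A rough sketch of what inference with a fine-tuned adapter could look like (the adapter path, prompt format, and generation settings below are placeholder assumptions; the actual code lives in `emotion_inference.ipynb` and `cause_inference.ipynb`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-2-13b-chat-hf"
adapter_path = "./results_test/emotion_adapter"   # hypothetical path to the saved LoRA adapter

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_path)   # attach the fine-tuned LoRA weights
model.eval()

# Illustrative prompt: whole conversation with speakers, asking for one utterance's emotion.
prompt = (
    "Conversation:\n"
    "1. Chandler: Could this BE any more awkward?\n"
    "2. Joey: I don't know, maybe?\n\n"
    "Emotion of utterance 1:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```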
- Both `emotion_recognition_training` and `cause_prediction_training` used one Nvidia A100 40GB GPU for training (available on Google Colab Pro, priced at $11.8/month).
- We use the `accelerate` library for offloading the model to CPU and disk. (See: accelerate)
- Due to memory constraints, we use Quantized Low-Rank Adaptation (QLoRA) to fine-tune a 4-bit quantized Llama-2 model using the `bitsandbytes` library. (See: bitsandbytes)
- We use the `peft` library for parameter-efficient fine-tuning, where we define the configuration for LoRA. (See: peft)
- Supervised fine-tuning is performed using the `trl` library, which provides `SFTTrainer` for the supervised fine-tuning step of RLHF (a sketch of this setup follows this list). (See: trl)
- Inference is performed using two Tesla T4 16GB GPUs (available on Kaggle for free, 30 hrs/month).
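A minimal sketch of how these pieces fit together, assuming a trl version (around 0.7, early 2024) whose `SFTTrainer` still accepts `dataset_text_field` and `max_seq_length` directly; the hyperparameters, toy dataset, and text format below are illustrative, not the values used in the notebooks:

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 quantization (bitsandbytes) so the 13B model fits on a single A100 40GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Toy instruction-style dataset (the real prompt format is defined in the notebooks).
train_data = Dataset.from_list(
    [{"text": "Conversation: ...\nEmotion of utterance 3: joy"}]
)

# LoRA configuration for parameter-efficient fine-tuning (illustrative rank/alpha).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           num_train_epochs=1, fp16=True),
)
trainer.train()
```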
Note: All the detailed prompts for both approaches are provided in the respective notebooks wherever they are used.