This repository contains all the code and usage instructions related to models proposed and implemented by team JMI for the Multimodal Emotion Cause Pair Extraction task at SemEval 2024.
We experimented with two LLM-based approaches for subtask-2 (the multimodal task), both of which gave competitive results:
- GPT-4 Vision for Video Captioning and Few-Shot Prompting of GPT-3.5 using RAG for Emotion and Cause Prediction (Best Results)
- Fine-tuning LLaMA2 for Emotion and Cause Prediction.
├── data
│   ├── submission
│   │   └── Few-Shot GPT using RAG
│   │       ├── gpt-output
│   │       └── gpt-out-self-added
│   └── text
├── GPT-RAG
│   ├── video_captioning.ipynb
│   ├── index-creation.ipynb
│   ├── emotion_prediction.ipynb
│   ├── cause_prediction.ipynb
│   ├── cause_analysis.ipynb
│   ├── requirements.txt
│   └── eval_videos
├── Llama2
│   ├── cause_inference.ipynb
│   ├── cause_prediction_training.ipynb
│   ├── emotion_inference.ipynb
│   ├── emotion_recognition_training.ipynb
│   ├── generate_input.py
│   └── requirements.txt
├── old_experiments
└── README.md
The `data/submission` folder contains the submitted JSON files for subtask-2. `gpt-out-self-added` is the best submission.
Move into the GPT-RAG folder and install all the dependencies for the project:
cd GPT-RAG
pip install -r requirements.txt
Download the eval-set videos from the Google Drive Link and extract them into the `GPT-RAG/eval_videos/` folder.
Download the following zip file: Google Drive Link. Extract the contents of the zip file into the `GPT-RAG` folder.
It contains all the intermediate processed outputs such as:
- `frames`: Folder containing one image per utterance video from the eval set. For each utterance video, 9 equidistant frames were sampled and placed in row-major order in a 3x3 grid to make up a single image. `video_captioning.ipynb` contains the code to generate the images.
- `all_emotion_index`: FAISS index containing OpenAI `text-embedding-ada-002-v2` embeddings for conversations from the training set that contain emotional utterances for all 6 emotions. `index-creation.ipynb` contains the code to generate the index.
- `cause_windows`: FAISS indices containing OpenAI `text-embedding-ada-002-v2` embeddings for conversational windows from the training set, built for each of the 6 emotions at 3 different utterance positions: beginning, middle and end (see the windowing sketch after this list).
  - beginning: for an emotional utterance at the beginning of a conversation, a window of 3 utterances is created containing the current utterance and the next 2 utterances.
  - middle: for an emotional utterance in the middle of a conversation, a window of 8 utterances is created containing the current utterance, the previous 5 utterances and the next 2 utterances.
  - end: for an emotional utterance at the end of a conversation, a window of 6 utterances is created containing the current utterance and the previous 5 utterances.
  In total, 18 indices (6 emotions x 3 positions) are created. `index-creation.ipynb` contains the code to generate all the windows and indices.
- `eval_raw_out.json`: JSON file containing batched captions for each video conversation of the eval set. Due to rate limits on the OpenAI `gpt-4-vision-preview` API, only 10 images can be sent per request; these are the grid images from the `frames` folder for 10 consecutive utterances of a conversation. Thus a conversation with 14 utterances is processed as 2 requests: the first 10 utterances are captioned, then the remaining 4. `video_captioning.ipynb` contains the code to generate the captions.
- `eval_proc_out.json`: JSON file containing the final caption for each video conversation of the eval set. The batched outputs are combined by prompting the OpenAI `gpt-3.5-turbo-1106` API to generate a coherent caption for the whole sequence. `video_captioning.ipynb` contains the code to postprocess the generated captions.
- `emotion_explainations.json`: JSON file containing explanations generated by prompting the OpenAI `gpt-3.5-turbo-1106` API to explain each emotion annotation in training-set conversations that contain all emotions. `index-creation.ipynb` contains the code to generate the explanations.
- `cause_windows.json`: JSON file containing all the different windows created for each emotional utterance, along with a prompt asking for an explanation of each cause annotation. `index-creation.ipynb` contains the code to generate the file.
- `emotion_eval_labelled.json`: JSON file containing annotated emotions for each conversation in the eval set. `emotion_prediction.ipynb` contains the code to annotate emotions.
- `cur_anno.json`: JSON file containing annotated emotion-cause pairs for the eval set. `cause_prediction.ipynb` contains the code to annotate causes for each emotional utterance.
- `cur_anno_same_added.json`: a postprocessed `cur_anno.json` in which, for each emotional utterance in every conversation of the eval set, a self-cause is added if it is not already present among the emotion-cause pairs. `cause_prediction.ipynb` contains the code to perform this postprocessing step. This is the best prediction file.
- `video_captioning.ipynb`: generates the grid image for each utterance video, prompts GPT-4 Vision to generate captions for conversations, and stores them in a JSON file.
- `index-creation.ipynb`: generates the FAISS index for conversations containing all emotions, as well as the window indices for each emotional utterance in training-set conversations.
- `emotion_prediction.ipynb`: predicts emotions by prompting the OpenAI `gpt-3.5-turbo-1106` API to annotate the emotions in a conversation. For a conversation from the eval set, the closest conversation from `all_emotion_index` is retrieved along with its explanation from `emotion_explainations.json`, and both are provided as a few-shot example in the prompt. The video caption for the whole conversation is also provided as context (a sketch of the retrieval step follows this list).
- `cause_prediction.ipynb`: predicts emotion-cause pairs for a conversation whose emotions have already been annotated, by prompting the OpenAI `gpt-3.5-turbo-1106` API. For an annotated emotional utterance with emotion E, a window is created around the utterance based on its position P (beginning/middle/end, as described above), and the three closest windows from the training set are retrieved from the FAISS index for emotion E and position P. GPT-3.5 is prompted to explain the causes in each of the 3 retrieved windows, and these windows are then provided as few-shot examples in the cause-prediction prompt. The video caption for the whole conversation is also provided as context.
- `cause_analysis.ipynb`: contains exploratory data analysis of the distribution of causes and their positions relative to the emotional utterances.
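For orientation, here is a minimal sketch of the retrieval-augmented few-shot flow used in `emotion_prediction.ipynb`: embed conversations with OpenAI embeddings, index them with FAISS, retrieve the closest training conversation for an eval conversation, and prompt `gpt-3.5-turbo-1106` with it as a few-shot example. The variable names, placeholder texts, and prompt wording are illustrative only; the real prompts and data loading are in the notebooks.

```python
import numpy as np
import faiss                    # pip install faiss-cpu
from openai import OpenAI       # openai>=1.0 client; assumes OPENAI_API_KEY is set

client = OpenAI()

def embed(texts):
    """Return text-embedding-ada-002 embeddings as a float32 array."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Placeholder training conversations (in the repo these come from the training set).
train_convs = ["Chandler: ...\nJoey: ...", "Monica: ...\nRoss: ..."]
index = faiss.IndexFlatL2(1536)          # ada-002 embeddings have 1536 dimensions
index.add(embed(train_convs))

# Retrieve the closest training conversation for an eval conversation.
eval_conv = "Phoebe: ...\nRachel: ..."
_, ids = index.search(embed([eval_conv]), 1)
example = train_convs[ids[0][0]]

# Few-shot prompt: retrieved example (plus its explanation) + video caption + target conversation.
prompt = (
    "Annotate each utterance with one of the 6 emotions.\n\n"
    f"Example conversation:\n{example}\n"
    "Example annotation: <explanation from emotion_explainations.json>\n\n"
    "Video caption: <caption from eval_proc_out.json>\n\n"
    f"Conversation to annotate:\n{eval_conv}"
)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)
```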
Move into the Llama2 folder and install all the dependencies for the project:
cd Llama2
pip install -r requirements.txt
- `emotion_recognition_training.ipynb`: fine-tunes a `meta-llama/Llama-2-13b-chat-hf` LLM to predict the emotion label of a particular utterance in a given conversation. As context for predicting each utterance's emotion label, we provide the entire conversation along with the speaker information to guide the prediction.
- `emotion_inference.ipynb`: performs inference on test data using the fine-tuned Llama-2. The resulting emotion-labelled conversations are stored in the `results_train` or `results_test` folder in the file `emotion_labelled_data.json` (a sketch of the inference step follows this list).
- `cause_prediction_training.ipynb`: fine-tunes another `meta-llama/Llama-2-13b-chat-hf` model to predict the cause utterances for a particular utterance in a given conversation. As context, we provide the entire conversation together with all the predicted emotion labels, since these add useful information for guiding cause prediction; we essentially treat this as a two-step process. The model is trained to output a list of cause utterance ids.
- `cause_inference.ipynb`: performs inference on test data using the fine-tuned cause predictor. The final results are stored in the same `results_train` or `results_test` folder under the name `Subtask_2_pred.json`.
- `generate_input.py`: generates the train, test, and validation splits and adds the `video_name` to each utterance.
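A rough sketch of what inference with a fine-tuned adapter could look like (the adapter path, prompt format, and generation settings below are placeholder assumptions; the actual code lives in `emotion_inference.ipynb` and `cause_inference.ipynb`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-2-13b-chat-hf"
adapter_path = "./results_test/emotion_adapter"   # hypothetical path to the saved LoRA adapter

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_path)   # attach the fine-tuned LoRA weights
model.eval()

# Illustrative prompt: whole conversation with speakers, asking for one utterance's emotion.
prompt = (
    "Conversation:\n"
    "1. Chandler: Could this BE any more awkward?\n"
    "2. Joey: I don't know, maybe?\n\n"
    "Emotion of utterance 1:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```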
- Both `emotion_recognition_training` and `cause_prediction_training` used one Nvidia A100 40GB GPU for training (available on Google Colab Pro, priced at $11.8/month).
- We use the `accelerate` library for offloading the model to CPU and disk. (See: accelerate)
- Due to memory constraints, we use Quantized Low-Rank Adaptation (QLoRA) to fine-tune a 4-bit quantized Llama-2 model using the `bitsandbytes` library. (See: bitsandbytes)
- We use the `peft` library for parameter-efficient fine-tuning, where we define the configuration for LoRA. (See: peft)
- Supervised fine-tuning is performed using the `trl` library, which provides `SFTTrainer` for the supervised fine-tuning step of RLHF (a sketch of this setup follows this list). (See: trl)
- Inference is performed using two Tesla T4 16GB GPUs (available on Kaggle for free, 30 hrs/month).
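A minimal sketch of how these pieces fit together, assuming a trl version (around 0.7, early 2024) whose `SFTTrainer` still accepts `dataset_text_field` and `max_seq_length` directly; the hyperparameters, toy dataset, and text format below are illustrative, not the values used in the notebooks:

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 quantization (bitsandbytes) so the 13B model fits on a single A100 40GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Toy instruction-style dataset (the real prompt format is defined in the notebooks).
train_data = Dataset.from_list(
    [{"text": "Conversation: ...\nEmotion of utterance 3: joy"}]
)

# LoRA configuration for parameter-efficient fine-tuning (illustrative rank/alpha).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           num_train_epochs=1, fp16=True),
)
trainer.train()
```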
Note: All the detailed prompts for both approaches are provided in the respective notebooks wherever they are used.