This project is part of the DD2418 Language Engineering course at KTH. The goal is to build a narrative text summarizer by fine-tuning Transformer models (namely, T5 and BART, via PyTorch and Hugging Face Transformers) to generate movie and TV episode summaries from the corresponding plots.
The project focuses on training and evaluating Transformer-based models for movie summarization. The main steps involved in the project are:
- Data Preparation: Preprocessing the Narrasum dataset, including cleaning, tokenization, and formatting the data according to the T5/BART input requirements (see the tokenization sketch after this list).
- Model Training: Fine-tuning the T5 and BART models on the preprocessed data. The models can be trained on a GPU for improved performance (see train_X_with_cuda.py).
- Evaluation: Assessing the summarization quality of the trained models with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics, which are the standard automatic metrics for summarization tasks.
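The actual preprocessing is implemented in the training scripts; purely as an illustration of the data-preparation step, a minimal tokenization sketch with Hugging Face Transformers could look like the following (the checkpoint name, data file paths, and column names are assumptions, not the project's actual values):

```python
# Minimal tokenization sketch; the column names "document"/"summary" and the data
# file paths are assumptions. The real preprocessing lives in t5/train_t5.py and
# bart/train_bart.py.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # or "facebook/bart-base"

def preprocess(batch):
    # T5 expects a task prefix; BART does not need one.
    inputs = ["summarize: " + doc for doc in batch["document"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Hypothetical loading call; point it at wherever the Narrasum files live locally.
dataset = load_dataset("json", data_files={"train": "data/train.json",
                                           "validation": "data/validation.json"})
tokenized = dataset.map(preprocess, batched=True)
```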
To train and evaluate the models, first install the required dependencies:

```bash
pip install -r requirements.txt
```
Run this to prepare the data and start training the T5 model:

```bash
python t5/train_t5.py
```
And this to prepare the data and start training the BART model:

```bash
python bart/train_bart.py
```
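The training scripts wrap the full pipeline; as a hedged sketch of what fine-tuning either model with Hugging Face's Seq2SeqTrainer can look like (the checkpoint name and hyperparameter values below are placeholders, the real ones come from config.py):

```python
# Sketch of seq2seq fine-tuning; checkpoint and hyperparameters are placeholders.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "t5-small"  # or "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],       # tokenized splits as in the preprocessing sketch
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```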
Run this to evaluate the T5 model:

```bash
python t5/testing_t5.py
```
And this to evaluate the BART model:

```bash
python bart/testing_bart.py
```
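The testing scripts take care of scoring; for reference, ROUGE can be computed with the evaluate library roughly as follows (a sketch with placeholder predictions and references, not the project's actual evaluation code):

```python
# Sketch of ROUGE scoring with the `evaluate` library; the texts are placeholders.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["a generated summary of the plot"]
references = ["the reference summary of the plot"]
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum scores
```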
The hyperparameters for the models can be adjusted in the `config.py` files in the `t5` and `bart` folders.
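The contents of the config.py files are project-specific; a purely hypothetical example of the kind of hyperparameters such a file might expose:

```python
# Hypothetical config.py contents; field names and values are illustrative only.
MODEL_NAME = "t5-small"
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
LEARNING_RATE = 3e-5
BATCH_SIZE = 4
NUM_EPOCHS = 3
```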
- Torch documentation: Official documentation for the Torch framework.
- Hugging Face Transformers documentation: Documentation for the Transformers library, which provides pre-trained models and tools for natural language processing tasks.
- Narrasum dataset: A collection of movie and TV episode plots paired with reference summaries, used for training and evaluating the models.
- ROUGE metrics: Recall-oriented n-gram overlap metrics used to evaluate the generated summaries against the reference summaries.
- T5 model: A text-to-text encoder-decoder Transformer that can be fine-tuned for summarization.
- BART model: A denoising sequence-to-sequence Transformer that can be fine-tuned for summarization.
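Both models are loaded through the same Transformers API, so generating a summary from a fine-tuned checkpoint can look roughly like this sketch (the checkpoint path and example plot are placeholders):

```python
# Sketch of generating a summary from a fine-tuned checkpoint; the path is a placeholder.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("checkpoints/best")
model = AutoModelForSeq2SeqLM.from_pretrained("checkpoints/best")

plot = "A retired detective is pulled back into one last case..."
# The "summarize: " prefix is T5-specific; drop it for a BART checkpoint.
inputs = tokenizer("summarize: " + plot, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```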