The objective of this project was to become familiar with neural networks for natural language processing and to compare the performance of several models on a summarization task.
The comparison was done in two stages:
- Without fine-tuning – testing how well pre-trained models perform out of the box.
- With fine-tuning – evaluating improvements after adapting the models specifically to the task.
To test the summarization abilities of different models, I chose to work with chapters from the novel *Lord of the Mysteries*.
I first scraped the chapters from novelfull and extracted the summaries from dragneelclub.
Both were then exported as a JSON file named `lotm_dataset.json` (see `scrap_chapter.py` for the code).
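For illustration, here is a minimal sketch of what the scraping step could look like; the URL and CSS selector below are placeholders, not the ones actually used in `scrap_chapter.py`:

```python
import json

import requests
from bs4 import BeautifulSoup

def fetch_chapter(url):
    """Download one chapter page and return its plain text (selector is hypothetical)."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.select("div.chapter-content p")  # actual class depends on the site layout
    return "\n".join(p.get_text(strip=True) for p in paragraphs)

# Hypothetical loop over chapter URLs; the real script also pairs each chapter with its summary.
dataset = []
for i in range(1, 11):
    text = fetch_chapter(f"https://example.com/lord-of-the-mysteries/chapter-{i}")
    dataset.append({"chapter": i, "text": text})

with open("lotm_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```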
Next, the dataset was cleaned:
- Translator and editor notes were removed from the chapters.
- Advertisements were removed from the summaries.
The cleaned version was exported as `lotm_clean_dataset.json` (see `clean_data_lotm.py`).
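The cleaning itself boils down to pattern matching; here is a rough sketch of the idea (the regular expressions and field names are illustrative examples, not the exact rules from `clean_data_lotm.py`):

```python
import json
import re

with open("lotm_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

for entry in dataset:
    # Drop translator/editor notes such as "TL Note: ..." (example pattern, field names assumed).
    entry["text"] = re.sub(r"(?:TL|Translator|Editor)'?s? Note\s*:.*", "", entry["text"])
    # Drop advertisement lines from the summaries (example pattern).
    entry["summary"] = re.sub(r"^.*advertisement.*$\n?", "", entry["summary"],
                              flags=re.IGNORECASE | re.MULTILINE)

with open("lotm_clean_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```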
I started by importing and splitting my cleaned dataset (`lotm_clean_dataset.json`).
To speed up testing, I kept about 10 texts for applying the models and getting a rough idea of which one would be most effective for summarization.
The first run was done with 70 texts (see `model_comparison_previous_version.json`) and gave similar results.
Later, when I added a seed to the split, I reduced the set to 10 to save time. Since the results were consistent, I kept this smaller set.
The rest of the dataset was exported as a JSON file (`model_train.json`).
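A minimal sketch of what such a seeded split might look like (the seed value, split size, and use of scikit-learn are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json("lotm_clean_dataset.json")

# Keep a small, reproducible evaluation set of ~10 chapters; the rest is saved for fine-tuning.
df_train, df_eval = train_test_split(df, test_size=10, random_state=42)

df_train.to_json("model_train.json")
```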
The three lines below (found at the start of the `model_comparison.py` file) create the list of models used. You can easily add or remove models by editing these lists.

```python
causal_model = ["gpt2-medium", "microsoft/phi-3-mini-128k-instruct"]
seq2seq_model = ["google/flan-t5-base", "facebook/bart-large-cnn", "google/pegasus-xsum", "t5-large", "allenai/led-base-16384"]
model_name = causal_model + seq2seq_model
```
For texts that exceeded the input size of the models, I looked up different strategies, such as:
- Using extractive models to shrink the text to fit the input size.
- Splitting the text into chunks matching the input size, summarizing each chunk, concatenating the partial summaries, and then summarizing the result.
I chose the chunk-based approach (see `summarize_by_chunk.py` in the `my_classes` folder).
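The helper below is a simplified sketch of this idea; the real implementation in `summarize_by_chunk.py` additionally handles model loading and model-specific token limits:

```python
def summarize_by_chunk(text, tokenizer, summarize_fn, max_input_tokens):
    """Split `text` into chunks that fit the model, summarize each chunk,
    then summarize the concatenation of the partial summaries."""
    token_ids = tokenizer.encode(text)
    chunks = [
        tokenizer.decode(token_ids[i:i + max_input_tokens], skip_special_tokens=True)
        for i in range(0, len(token_ids), max_input_tokens)
    ]
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    combined = " ".join(partial_summaries)
    # If the concatenated partial summaries still do not fit, repeat the process;
    # otherwise summarize the concatenation once to obtain the final summary.
    if len(tokenizer.encode(combined)) > max_input_tokens:
        return summarize_by_chunk(combined, tokenizer, summarize_fn, max_input_tokens)
    return summarize_fn(combined)
```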
Finally, I created a DataFrame with three columns:
- Model name
- Chapter number
- Generated summary
The results were then stored in the JSON file `model_comparison.json`.
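As a toy example of the structure (the column names here are paraphrases; the actual labels used in `model_comparison.py` may differ):

```python
import pandas as pd

# Placeholder rows illustrating the three-column layout.
rows = [
    {"model": "facebook/bart-large-cnn", "chapter": 1, "summary": "..."},
    {"model": "google/pegasus-xsum", "chapter": 1, "summary": "..."},
]
df_predict = pd.DataFrame(rows, columns=["model", "chapter", "summary"])
df_predict.to_json("data/model_comparison.json")
```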
Using the previous results, two main DataFrames were created:
- `df_benchmark`, which contains ROUGE and BERTScore results comparing the generated summaries with the original text.
- `df_grading`, which contains ROUGE and BERTScore results comparing the generated summaries with the human-made summaries.
To get an overall view of model performance, the mean scores were computed and stored in two additional DataFrames:
- `df_global_score`, which averages the scores from `df_grading`.
- `df_benchmark_score`, which averages the scores from `df_benchmark`.
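A hedged sketch of how these scores could be computed with the Hugging Face `evaluate` library; the function name `grade`, the `references` mapping, and the column names are assumptions for illustration:

```python
import evaluate
import pandas as pd

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def grade(df_predict, references):
    """Score each generated summary against a reference text
    (the human-made summary for df_grading, the original chapter for df_benchmark)."""
    rows = []
    for _, row in df_predict.iterrows():
        ref = references[row["chapter"]]
        r = rouge.compute(predictions=[row["summary"]], references=[ref])
        b = bertscore.compute(predictions=[row["summary"]], references=[ref], lang="en")
        rows.append({"model": row["model"], "chapter": row["chapter"],
                     "rouge1": r["rouge1"], "bertscore_f1": b["f1"][0]})
    return pd.DataFrame(rows)

# df_grading = grade(df_predict, human_summaries)    # references: human-made summaries
# df_benchmark = grade(df_predict, chapter_texts)    # references: original chapter texts
# df_global_score = df_grading.groupby("model")[["rouge1", "bertscore_f1"]].mean()
```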
Our results for the BERTScore F1 and ROUGE-1 scores can be found in the plot below.
Next, I attempted to train the three best-performing models (`phi-3-mini-128k-instruct`, `bart-large-cnn`, and `led-base-16384`) using my dataset (`model_train.json`):
- The texts were split into chunks and summarized in the same way as when the models were applied.
- The concatenated partial summaries (shorter than the model's input size) were exported to files named `summary_for_training_{model_name}`.
- These concatenated summaries were then used as input to fine-tune the models, with the human-made summaries serving as the target (for the code, see `Train_model.py`).
Important: Due to resource limitations, I was unable to fine-tune `phi-3-mini-128k-instruct`, as it caused memory errors. Fine-tuning was therefore performed only on `bart-large-cnn` and `led-base-16384`.
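For orientation, here is a minimal sketch of what a seq2seq fine-tuning run in the spirit of `Train_model.py` could look like with the Hugging Face `Trainer` API; the input file name, column names, and hyperparameters below are simplified assumptions:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_id = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Inputs: concatenated partial summaries; targets: human-made summaries.
# The file name and column names below are assumptions for illustration.
df = pd.read_json("summary_for_training_facebook_bart-large-cnn.json")
dataset = Dataset.from_pandas(df)

def preprocess(batch):
    model_inputs = tokenizer(batch["input_text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["target_summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="trained_facebook_bart-large-cnn",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("trained_facebook_bart-large-cnn")
```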
Here, I slightly adjusted the code from `model_comparison.py` (see `Trained_model_comparison.py`) so that it could apply the fine-tuned models, just like it did with the pre-trained ones.
The process is the same, and it exports a file similar to `model_comparison.json` but for the fine-tuned models, named `Trained_model_comparison.json`.
At the beginning of the `for` loop, `model_tag` can be set depending on whether you want to run a locally saved trained model (obtained by running `Train_model.py`) or the fine-tuned model hosted on Hugging Face that was trained for this project.
```python
for model_ in model_name:
    torch.cuda.empty_cache()
    # Use this option to load the locally saved model (from `Train_model.py`)
    # model_tag = f"trained_{model_.replace('/', '_')}"
    # Use this option to load the model uploaded on Hugging Face
    model_tag = f"Lambda-ck/{model_.split('/')[1]}-lotm-fine-tuned"
```

The fine-tuned models are compared in the same way as the pre-trained ones.
The results are presented in the notebook `Results_presentation.ipynb`. To switch between pre-trained and fine-tuned model results, simply change the data source:

```python
# For pre-trained models
df_predict = pd.read_json('data/model_comparison.json')
# For fine-tuned models
df_predict = pd.read_json('data/Trained_model_comparison.json')
```

At the end of the `Results_presentation` notebook, we use the ipywidgets library to display a series of HTML pages containing the generated summaries, in order to check whether our first grading method (using ROUGE and BERTScore) holds up against human evaluation.
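A rough sketch of how such an interactive view can be put together with ipywidgets (the widget layout and column names below are assumptions, not the exact notebook code):

```python
import ipywidgets as widgets
from IPython.display import display

# `df_predict` is the DataFrame of generated summaries loaded above.
chapters = sorted(df_predict["chapter"].unique())
dropdown = widgets.Dropdown(options=chapters, description="Chapter:")
output = widgets.HTML()

def show_chapter(change):
    rows = df_predict[df_predict["chapter"] == change["new"]]
    output.value = "".join(f"<h3>{r['model']}</h3><p>{r['summary']}</p>"
                           for _, r in rows.iterrows())

dropdown.observe(show_chapter, names="value")
show_chapter({"new": chapters[0]})
display(dropdown, output)
```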
You can find the generated summaries (the reduced version, with only 10 chapters summarized) on my website.
