This repository provides an implementation of an unsupervised method for extractive multi-document summarization, based on the approach proposed in Lamsiyah et al. (2021). It supports two techniques for sentence embeddings, transformer-based and compression-based, and a scoring system that ranks sentences by their significance. The results are presented in the tables below.
Embeddings from pre-trained transformer models are used in this method. The following embeddings are considered:
- Pooler Output: If available, the pooler output embeddings from the transformer model are used.
- [CLS] Token: If the pooler output is not available, the [CLS] (or its equivalent) token embeddings from the transformer model are used.
Sentences are tokenized, padded, and truncated to a specified maximum length, and then passed through the transformer model.
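For illustration, a minimal sketch of this step using Hugging Face `transformers` (the checkpoint, `max_length`, and helper name are placeholders; the repository's actual code may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed_sentences(sentences, checkpoint="sentence-transformers/all-mpnet-base-v2", max_length=384):
    """One embedding per sentence: the pooler output if the model provides it, else the [CLS] token."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    # Pad and truncate all sentences to the same maximum length.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    if getattr(outputs, "pooler_output", None) is not None:
        return outputs.pooler_output              # (num_sentences, hidden_size)
    return outputs.last_hidden_state[:, 0, :]     # embedding of the first ([CLS]) token
```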
Compression-based embeddings are derived using the `gzip` compression algorithm. The distance between two sentences is computed by compressing their concatenation and comparing its length with the lengths of the individually compressed sentences, which yields a similarity measure between sentences. This is an adaptation of the approach presented in Jiang et al. (2023), modified for extractive summarization.
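For reference, a minimal sketch of a gzip-based distance in the spirit of Jiang et al. (2023); the exact concatenation and normalization used in this repository may differ:

```python
import gzip

def gzip_distance(s1: str, s2: str) -> float:
    """Normalized compression distance between two sentences using gzip."""
    c1 = len(gzip.compress(s1.encode()))
    c2 = len(gzip.compress(s2.encode()))
    c12 = len(gzip.compress(" ".join([s1, s2]).encode()))
    # Smaller values mean the two sentences share more compressible structure.
    return (c12 - min(c1, c2)) / max(c1, c2)

def gzip_similarity(s1: str, s2: str) -> float:
    """Convert the distance into a similarity score (assumed convention)."""
    return 1.0 - gzip_distance(s1, s2)
```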
The approach can be parallelized in two ways to leverage multiple cores: by parallelizing over the sentences within a single sample, or by parallelizing directly over the samples themselves. The latter is used here, distributing whole samples across the available cores, because the average number of sentences per sample is significantly smaller than the total number of samples.
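Schematically, the sample-level parallelization can look like this with Python's `multiprocessing` (function names are placeholders):

```python
from multiprocessing import Pool, cpu_count

def summarize_sample(sample):
    """Placeholder: embed the sample's sentences, score them, and return the selected summary."""
    raise NotImplementedError

def summarize_all(samples):
    # Each worker processes whole samples independently, one process per available core.
    with Pool(processes=cpu_count()) as pool:
        return pool.map(summarize_sample, samples)
```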
The content relevance score for a sentence $S_{D_i}$ is the cosine similarity between its embedding and the centroid embedding of its cluster:

$$\text{score}_{\text{content}}(S_{D_i}) = \frac{\vec{S_{D_i}} \cdot \vec{C_D}}{\lVert \vec{S_{D_i}} \rVert \, \lVert \vec{C_D} \rVert}$$

Where:

- $\vec{S_{D_i}}$ represents the embedding vector of sentence $S_i$.
- $\vec{C_D}$ denotes the centroid embedding vector of cluster $D$.
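A small NumPy sketch of this score, assuming `embeddings` holds the sentence embeddings of one cluster (one row per sentence):

```python
import numpy as np

def content_relevance_scores(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of every sentence embedding to the cluster centroid."""
    centroid = embeddings.mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    return embeddings @ centroid / norms
```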
The novelty score for a sentence $S_{D_i}$ penalizes redundancy with respect to the sentence most similar to it in the cluster:

$$\text{score}_{\text{novelty}}(S_{D_i}) =
\begin{cases}
1 & \text{if } \text{sim}(S_i, S_l) < \tau \\
1 - \text{sim}(S_i, S_l) & \text{if } \text{sim}(S_i, S_l) \ge \tau \text{ and } \text{score}_{\text{content}}(S_{D_i}) > \text{score}_{\text{content}}(S_{D_l}) \\
0 & \text{otherwise}
\end{cases}$$

Where:

- $\text{sim}(S_i, S_k)$ indicates the similarity between sentence $S_i$ and the other sentences in cluster $D$, calculated as the cosine similarity of their embedding vectors.
- $l = \arg\max_{k \ne i} \text{sim}(S_i, S_k)$ is the index of the sentence most similar to $S_i$ in cluster $D$.
- $\tau$ is the redundancy threshold (the `--tau` argument below).
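A sketch of the novelty score as written above, assuming a precomputed pairwise similarity matrix `sim` and the content relevance scores from the previous snippet:

```python
import numpy as np

def novelty_scores(sim: np.ndarray, content: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Novelty of each sentence relative to its most similar sentence in the cluster."""
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)            # ignore self-similarity
    scores = np.zeros(len(content))
    for i in range(len(content)):
        l = int(np.argmax(sim[i]))            # index of the most similar sentence
        if sim[i, l] < tau:
            scores[i] = 1.0                   # sufficiently different from everything else
        elif content[i] > content[l]:
            scores[i] = 1.0 - sim[i, l]       # redundant, but more relevant than its neighbour
        # otherwise the sentence keeps a novelty score of 0
    return scores
```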
The position score for a sentence $S_{d_i}$ favors sentences that appear early in their document:

$$\text{score}_{\text{position}}(S_{d_i}) = \max\left(0.5,\ \exp\left(\frac{-p(S_{d_i})}{3\sqrt{M_d}}\right)\right)$$

Where:

- $p(S_{d_i})$ is the position of sentence $S_i$ in document $d$, starting from 1.
- $M_d$ is the total number of sentences in document $d$.
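A one-function sketch of the position score as written above:

```python
import numpy as np

def position_scores(num_sentences: int) -> np.ndarray:
    """Position score for every sentence of a document; earlier sentences score higher."""
    positions = np.arange(1, num_sentences + 1)   # 1-based positions
    return np.maximum(0.5, np.exp(-positions / (3 * np.sqrt(num_sentences))))
```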
The final score for a sentence is a weighted combination of the three scores above:

$$\text{score}(S_{D_i}) = \alpha \, \text{score}_{\text{content}}(S_{D_i}) + \beta \, \text{score}_{\text{novelty}}(S_{D_i}) + \gamma \, \text{score}_{\text{position}}(S_{D_i})$$

Subject to:

$$\alpha + \beta + \gamma = 1, \qquad \alpha, \beta, \gamma \in [0, 1]$$

The sentences with the highest final scores are selected as the summary.
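Putting it together, a sketch of the final scoring and selection step (the default weights and `sum_size` here mirror the evaluation command below; restoring document order is an assumption):

```python
import numpy as np

def select_summary(sentences, content, novelty, position,
                   alpha=0.6, beta=0.2, gamma=0.2, sum_size=8):
    """Rank sentences by the weighted score and return the top `sum_size` of them."""
    scores = alpha * content + beta * novelty + gamma * position
    top = np.argsort(scores)[::-1][:sum_size]      # indices of the highest-scoring sentences
    # Returning them in original document order is an assumed convention for readability.
    return [sentences[i] for i in sorted(top)]
```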
The results of both embeddings, compared to the lead baseline (i.e., selecting the first $m$ sentences as the summary), are displayed in the tables below. These results are derived from subsets of the CNN/DailyMail and PubMed datasets. The `rouge` metric from `evaluate` was used, which in my experience tends to yield slightly lower scores than `pyrouge`. After tokenizing the texts into sentences with `spacy`, the average summary length, in terms of the number of sentences, is:
The transformer-based embeddings generally outperform the compression-based embeddings across all metrics. However, the latter are faster to compute, making them a suitable choice for larger datasets. It's noteworthy that the lead baseline performs better on the news dataset, where initial sentences are typically more informative than those in the article's middle. Adjusting hyperparameters for scoring and using larger dataset subsets could produce more robust results.
**CNN/DailyMail**

| Method | Rouge-1 ↑ | Rouge-2 ↑ | Rouge-L ↑ | Time (s) ↓ |
|---|---|---|---|---|
| lead | 34.57 | 14.54 | 22.42 | - |
| all-mpnet-base-v2 | 33.37 | 13.44 | 21.35 | 643 |
| gzip | 32.03 | 12.49 | 21.06 | 399 |
**PubMed**

| Method | Rouge-1 ↑ | Rouge-2 ↑ | Rouge-L ↑ | Time (s) ↓ |
|---|---|---|---|---|
| lead | 21.52 | 9.41 | 14.94 | - |
| all-mpnet-base-v2 | 41.33 | 16.91 | 21.31 | 1536 |
| gzip | 36.67 | 12.24 | 19.35 | 877 |
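The ROUGE scores above were computed with the `evaluate` library; loading and using the metric looks roughly like this (a generic sketch, not the repository's exact evaluation code):

```python
import evaluate

# Load the ROUGE metric from the evaluate library.
rouge = evaluate.load("rouge")

scores = rouge.compute(
    predictions=["a generated summary ..."],   # placeholder summaries
    references=["a reference summary ..."],    # placeholder references
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```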
To evaluate the models and replicate the results, execute the following:
```bash
# --method:         'transformer' or 'compression'
# --embedding_from: 'pooler' or 'cls', or define it yourself
# --dataset:        'pubmed' or 'cnn_dailymail', or define it yourself
# --device:         'cuda' or 'cpu'
python clustsum/eval.py \
    --method 'transformer' \
    --checkpoint 'sentence-transformers/all-mpnet-base-v2' \
    --embedding_from 'pooler' \
    --max_length 384 \
    --dataset 'pubmed' \
    --subset 1000 \
    --batch_size 2 \
    --device 'cuda' \
    --sum_size 8 \
    --tau 0.95 \
    --alpha 0.6 \
    --beta 0.2 \
    --gamma 0.2
```
To play around with single samples, take a look at the `playground.ipynb` notebook.