This is the official repository for the paper Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges.
This repository is under construction. Please be patient until we add more information. Thank you!
The paper documents the detailed pipeline of this systematic literature review. The accompanying figure gives an overview of how we sampled the 244 works relevant to text generation.
Our literature review identifies the five most prominent areas related to text generation, namely open-ended text generation, summarization, translation, paraphrasing, and question answering.
Task | Description |
---|---|
Open-ended text generation | Newly generated text is iteratively conditioned on the previous context. |
Summarization | Generating a text from one or more texts conveying information in a shorter format. |
Translation | Converting a source text in language A to a target language B. |
Paraphrasing | Generating text that has (approximately) identical meaning but uses different words or structures. |
Question answering | Generating a streamlined answer or a list of possible answers from an input question.
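To make the iterative conditioning in open-ended text generation concrete, below is a minimal sketch using the Hugging Face `transformers` library. The model choice (`gpt2`), the prompt, and the decoding parameters are illustrative assumptions and are not taken from the surveyed works.

```python
# Minimal open-ended generation sketch: each new token is sampled conditioned
# on the prompt plus all previously generated tokens. Model and decoding
# settings are illustrative assumptions, not setups from the surveyed papers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Text generation research has", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,  # length of the continuation
    do_sample=True,     # sample instead of greedy decoding
    top_p=0.95,         # nucleus sampling
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```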
For each of these tasks, we identify major sub-tasks and relevant challenges.
We provide an overview of model-free and model-based automatic metrics as well as methodologies for human evaluation. "Used" counts how many of the 136 filtered Semantic Scholar documents consider the metric in their publication, whether by proposing, surveying, or applying it.
We find that model-free, n-gram-based metrics are by far the most used metrics within the works we cover. Model-based approaches are usually employed in a hybrid manner, combining embeddings with rule-based methods. Many works use human evaluation to measure performance but often disregard inter-annotator agreement scores.
| Type | Category | Metric | Description | Used |
|---|---|---|---|---|
| Model-free | N-gram | BLEU | Textual overlap between source and reference (precision). | 69 |
| | | ROUGE | Textual overlap between source and reference (recall). | 46 |
| | | METEOR | Textual overlap between source and reference (precision and recall). | 32 |
| | | CIDEr | Measures consensus on multiple reference texts. | 15 |
| | | chrF++ | Character-based F-score computed using n-grams. | 13 |
| | | Dist-n | Measures generation diversity by the percentage of distinct n-grams. | 8 |
| | | NIST | Alters BLEU to also consider n-gram informativeness. | 6 |
| | | Self-BLEU | Measures generation diversity by calculating BLEU between generated samples. | 2 |
| | Statistical | Perplexity | Fluency metric based on the likelihood of word sequences. | 23 |
| | | Word Error Rate | The rate of words that differ from a reference sequence, based on the Levenshtein distance. | 11 |
| | Graph | SPICE | Measures the semantic similarity of two texts by the distance of their scene graphs. | 6 |
| Model-based | Hybrid | BERTScore | Contextual token similarity to measure textual overlap. | 13 |
| | | MoverScore | Uses contextualized embeddings and captures both intersection and deviation from the reference for a similarity score. | 6 |
| | | Word Mover Distance | Distance metric to measure the dissimilarity of two texts. | 2 |
| | Trained | BLEURT | Models human judgement on text quality. | 4 |
| | | BARTScore | Promptable metric that models human judgments on faithfulness besides precision and recall. | 3 |
| Human | Performance | Likert Scale | Humans choose on a scale, e.g., from 1 (horrible quality) to 5 (perfect quality). | 22 |
| | | Pairwise Comparison | Humans choose the better of two samples. | 10 |
| | | Turing Test | Quantifies how distinguishable human text is from machine-generated text. | 6 |
| | | Binary | Humans answer binary questions with yes or no. | 3 |
| | | Best-Worst Scaling | From a list of examples, humans select the best and worst output. | 2 |
| | Agreement | Krippendorff Alpha | Measures the disagreement between annotators for nominal, ordinal, and metric data. | 4 |
| | | Fleiss Kappa | Measures the agreement on nominal data among a fixed number of annotators. | 4 |
| | | Pearson Correlation | Displays the agreement between annotators by measuring linear correlation. | 3 |
| | | Spearman Correlation | Displays monotonic relationships on ranked data. | 2 |
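As an illustration of how such metrics are computed, here is a minimal sketch of two model-free metrics from the table: corpus-level BLEU via the `sacrebleu` package and a from-scratch Dist-n. The package choice and the toy data are assumptions for illustration and are not prescribed by the paper.

```python
# Toy illustration of two model-free metrics from the table above.
# sacrebleu is one common BLEU implementation; using it here is an assumption.
from collections import Counter

import sacrebleu


def dist_n(texts, n):
    """Dist-n: share of distinct n-grams among all n-grams in the generations."""
    ngrams = Counter()
    for text in texts:
        tokens = text.split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


hypotheses = ["the cat sat on the mat", "a dog slept on the rug"]
# One reference stream: references[0][i] is the reference for hypotheses[i].
references = [["the cat sat on the mat", "the dog slept on the rug"]]

print(f"BLEU:   {sacrebleu.corpus_bleu(hypotheses, references).score:.1f}")
print(f"Dist-2: {dist_n(hypotheses, 2):.2f}")
```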
We recommend using Python 3.10 for this project.
First, install the requirements:
```bash
pip install -r requirements.txt
```
The project has multiple scripts included, each used for separate parts of the pipeline.
- `setup.py`: Defines the parameters used for searching and filtering the scientific works.
- `tokens.py`: You need an API key to use the Semantic Scholar API. This is the place to put it.
- `search.py`: The initial retrieval of scientific works through the Semantic Scholar API (a rough sketch of such a query is shown below).
- `filter.py`: The automated filtering process that selects the top five works per query and year by influential citation count (see the sketch after the run commands below).
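For orientation, here is a rough sketch of the kind of request `search.py` issues against the Semantic Scholar Graph API. The query string, requested fields, and limit are placeholder assumptions; the actual parameters live in `setup.py` and the API key in `tokens.py`.

```python
# Rough sketch of a Semantic Scholar Graph API paper search (what search.py
# does conceptually). Query, fields, and limit are placeholder assumptions.
import requests

API_KEY = "YOUR_SEMANTIC_SCHOLAR_API_KEY"  # in this repository it belongs in tokens.py

response = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    headers={"x-api-key": API_KEY},
    params={
        "query": "text generation",  # placeholder search query
        "fields": "title,year,influentialCitationCount",
        "limit": 100,
    },
)
response.raise_for_status()
for paper in response.json().get("data", []):
    print(paper.get("year"), paper.get("title"))
```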
Run the parts of the pipeline in order: 1) `python search.py` and 2) `python filter.py`.
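The filtering step can be pictured as the grouping-and-ranking routine below. This is a hedged sketch of the idea described above, not the code in `filter.py`; the record layout and the `influentialCitationCount` field name (taken from the Semantic Scholar response schema) are assumptions.

```python
# Hedged sketch of the filtering idea: keep the five most influential works
# per (query, year) pair. The record layout here is an assumption, not the
# actual data structure used in filter.py.
from collections import defaultdict


def top_five_per_query_and_year(papers):
    groups = defaultdict(list)
    for paper in papers:
        groups[(paper["query"], paper["year"])].append(paper)
    selected = []
    for group in groups.values():
        group.sort(key=lambda p: p.get("influentialCitationCount", 0), reverse=True)
        selected.extend(group[:5])
    return selected
```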
If you use this repository or our paper in your research, please cite us as follows.
```bibtex
@misc{becker2024text,
      title={Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges},
      author={Jonas Becker and Jan Philip Wahle and Bela Gipp and Terry Ruas},
      year={2024},
      eprint={2405.15604},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```