Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges

arXiv: https://arxiv.org/abs/2405.15604

This is the official repository for the paper Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges.

This repository is under construction. Please be patient until we add more information. Thank you!

Methodology

The paper documents the detailed pipeline of this systematic literature review. The overview figure shows how we sample the 244 works relevant to text generation.

Text Generation Tasks

Our literature review identifies the five most prominent areas related to text generation: open-ended text generation, summarization, translation, paraphrasing, and question answering.

Task | Description
Open-ended text generation | Newly generated text is iteratively conditioned on the previous context.
Summarization | Generating a text from one or more source texts, conveying the information in a shorter format.
Translation | Converting a source text in language A into a target language B.
Paraphrasing | Generating text that has (approximately) identical meaning but uses different words or structures.
Question answering | Takes a question as input and outputs a streamlined answer or a list of possible answers.

For each of these tasks, we identify major sub-tasks and relevant challenges.
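
To make the definition of open-ended text generation above concrete, here is a toy sketch (not code from this repository); the function names and the stand-in toy_language_model are hypothetical placeholders for an actual model, but the loop shows the iterative conditioning on the previous context.

```python
# Toy sketch (not from this repository) of the iterative conditioning
# behind open-ended text generation: each new token is chosen based on
# everything generated so far, then appended to the context.
import random


def toy_language_model(context: list[str]) -> str:
    """Stand-in for a real language model; samples an arbitrary next word."""
    vocabulary = ["the", "story", "continues", "and", "ends", "."]
    random.seed(len(context))  # deterministic toy behavior
    return random.choice(vocabulary)


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = toy_language_model(tokens)  # condition on previous context
        tokens.append(next_token)                # the context grows each step
    return " ".join(tokens)


print(generate("Once upon a time"))
```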

Evaluation Metrics

We provide an overview of model-free and model-based automatic metrics as well as methodologies for human evaluation. The "Used" column indicates how many of our 136 filtered Semantic Scholar documents consider the metric (by proposing, surveying, or applying it).

We find that model-free n-gram-based metrics are by far the most widely used metrics within the works we cover. Model-based approaches are usually employed in a hybrid manner, combining embeddings with rule-based methods. Several works use human evaluation to measure performance but often disregard inter-annotator agreement scores.

Type | Category | Metric | Description | Used
Model-free | N-gram | BLEU | Textual overlap between source and reference (precision). | 69
Model-free | N-gram | ROUGE | Textual overlap between source and reference (recall). | 46
Model-free | N-gram | METEOR | Textual overlap between source and reference (precision and recall). | 32
Model-free | N-gram | CIDEr | Measures consensus on multiple reference texts. | 15
Model-free | N-gram | chrF++ | Character-based F-score computed using n-grams. | 13
Model-free | N-gram | Dist-n | Measures generation diversity by the percentage of distinct n-grams. | 8
Model-free | N-gram | NIST | Alters BLEU to also consider n-gram informativeness. | 6
Model-free | N-gram | Self-BLEU | Measures generation diversity by calculating BLEU between generated samples. | 2
Model-free | Statistical | Perplexity | Fluency metric based on the likelihood of word sequences. | 23
Model-free | Statistical | Word Error Rate | The rate of words that differ from a reference sequence, based on the Levenshtein distance. | 11
Model-free | Graph | SPICE | Measures the semantic similarity of two texts by the distance of their scene graphs. | 6
Model-based | Hybrid | BERTScore | Contextual token similarity to measure textual overlap. | 13
Model-based | Hybrid | MoverScore | Uses contextualized embeddings and captures both intersection and deviation from the reference for a similarity score. | 6
Model-based | Hybrid | Word Mover Distance | Distance metric to measure the dissimilarity of two texts. | 2
Model-based | Trained | BLEURT | Models human judgment on text quality. | 4
Model-based | Trained | BARTScore | Promptable metric that models human judgments on faithfulness besides precision and recall. | 3
Human | Performance | Likert Scale | Humans rate on a scale, e.g., from 1 (horrible quality) to 5 (perfect quality). | 22
Human | Performance | Pairwise Comparison | Humans choose the better of two samples. | 10
Human | Performance | Turing Test | Quantifies how distinguishable human text is from machine-generated text. | 6
Human | Performance | Binary | Humans answer binary questions with yes or no. | 3
Human | Performance | Best-Worst Scaling | From a list of examples, humans select the best and worst output. | 2
Human | Agreement | Krippendorff Alpha | Measures the disagreement between annotators for nominal, ordinal, and metric data. | 4
Human | Agreement | Fleiss Kappa | Measures the agreement on nominal data among a fixed number of annotators. | 4
Human | Agreement | Pearson Correlation | Displays the agreement between annotators by measuring linear correlation. | 3
Human | Agreement | Spearman Correlation | Displays monotonic relationships on ranked data. | 2
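
To illustrate how one of the simpler model-free metrics in the table is computed, here is a minimal sketch (not code from this repository) of Dist-n as described above: the share of distinct n-grams among all n-grams produced across a set of generated texts. The function name dist_n and the whitespace tokenization are assumptions made for brevity.

```python
# Minimal sketch (not part of this repository) of the Dist-n diversity
# metric: the ratio of distinct n-grams to total n-grams across a set
# of generated samples.
def dist_n(generations: list[str], n: int = 2) -> float:
    """Return the ratio of distinct n-grams to total n-grams."""
    all_ngrams = []
    for text in generations:
        tokens = text.split()  # whitespace tokenization for simplicity
        all_ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


samples = ["the cat sat on the mat", "the cat sat on the couch"]
print(f"Dist-2: {dist_n(samples, n=2):.2f}")
```

A higher value means fewer repeated n-grams across the generated samples, i.e., more diverse output.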

Setup

Install

We recommend using Python 3.10 for this project.

First, install the requirements: pip install -r requirements.txt

Code Structure

The project includes multiple scripts, each used for a separate part of the pipeline.

  1. setup.py: Defines the parameters used for searching and filtering the scientific works.
  2. tokens.py: You need an API key to use the Semantic Scholar API. This is the place to put it.
  3. search.py: The initial retrieval of scientific works through the Semantic Scholar API.
  4. filter.py: The automated filtering process that selects the top five works per query and year by influential citation counts (a rough sketch of the search and filter steps follows below).
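
The following is a rough, hypothetical sketch of what search.py and filter.py do, based on the descriptions above: query the Semantic Scholar Graph API with an API key and keep the top five works per query and year by influential citation count. The function names, the example query, and the field selection are assumptions; the actual queries and parameters live in setup.py, and the key belongs in tokens.py.

```python
# Hypothetical sketch of the search-and-filter pipeline described above;
# the actual scripts in this repository may use different queries, fields,
# and parameters.
import requests
from collections import defaultdict

API_KEY = "YOUR_SEMANTIC_SCHOLAR_API_KEY"  # normally kept in tokens.py
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def search(query: str, limit: int = 100) -> list[dict]:
    """Retrieve candidate papers for one query via the Semantic Scholar API."""
    response = requests.get(
        SEARCH_URL,
        params={
            "query": query,
            "limit": limit,
            "fields": "title,year,influentialCitationCount",
        },
        headers={"x-api-key": API_KEY},
    )
    response.raise_for_status()
    return response.json().get("data", [])


def filter_top_per_year(papers: list[dict], top_k: int = 5) -> list[dict]:
    """Keep the top-k papers per publication year by influential citations."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper.get("year")].append(paper)
    selected = []
    for year, group in by_year.items():
        group.sort(key=lambda p: p.get("influentialCitationCount") or 0, reverse=True)
        selected.extend(group[:top_k])
    return selected


if __name__ == "__main__":
    candidates = search("text generation survey")  # example query, not from setup.py
    for paper in filter_top_per_year(candidates):
        print(paper["year"], paper["title"])
```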

Run

Run the pipeline steps in order: 1) python search.py, then 2) python filter.py.


Citation

If you use this repository or our paper in your research, please cite us as follows.

@misc{becker2024text,
      title={Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges}, 
      author={Jonas Becker and Jan Philip Wahle and Bela Gipp and Terry Ruas},
      year={2024},
      eprint={2405.15604},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
