
Context Fine-Tuned, GPT-2 Auto-Completion Text Editor

Building a Better Text-completion Prediction Engine with Transformers

 

Five journalists have lost their lives in the course of performing their work in just the first month of the 2022 Russia-Ukraine War.

5 journalists killed in 1 month

The Internet is awash with examples of auto-completion fails. In particular, the next-word suggestions offered by the generic virtual keyboard of a smartphone or tablet remain woefully poor even for casual conversational text, and more so for writing articles in specialized domains such as international journalism or technical blogging.

How do I convert to ...?


I fine-tuned a Hugging Face DistilGPT2 Language Model on a corpus consisting of just one week of news articles from the NYT and AP News at the start of the war. The news articles were scraped using the NYT Developer APIs in conjunction with Beautiful Soup for HTML parsing.
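
As an illustration, the scraping flow can be sketched roughly as follows. This is a minimal sketch, not the exact logic of nyt.py: it assumes the NYT Article Search API endpoint, and the helper names (search_articles, scrape_body) are hypothetical.

```python
# Hedged sketch: fetch article metadata from the NYT Article Search API,
# then pull visible paragraph text from each article page with Beautiful Soup.
import os
import requests
from bs4 import BeautifulSoup

API_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
API_KEY = os.environ["NYT_API_KEY"]  # same variable used by nyt.py

def search_articles(query, page=0):
    """Return the 'docs' list for one page of Article Search results."""
    params = {"q": query, "page": page, "api-key": API_KEY}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def scrape_body(url):
    """Extract paragraph text from an article page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

for doc in search_articles("Ukraine"):
    body = scrape_body(doc["web_url"])
    # ... write `body` to a file under dataset/orig/ for preprocessing
```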

Preprocessing was done in stages. The first step used Python regular expressions for cleanup, and entities of interest were identified and whitelisted. The second step used NER with spaCy to normalize the names of people and places, which improves model performance. The final step combined all sentences from all articles into a single line of text, which is then chunked into contexts of 128 tokens for fine-tuning a causal language model; a rough sketch of these last two steps follows.
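
This is a minimal sketch: the whitelist, canonical spellings, and helper names are assumptions for illustration, not the exact preproc2.py / combine.py implementation.

```python
# Hedged sketch: normalize whitelisted PERSON/GPE mentions to a canonical
# spelling with spaCy NER, then chunk the combined corpus into 128-token blocks.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_md")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# Illustrative whitelist of canonical spellings (assumption, not the repo's list)
CANONICAL = {"Volodymyr Zelensky": "Zelensky", "Kiev": "Kyiv"}

def normalize_entities(text):
    """Replace whitelisted PERSON/GPE mentions with their canonical spelling."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "GPE") and ent.text in CANONICAL:
            out.append(text[last:ent.start_char])
            out.append(CANONICAL[ent.text])
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

def chunk_ids(text, block_size=128):
    """Tokenize the combined one-line corpus and split it into fixed-length blocks."""
    ids = tokenizer(text)["input_ids"]
    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
```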

The goal is to show a proof-of-concept of a context fine-tuned, text-prediction auto-completion editor for journalists who have to report from the field under trying conditions and severe time pressures.


In the video below, the curses-based auto-completion editor is evaluated on a new article written by the NYT after corpus collection. It achieved a Top-10 match rate of 59%, meaning the writer needed to type only about 2 out of every 5 words in the article when writing on a touchscreen device (e.g. a phone or tablet). The Top-1 prediction rate was 28%.

[Video: live auto-completion on an out-of-sample NYT article]


The following table shows the results of evaluating the auto-completion engine on three new articles written by the NYT subsequent to corpus collection.

| Article  | Top-1 Hit % | Top-5 Hit % | Top-10 Hit % |
|----------|-------------|-------------|--------------|
| new1.txt | 28%         | 51%         | 59%          |
| new2.txt | 28%         | 49%         | 57%          |
| new3.txt | 27%         | 48%         | 57%          |

This shows that, if necessary (e.g. due to limited screen space), limiting the predictions shown to just the Top-5 hits still retains roughly 85% of the editor's word-prediction capability.
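
For reference, a Top-k hit rate of this kind can be measured roughly as follows. This is a minimal sketch; predict.py's actual tokenization, word-matching rules, and handling of multi-token words may differ.

```python
# Hedged sketch: count how often the true next word appears among the
# model's top-k single-token suggestions given the preceding text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./distilgpt2-mlm")
model = AutoModelForCausalLM.from_pretrained("./distilgpt2-mlm").eval()

def topk_hit_rate(words, k=10):
    hits = 0
    for i in range(1, len(words)):
        context = " ".join(words[:i])
        ids = tok(context, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]           # next-token distribution
        top_ids = logits.topk(k).indices.unsqueeze(-1)   # k candidate tokens
        suggestions = tok.batch_decode(top_ids)
        if any(s.strip().lower() == words[i].lower() for s in suggestions):
            hits += 1
    return hits / (len(words) - 1)

# e.g. topk_hit_rate(open("dataset/test/new1.txt").read().split(), k=10)
```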

Prediction results for the evaluation texts can be found in /results. These are ANSI-colored text files; to view one, run clear && cat xxx.ansi in a color-enabled terminal window, or preview it in VS Code using the ANSI Colors extension.


The video below shows the auto-completion engine being evaluated on new3.txt; the word coloring shows the distribution of Top-1/5/10 hits and which types of words were successfully predicted. Pass the -m option to predict.py to have it annotate the ranked predictions in the output text for all missed words.

[Video: annotated evaluation run on new3.txt]


Installation

git clone https://github.com/ixig/GPT-2_AutoComplete.git
cd GPT-2_AutoComplete
conda create --name gpt2ac --file requirements.txt
conda activate gpt2ac
python -m spacy download en_core_web_md

Usage

export NYT_API_KEY="..."

# Run the following command daily to collect more articles
python nyt.py dataset/orig

# Once the corpus has been assembled, pre-process the data
python preproc1.py dataset/orig/ dataset/preproc1/
python preproc2.py dataset/preproc1/ dataset/preproc2/
python combine.py dataset/preproc2/ dataset/combine.txt

# Fine-Tuning takes only ~8mins on a P100 GPU
jupyter-notebook UkraineTrainTextGen.ipynb

# Copy the trained model to your local machine
cp <hf_model_ckpt> ./distilgpt2-mlm

# Start typing a fresh article on the (rudimentary!) editor
python editor.py _new1.txt

# Evaluate the performance of the model on an out-of-sample article
# Make sure to run preproc1.py on it first (no need for preproc2)
python predict.py dataset/test/new1.txt _new1.txt -m
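
For reference, the fine-tuning done in UkraineTrainTextGen.ipynb is a standard causal-LM run. Below is a minimal sketch using the Hugging Face Trainer, assuming the single-line corpus produced by combine.py; the hyperparameters are illustrative, not the notebook's exact settings.

```python
# Hedged sketch: fine-tune DistilGPT2 as a causal LM on 128-token chunks.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilgpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Chunk the combined one-line corpus into fixed-length blocks of 128 tokens
text = open("dataset/combine.txt").read()
ids = tok(text)["input_ids"]
blocks = [ids[i:i + 128] for i in range(0, len(ids) - 128 + 1, 128)]
train_ds = Dataset.from_dict({"input_ids": blocks})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilgpt2-mlm",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM objective
)
trainer.train()
trainer.save_model("distilgpt2-mlm")  # then used by editor.py / predict.py via -d
```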

nyt.py :

usage: nyt.py [-h] [-k API_KEY] save_dir

Scrape NYT News Articles using NYT Developer APIs

positional arguments:
  save_dir    Saved Articles Directory

optional arguments:
  -k API_KEY  NYT Developer API Key

The API key can be provided by setting the environment variable NYT_API_KEY,
or overridden using the -k command-line argument

preproc1.py :

usage: preproc1.py [-h] input_dir output_dir

Preprocess Text Files - Step #1

positional arguments:
  input_dir   Input Files Directory
  output_dir  Output Files Directory

preproc2.py :

usage: preproc2.py [-h] input_dir output_dir

Preprocess Text Files - Step #2

positional arguments:
  input_dir   Input Files Directory
  output_dir  Output Files Directory

combine.py :

usage: combine.py [-h] input_dir fout

Combine Individual Text Files into Single (one-line) File

positional arguments:
  input_dir   Input Files Directory
  fout        Output Combined Texts File

editor.py :

usage: editor.py [-h] [-d CKPT] fout

Text Auto-Completer Editor

positional arguments:
  fout        Output File

optional arguments:
  -h, --help  show this help message and exit
  -d CKPT     Model Checkpoint Directory

predict.py :

usage: predict.py [-h] [-m] [-d CKPT] fin fout

Evaluate performance of Text Auto-Completer on file

positional arguments:
  fin         Input File for Evaluation
  fout        Output File with Predictions Annotated

optional arguments:
  -h, --help  show this help message and exit
  -m          Annotate with Prediction Misses
  -d CKPT     Model Checkpoint Directory
