What happens when a computer tries to write a children's story? Why, this, of course!
Statistical language models have a long history in natural language processing: from the humble n-gram model, which gathers statistics over a corpus about the co-occurrence of n (1, 2, 3, 4...) words or sub-word tokens, to the probabilistic context-free grammar, which imposes statistical likelihood over a pre-defined grammar, to neural language models, which harness biologically inspired neural networks in a variety of architectures to tackle the problem of language modeling.
The task of these models is simple: Given some context, compute a distribution over what comes next, or equivalently, compute the likelihood of a given block of text. For example, a high-quality model would predict that "The dog went to the pound" is more likely than "The dog went to the Vatican," and given the context, "The dog went to the," it would likewise assign higher probability to "pound" than "Vatican."
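For the curious, here's roughly what that looks like in code with the Hugging Face transformers library. This is a minimal sketch against the generic "gpt2" checkpoint (not this project's fine-tuned weights); note that a word like "Vatican" may be split into several sub-word tokens, so we only peek at the probability of its first piece:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The dog went to the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The distribution over the *next* token comes from the last position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" pound", " Vatican"]:
    # A word may be split into several BPE tokens; for this illustration
    # we only look at the probability of its first piece.
    first_id = tokenizer.encode(word)[0]
    print(f"P({word!r} starts next) = {next_token_probs[first_id].item():.6f}")
```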
Text generation is a simple extension of this idea. Starting with some context, the model iteratively predicts (or better yet, samples) the next token, adds that token to the context, predicts a further token, and on and on. Supposing that we chose "Vatican" at the first step above, we would add it to our context (now, "The dog went to the Vatican"), then likely predict either a comma or a period ("The dog went to the Vatican,"), then, skipping ahead a few steps ("The dog went to the Vatican, where he had tea with Pope "), we would likely predict, say, "Pius" over "Smith" as the next token.
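A bare-bones version of that loop might look like the following (again just a sketch against the generic "gpt2" checkpoint, using greedy decoding, i.e. always taking the single most likely next token):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The dog went to the", return_tensors="pt").input_ids

for _ in range(20):  # generate 20 more tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()                                # most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)   # grow the context

print(tokenizer.decode(input_ids[0]))
```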
A dog having tea with the Pope is far from the most likely continuation of our original context, but it makes for an interesting story. In fact, humans rarely say the most likely thing, so in order to mimic the richness of human speech and tell an interesting story, sampling from the likelihood distribution of next tokens, rather than simply choosing the most likely one, is critical.
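Continuing from the snippet above, swapping greedy decoding for sampling is a one-call change with `model.generate`; the particular settings shown here (temperature, top-k, top-p) are illustrative defaults rather than the exact values this project uses:

```python
sampled = model.generate(
    input_ids,
    do_sample=True,          # sample from the distribution...
    temperature=0.9,         # ...slightly sharpened
    top_k=50,                # only consider the 50 most likely tokens
    top_p=0.95,              # nucleus sampling
    max_new_tokens=60,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(sampled[0]))
```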
In the past few years, something remarkable has happened: deep neural network language models have progressed to a point where the text that they generate is starting to make sense. This is due to the advent of the transformer model, which has enabled the training of language models over huge corpora of high-quality text. The resultant pre-trained models (BERT, GPT-2, Transformer-XL, etc.) can then be adapted relatively cheaply to specific tasks using transfer learning.
This work is one such example. Using a pre-trained GPT-2 network (the "small"-size, 12-layer version made available by the folks at Hugging Face) as a base, the model used in this project has been fine-tuned on approximately 100 children's stories made available through Project Gutenberg via the bAbI project. The result is a model that can (or at least tries to) write stories in the combined styles of Charles Dickens, Harriet Elisabeth Beecher Stowe, Lewis Carroll, and 11 other well-loved authors.
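For a sense of what that fine-tuning step involves, here is a rough sketch using the transformers `Trainer` API; the file name `childrens_stories.txt` and the hyperparameters are placeholders, not the project's actual training script:

```python
from transformers import (
    GPT2LMHeadModel, GPT2TokenizerFast,
    TextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # the 117M-parameter "small" model

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="childrens_stories.txt",  # hypothetical path to the concatenated stories
    block_size=512,                     # training sequence length
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-childrens-stories",
        num_train_epochs=3,              # illustrative values
        per_device_train_batch_size=2,
        save_steps=500,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("gpt2-childrens-stories")
```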
In addition to being an example of transfer learning for text generation, this project is also an example of the power of React and Material-UI for building modern web apps. The entire mobile- and touch-friendly, shiny modern user interface was built with fewer than 1000 lines of highly readable code. Check it out!
Correct. As mentioned above, this project uses the small version of GPT-2, which has a "mere" 117M parameters 😏. By contrast, the XL model, the one they originally said was too dangerous to release, has 1558M parameters 😳.
So why use the small one?
- Training: The model was fine-tuned on Google Colab, which is an awesome free service. Unfortunately, the runtimes they provide don't have enough memory to handle anything larger than the small model, so fine-tuning anything bigger wouldn't be free at this time.
- Hosting: The model is hosted as cheaply as possible (and as such quite suboptimally) on Google Cloud Run (see the backend readme for more details). Cloud Run is CPU-only and supports a maximum of 2GB memory per worker, so the larger models would not only be prohibitively slow (the small model already takes over a minute to generate a 500-token story), but actually wouldn't even fit in memory (see the back-of-the-envelope math below).
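That memory claim is easy to check with some quick arithmetic (float32 weights only, ignoring activations and framework overhead):

```python
# Each float32 parameter takes 4 bytes.
for name, params in [("small", 117e6), ("XL", 1558e6)]:
    gb = params * 4 / 1e9
    print(f"GPT-2 {name}: ~{gb:.1f} GB of weights alone (Cloud Run cap: 2 GB)")

# GPT-2 small: ~0.5 GB of weights alone (Cloud Run cap: 2 GB)
# GPT-2 XL: ~6.2 GB of weights alone (Cloud Run cap: 2 GB)
```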
That said, if you think this is super cool and want to run the XL model on your Cloud TPUs (or sponsor me to set that up), you're more than welcome to. Just make sure to give credit where credit's due 😉.