Language modeling

Language modeling is the task of predicting the next word or character in a document.

* Indicates models using dynamic evaluation.

Word Level Models

Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank, as pre-processed by Mikolov et al. (2010). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing, words were lower-cased, numbers were replaced with N, newlines were replaced with <eos>, and all other punctuation was removed. The vocabulary is the most frequent 10k words, with the rest of the tokens replaced by an <unk> token. Models are evaluated based on perplexity, which is the exponential of the average negative per-word log-probability (lower is better).
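As a minimal sketch of how the perplexity metric is computed, the snippet below exponentiates the average negative log-probability that a model assigns to each target word. The function name `perplexity` and the toy inputs are illustrative assumptions, not part of any benchmark tooling.

```python
import math


def perplexity(log_probs):
    """Perplexity from the natural-log probabilities of each target word.

    `log_probs` is a hypothetical list with one entry per evaluated word.
    """
    if not log_probs:
        raise ValueError("need at least one word to evaluate")
    # Average negative log-likelihood per word (in nats).
    avg_nll = -sum(log_probs) / len(log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every word has perplexity close to 4.
uniform = [math.log(0.25)] * 8
# A more confident model gets a lower (better) perplexity.
confident = [math.log(0.5)] * 8
print(perplexity(uniform), perplexity(confident))
```

Intuitively, a perplexity of k means the model is on average as uncertain as if it were choosing uniformly among k words.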

| Model | Validation perplexity | Test perplexity | Paper / Source |
| --- | --- | --- | --- |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | Dynamic Evaluation of Neural Sequence Models |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | Regularizing and Optimizing LSTM Language Models |
| AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | Direct Output Connection for a High-Rank Language Model |
| AWD-LSTM-MoS (Yang et al., 2018) | 56.54 | 54.44 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
| AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | Regularizing and Optimizing LSTM Language Models |

WikiText-2

WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.

| Model | Validation perplexity | Test perplexity | Paper / Source |
| --- | --- | --- | --- |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | Dynamic Evaluation of Neural Sequence Models |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | Regularizing and Optimizing LSTM Language Models |
| AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | Direct Output Connection for a High-Rank Language Model |
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | Regularizing and Optimizing LSTM Language Models |

WikiText-103

The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.
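A frequency cutoff like the one described for WikiText-103 can be sketched as follows. This is only an illustration: the helper names `build_vocab` and `apply_vocab` are hypothetical, and mapping rarer words to an `<unk>` token is an assumption borrowed from the Penn Treebank description, not a statement about the official WikiText tooling.

```python
from collections import Counter

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary words


def build_vocab(train_tokens, min_count=3):
    """Keep every word that occurs at least `min_count` times in training."""
    counts = Counter(train_tokens)
    return {word for word, count in counts.items() if count >= min_count}


def apply_vocab(tokens, vocab):
    """Replace words outside the vocabulary with the UNK placeholder."""
    return [tok if tok in vocab else UNK for tok in tokens]


train = "the cat sat on the mat and the dog sat on the cat".split()
vocab = build_vocab(train, min_count=3)
print(sorted(vocab))
print(apply_vocab("the dog sat".split(), vocab))
```

The vocabulary is built from the training split only, so validation and test words are judged against the same fixed set.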


| Model | Validation perplexity | Test perplexity | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | Fast Parametric Learning with Activation Memorization | |
| LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | Fast Parametric Learning with Activation Memorization | |
| LSTM (Rae et al., 2018) | 36.0 | 36.4 | Fast Parametric Learning with Activation Memorization | |
| Gated CNN (Dauphin et al., 2016) | - | 37.2 | Language modeling with gated convolutional networks | |
| Temporal CNN (Bai et al., 2018) | - | 45.2 | Convolutional sequence modeling revisited | |
| LSTM (Graves et al., 2014) | - | 48.7 | Neural turing machines | |

Character Level Models

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
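Character-level results on this dataset are reported in bits per character (BPC). A minimal sketch, assuming the model exposes a natural-log probability for each target byte, is the average negative log-probability converted to base 2. The helper names `bits_per_character` and `unique_byte_count` are illustrative, and the second one simply shows how a count of distinct byte values could be obtained.

```python
import math


def bits_per_character(log_probs):
    """Average negative log2-probability per character.

    `log_probs` holds one natural-log probability per target character.
    """
    if not log_probs:
        raise ValueError("need at least one character to evaluate")
    # Dividing by ln(2) converts nats to bits.
    return -sum(log_probs) / (len(log_probs) * math.log(2))


def unique_byte_count(data: bytes) -> int:
    """Number of distinct byte values that appear in `data`."""
    return len(set(data))


# A model that assigns probability 1/2 to every character scores about 1 BPC.
print(bits_per_character([math.log(0.5)] * 10))
print(unique_byte_count(b"hello"))
```

Because the metric is a base-2 cross-entropy, lower values are better.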

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| --- | --- | --- | --- |
| Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | Character-Level Language Modeling with Deeper Self-Attention |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models |
| 3 layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks |
| Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | Multiplicative LSTM for sequence modelling |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |

Text8

The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lower-cased so that it contains only the 26 letters of the English alphabet plus spaces.
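The effect of this normalization on the character set can be illustrated with a simplified sketch. It is not the script used to build text8: the function name `normalize_text8_style` is hypothetical, XML stripping is not handled, and every run of characters outside a-z is simply collapsed into a single space.

```python
import re


def normalize_text8_style(text):
    """Lower-case text and keep only the letters a-z separated by single spaces.

    Simplified illustration only; the real text8 pipeline does more than this.
    """
    lowered = text.lower()
    # Collapse every run of characters outside a-z into one space.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()


sample = "Anarchism is a Political Philosophy (1840)."
cleaned = normalize_text8_style(sample)
print(cleaned)
# The result uses at most 27 distinct symbols: 26 letters plus the space.
print(set(cleaned) <= set("abcdefghijklmnopqrstuvwxyz "))
```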

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| --- | --- | --- | --- |
| Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | Character-Level Language Modeling with Deeper Self-Attention |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks |
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling |

Penn Treebank

The vocabulary of the character-level dataset is limited to 10,000 words, the same vocabulary as used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions are limited to those found within the restricted word-level vocabulary.

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| --- | --- | --- | --- |
| 3 layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
| 6 layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
| FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks |
| FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks |
| NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning |
| 2-Layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks |
