This is a curated list of papers that I have encountered in some capacity and deem worth including in the NLP practitioner's library. Some papers may appear in multiple sub-categories, if they don't fit easily into one of the boxes.
PRs are absolutely welcome! Direct any correspondence/questions to @mihail_eric.
Some special designations for certain papers:
💡 LEGEND: This is a game-changer in the NLP literature and worth reading.
📼 RESOURCE: This paper introduces some dataset/resource and hence may be useful for application purposes.
- (2000) A Statistical Part-of-Speech Tagger
- TLDR: Seminal paper demonstrating a powerful HMM-based POS tagger. Many tips and tricks for building such classical systems included.
 
- (2003) Feature-rich part-of-speech tagging with a cyclic dependency network
- TLDR: Proposes a number of powerful linguistic features for building a (then) SOTA POS-tagging system
 
- (2015) Bidirectional LSTM-CRF Models for Sequence Tagging
- TLDR: Proposes a sequence-tagging model combining neural networks with conditional random fields, achieving SOTA in POS-tagging, NER, and chunking.
 
- (2003) Accurate unlexicalized parsing 💡
- TLDR: Beautiful paper demonstrating that unlexicalized probabilistic context-free grammars can exceed the performance of lexicalized PCFGs.
 
- (2006) Learning Accurate, Compact, and Interpretable Tree Annotation
- TLDR: Fascinating result showing that using expectation-maximization you can automatically learn accurate and compact latent nonterminal symbols for tree annotation, achieving SOTA.
 
- (2014) A Fast and Accurate Dependency Parser using Neural Networks
- TLDR: Very important work ushering in a new wave of neural network-based parsing architectures, achieving SOTA performance as well as blazing parsing speeds.
 
- (2014) Grammar as a Foreign Language
- TLDR: One of the earliest demonstrations of the effectiveness of seq2seq architectures with attention on constituency parsing, achieving SOTA on the WSJ corpus. Also showed the importance of data augmentation for the parsing task.
 
- (2015) Transition Based Dependency Parsing with Stack Long Short Term Memory
- TLDR: Presents stack LSTMs, a neural parser that successfully neuralizes the traditional push-pop operations of transition-based dependency parsers, achieving SOTA in the process.
 
- (2005) Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
- TLDR: Using cool Monte Carlo methods combined with a conditional random field model, this work achieves a huge error reduction in certain information extraction benchmarks.
 
- (2015) Bidirectional LSTM-CRF Models for Sequence Tagging
- TLDR: Proposes a sequence-tagging model combining neural networks with conditional random fields, achieving SOTA in POS-tagging, NER, and chunking.
 
- (2010) A multi-pass sieve for coreference resolution 💡
- TLDR: Proposes a sieve-based approach to coreference resolution that for many years (until deep learning approaches) was SOTA.
 
- (2015) Entity-Centric Coreference Resolution with Model Stacking
- TLDR: This work offers a nifty approach to building coreference chains iteratively using entity-level features.
 
- (2016) Improving Coreference Resolution by Learning Entity-Level Distributed Representations
- TLDR: One of the earliest effective approaches to using neural networks for coreference resolution, significantly outperforming the SOTA.
 
- (2012) Baselines and Bigrams: Simple, Good Sentiment and Topic Classification
- TLDR: Very elegant paper, illustrating that simple Naive Bayes models with bigram features can outperform more sophisticated methods like support vector machines on tasks such as sentiment analysis.
 
- (2013) Recursive deep models for semantic compositionality over a sentiment treebank 📼
- TLDR: Introduces the Stanford Sentiment Treebank, a wonderful resource for fine-grained sentiment annotation on sentences. Also introduces the Recursive Neural Tensor Network, a neat linguistically-motivated deep learning architecture.
 
- (2014) Distributed Representations of Sentences and Documents
- TLDR: Introduces Paragraph Vector, an unsupervised algorithm that learns fixed-length representations of paragraphs, using ideas inspired by Word2Vec. Achieves then-SOTA results on sentiment analysis on the Stanford Sentiment Treebank and the IMDB dataset.
 
- (2019) Unsupervised Data Augmentation for Consistency Training
- TLDR: Introduces Unsupervised Data Augmentation (UDA), a method for efficient training on a small number of labeled examples. The paper applies UDA to the IMDB sentiment analysis dataset, achieving SOTA with only 20 labeled examples.
 
- (2007) Natural Logic for Textual Inference
- TLDR: Proposes a rigorous logic-based approach to the problem of textual inference called natural logic. Very cool mathematically-motivated transforms are used to deduce the relationship between phrases.
 
- (2008) An Extended Model of Natural Logic
- TLDR: Extends previous work on natural logic for inference, adding phenomena such as semantic exclusion and implicativity to enhance the premise-hypothesis transform process.
 
- (2014) Recursive Neural Networks Can Learn Logical Semantics
- TLDR: Demonstrates that deep learning architectures such as neural tensor networks can effectively be applied to natural language inference.
 
- (2015) A large annotated corpus for learning natural language inference 📼
- TLDR: Introduces the Stanford Natural Language Inference corpus, a wonderful NLI resource larger by two orders of magnitude over previous datasets.
 
- (1993) The Mathematics of Statistical Machine Translation 💡
- TLDR: Introduces the IBM machine translation models, several seminal models in statistical MT.
 
- (2002) BLEU: A Method for Automatic Evaluation of Machine Translation 📼
- TLDR: Proposes BLEU, the de facto evaluation technique used for machine translation (even today!)
 
- (2003) Statistical Phrase-Based Translation
- TLDR: Introduces a phrase-based translation model for MT, doing nice analysis that demonstrates why phrase-based models outperform word-based ones.
 
- (2014) Sequence to Sequence Learning with Neural Networks 💡
- TLDR: Introduces the sequence-to-sequence neural network architecture. While only applied to MT in this paper, it has since become one of the cornerstone architectures of modern natural language processing.
 
- (2015) Neural Machine Translation by Jointly Learning to Align and Translate 💡
- TLDR: Extends previous sequence-to-sequence architectures for MT by using the attention mechanism, a powerful tool for allowing a target word to softly search for important signal from the source sentence.
 
- (2015) Effective approaches to attention-based neural machine translation
- TLDR: Introduces two new attention mechanisms for MT, using them to achieve SOTA over existing neural MT systems.
 
- (2016) Neural Machine Translation of Rare Words with Subword Units
- TLDR: Introduces byte pair encoding, an effective technique for allowing neural MT systems to handle (more) open-vocabulary translation.
 
- (2016) Pointing the Unknown Words
- TLDR: Proposes a copy-mechanism for allowing MT systems to more effectively copy words from a source context sequence.
 
- (2016) Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- TLDR: A wonderful case-study demonstrating what a production-capacity machine translation system (in this case that of Google) looks like.
 
- (2013) Semantic Parsing on Freebase from Question-Answer Pairs 💡 📼
- TLDR: Proposes an elegant technique for semantic parsing that learns directly from question-answer pairs, without the need for annotated logical forms, allowing the system to scale up to Freebase.
 
- (2014) Semantic Parsing via Paraphrasing
- TLDR: Develops a unique paraphrase model for learning appropriate candidate logical forms from question-answer pairs, improving SOTA on existing Q/A datasets.
 
- (2015) Building a Semantic Parser Overnight 📼
- TLDR: Neat paper showing that a semantic parser can be built from scratch starting with no training examples!
 
- (2015) Bringing Machine Learning and Computational Semantics Together
- TLDR: A nice overview of a computational semantics framework that uses machine learning to effectively learn logical forms for semantic parsing.
 
- (2016) A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
- TLDR: A great wake-up call paper, demonstrating that SOTA performance can be achieved on certain reading comprehension datasets using simple systems with carefully chosen features. Don't forget non-deep learning methods!
 
- (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text 📼
- TLDR: Introduces the SQuAD dataset, a question-answering corpus that has become one of the de facto benchmarks used today.
 
- 
- TLDR: Introduces an unsupervised method that can answer incomplete questions over a knowledge graph by maintaining conversation context using the entities and predicates seen so far and automatically inferring missing or ambiguous pieces for follow-up questions.
 
- (2019) Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
- TLDR: Introduces a new graph-based recurrent retrieval approach, which retrieves reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions.
 
- (2019) Abductive Commonsense Reasoning
- TLDR: Introduces a dataset and conceptualizes two new tasks for abductive reasoning: Abductive Natural Language Inference and Abductive Natural Language Generation.
 
- (2020) Differentiable Reasoning over a Virtual Knowledge Base
- TLDR: Introduces a neural module for multi-hop question answering, which is differentiable and can be trained end-to-end.
 
- (2020) Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
- TLDR: Presents an approach to open-domain question answering that relies on retrieving support passages before processing them with a generative model.
 
- (2020) DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
- TLDR: Presents a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers, reducing runtime compute.
 
- (2020) Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering
- TLDR: Introduces a simple, fast, and unsupervised iterative evidence retrieval method for multi-hop question answering.
 
- 
- TLDR: Presents an approach to generating questions in a semi-autoregressive fashion using two graphs based on passages and answers.
 
- (2020) What Question Answering can Learn from Trivia Nerds
- TLDR: Presents insights into what the question answering task can learn from trivia tournaments.
 
- (2020) Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings
- TLDR: Presents an approach that is effective at multi-hop KGQA over sparse knowledge graphs.
 
- (2004) ROUGE: A Package for Automatic Evaluation of Summaries 📼
- TLDR: Introduces ROUGE, an evaluation metric for summarization that is used to this day on a variety of sequence transduction tasks.
 
- (2004) TextRank: Bringing Order into Texts
- TLDR: Applying graph-based text analysis techniques based on PageRank, the authors achieve SOTA results on keyword extraction and very strong extractive summarization results in an unsupervised fashion.
 
- (2015) Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems
- TLDR: Proposes a neural natural language generator that jointly optimises sentence planning and surface realization, outperforming other systems on human eval.
 
- (2016) Pointing the Unknown Words
- TLDR: Proposes a copy-mechanism for allowing MT systems to more effectively copy words from a source context sequence.
 
- (2017) Get To The Point: Summarization with Pointer-Generator Networks
- TLDR: This work offers an elegant soft copy mechanism, that drastically outperforms the SOTA on abstractive summarization.
 
- (2020) A Generative Model for Joint Natural Language Understanding and Generation
- TLDR: This work presents a generative model which couples NLU and NLG through a shared latent variable, achieving state-of-the-art performance on two dialogue datasets with both flat and tree-structured formal representations
 
- (2020) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- TLDR: Presents BART, a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text.
 
- (2011) Data-Driven Response Generation in Social Media
- TLDR: Proposes applying phrase-based statistical machine translation methods to the problem of response generation.
 
- (2015) Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems
- TLDR: Proposes a neural natural language generator that jointly optimises sentence planning and surface realization, outperforming other systems on human eval.
 
- (2016) How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation 💡
- TLDR: Important work demonstrating that existing automatic metrics used for dialogue correlate very poorly with human judgment.
 
- (2016) A Network-based End-to-End Trainable Task-oriented Dialogue System
- TLDR: Proposes a neat architecture for decomposing a dialogue system into a number of individually-trained neural network components.
 
- (2016) A Diversity-Promoting Objective Function for Neural Conversation Models
- TLDR: Introduces a maximum mutual information objective function for training dialogue systems.
 
- (2016) The Dialogue State Tracking Challenge Series: A Review
- TLDR: A nice overview of the dialogue state tracking challenges for dialogue systems.
 
- (2017) A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue
- TLDR: Shows that simple sequence-to-sequence architectures with a copy mechanism can perform competitively on existing task-oriented dialogue datasets.
 
- (2017) Key-Value Retrieval Networks for Task-Oriented Dialogue 📼
- TLDR: Introduces a new multidomain dataset for task-oriented dialogue as well as an architecture for softly incorporating information from structured knowledge bases into dialogue systems.
 
- (2017) Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings 📼
- TLDR: Introduces a new collaborative dialogue dataset, as well as an architecture for representing structured knowledge via knowledge graph embeddings.
 
- (2017) Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning
- TLDR: Introduces a hybrid dialogue architecture that can be jointly trained via supervised learning as well as reinforcement learning and combines neural network techniques with fine-grained rule-based approaches.
 
- (1971) Procedures as a Representation for Data in a Computer Program for Understanding Natural Language
- TLDR: One of the seminal papers in computer science, introducing SHRDLU, an early system for computers understanding human language commands.
 
- (2016) Learning language games through interaction
- TLDR: Introduces a novel setting for interacting with computers to accomplish a task where only natural language can be used to communicate with the system!
 
- (2017) Naturalizing a programming language via interactive learning
- TLDR: Very cool work allowing a community of workers to iteratively naturalize a language starting with a core set of commands in an interactive task.
 
- (1996) An Empirical Study of Smoothing Techniques for Language Modelling
- TLDR: Performs an extensive survey of smoothing techniques in traditional language modelling systems.
 
- (2003) A Neural Probabilistic Language Model 💡
- TLDR: A seminal work in deep learning for NLP, introducing one of the earliest effective models for neural network-based language modelling.
 
- (2014) One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling 📼
- TLDR: Introduces the Google One Billion Word language modelling benchmark.
 
- (2015) Character-Aware Neural Language Models
- TLDR: Proposes a language model using convolutional neural networks that can employ character-level information, performing on-par with word-level LSTM systems.
 
- (2016) Exploring the Limits of Language Modeling
- TLDR: Introduces a mega language model system using deep learning that combines a variety of techniques and significantly outperforms the SOTA on the One Billion Word Benchmark.
 
- (2018) Deep contextualized word representations 💡 📼
- TLDR: This paper introduces ELMo, a super powerful collection of word embeddings learned from the intermediate representations of a deep bidirectional LSTM language model. Achieved SOTA on 6 diverse NLP tasks.
 
- (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 💡
- TLDR: One of the most important papers of 2018, introducing BERT, a powerful architecture pretrained using language modelling which is then effectively transferred to other domain-specific tasks.
 
- (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding 💡
- TLDR: Generalized autoregressive pretraining method that improves upon BERT by maximizing the expected likelihood over all permutations of the factorization order.
 
- (1997) Long Short-Term Memory 💡
- TLDR: Introduces the LSTM recurrent unit, a cornerstone of modern neural network-based NLP
 
- (2000) Maximum Entropy Markov Models for Information Extraction and Segmentation 💡
- TLDR: Introduces Maximum Entropy Markov Models for information extraction, a commonly used ML technique in classical NLP.
 
- (2010) From Frequency to Meaning: Vector Space Models of Semantics
- TLDR: A wonderful survey of existing vector space models for learning semantics in text.
 
- (2012) An Introduction to Conditional Random Fields
- TLDR: A nice, in-depth overview of conditional random fields, a commonly-used sequence-labelling model.
 
- (2013) Distributed Representations of Words and Phrases and their Compositionality 💡 📼
- TLDR: Introduces word2vec, a collection of distributed vector representations that have been commonly used for initializing word embeddings in basically every NLP architecture of the last five years.
 
- (2014) GloVe: Global Vectors for Word Representation 💡 📼
- TLDR: Introduces GloVe word embeddings, one of the most commonly used pretrained word embedding techniques across all flavors of NLP models.
 
- (2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
- TLDR: Important paper demonstrating that context-predicting distributional semantics approaches outperform count-based techniques.
 
- (2015) Improving Distributional Similarity with Lessons Learned From Word Embeddings 💡
- TLDR: Demonstrates that traditional distributional semantics techniques can be enhanced with certain design choices and hyperparameter optimizations that make their performance rival that of neural network-based embedding methods.
 
- (2018) Universal Language Model Fine-tuning for Text Classification
- TLDR: Provides a smorgasbord of nice techniques for finetuning language models that can be effectively transferred to text classification tasks.
 
- (2019) Analogies Explained: Towards Understanding Word Embeddings
- TLDR: Very nice work providing a mathematical formalism for understanding some of the paraphrasing properties of modern word embeddings.