This file is no longer actively maintained. If you are interested in maintaining/updating it, feel free to update by raising PRs or by reaching out to jsaimurali001 [at] gmail [dot] com
- Natural Language Processing (almost) from Scratch, Collobert et al. 2011
- Word2Vec, Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013a
- Word2Vec, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. 2013b
💡 see Ruder's three-part explanation (p1, p2, p3) along with his Aylien blog, Chris McCormick's take on Negative Sampling here (with resources to reimplement it), here for backprop derivations in word2vec, and here to download pretrained embeddings
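💡 For anyone reimplementing word2vec, here is a minimal NumPy sketch of one skip-gram negative-sampling update in the spirit of Mikolov et al. 2013b; the vocabulary size, dimensions, learning rate and uniform negative-sampling distribution are toy assumptions, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                               # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))     # "input" (center word) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))    # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.025):
    """One SGD step of skip-gram with negative sampling for a single
    (center, context) pair; k negative words are drawn uniformly here
    (word2vec actually uses a unigram^0.75 distribution)."""
    negatives = rng.integers(0, V, size=k)
    v_c = W_in[center]
    # positive pair: push sigmoid(u_o . v_c) towards 1
    u_o = W_out[context]
    grad_pos = sigmoid(u_o @ v_c) - 1.0
    # negative pairs: push sigmoid(u_n . v_c) towards 0
    u_n = W_out[negatives]
    grad_neg = sigmoid(u_n @ v_c)             # shape (k,)
    # gradient w.r.t. the center embedding, computed before touching W_out
    grad_v = grad_pos * u_o + grad_neg @ u_n
    W_out[context] -= lr * grad_pos * v_c
    W_out[negatives] -= lr * grad_neg[:, None] * v_c
    W_in[center] -= lr * grad_v

sgns_step(center=3, context=17)
```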
- https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html (G. Lample et al. 2016)
- ELMo, Deep contextualized word representations, Peters et al. 2018
- FLAIR, Contextual String Embeddings for Sequence Labeling, Akbik et al. 2018 [CODE]
- Character-Level Language Modeling with Deeper Self-Attention, Rami et al. 2018
- FastText, Enriching Word Vectors with Subword Information, Bojanowski et al. 2016
- Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2015 [also see this and this]
💡 de-biasing, robustness to spelling errors, etc.
- Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network, Sakaguchi et al. 2016
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, Bolukbasi et al. 2016
- Learning Gender-Neutral Word Embeddings, Zhao et al. 2018
- Combating Adversarial Misspellings with Robust Word Recognition, Danish et al. 2019
- Misspelling Oblivious Word Embeddings, Edizel et al. 2019 [facebook AI]
- Skip-Thought Vectors, Kiros et al. 2015
- A Structured Self-attentive Sentence Embedding, Lin et al. 2017
- InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
- Hierarchical Attention Networks for Document Classification, Yang et al. 2016
- DisSent: Sentence Representation Learning from Explicit Discourse Relations, Nie et al. 2017
- USE, Universal Sentence Encoder, Cer et al. 2018 [also see Multilingual USE]
- [fasttext embeddings]
- Polyglot: Distributed Word Representations for Multilingual NLP, Rami et al. 2013
- Density Matching for Bilingual Word Embedding, Zhou et al. 2015
- Word Translation Without Parallel Data, Conneau et al. 2017 [repo]
- Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion, Joulin et al. 2018
- Unsupervised Multilingual Word Embeddings, Chen & Cardie 2018 [repo]
💡 NLU and XLU
- GLUECoS: An Evaluation Benchmark for Code-Switched NLP, Khanuja et al. 2020
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation, Liang et al. 2020
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization, Hu et al. 2020
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, Wang et al. 2019
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Wang et al. 2018 [Site]
- XNLI: Evaluating Cross-lingual Sentence Representations, Conneau et al. 2018c
- SentEval: An Evaluation Toolkit for Universal Sentence Representations, Conneau et al. 2018a [Site]
- CLUE: Language Understanding Evaluation benchmark for Chinese (CLUE)
💡 inductive bias, distillation and pruning, adversarial attacks, fairness and bias
💡 distillation (can be thought of as a MAP estimate with a prior, rather than an MLE objective)
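💡 To make the note above concrete, here is a minimal PyTorch sketch of the usual soft-target distillation objective (hard-label cross-entropy plus a temperature-scaled KL term against the teacher); the KL term is what plays the role of a prior on the student's predictions. Shapes and hyperparameters are illustrative assumptions, not any particular paper's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of (i) cross-entropy with hard labels (the usual MLE term)
    and (ii) KL(teacher || student) at temperature T, which acts like a prior
    over the student's predictive distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes are comparable across temperatures
    return alpha * soft + (1.0 - alpha) * hard

# toy batch: 4 examples, 10 classes
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```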
- Dissecting Contextual Word Embeddings: Architecture and Representation, Peters et al. 2018b
- What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, Conneau et al 2018b
- Troubling Trends in Machine Learning Scholarship, Lipton and Steinhardt 2018
- How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks, Kaushik and Lipton 2018
- Are Sixteen Heads Really Better than One?, Paul et al. 2019 [Blogpost]
- No Training Required: Exploring Random Encoders for Sentence Classification, Wieting et al. 2019
- BERT Rediscovers the Classical NLP Pipeline, Tenney et al. 2019
- Compositional Questions Do Not Necessitate Multi-hop Reasoning, Min et al. 2019
- Probing Neural Network Comprehension of Natural Language Arguments, Niven & Kao 2019 and [this] related article
- The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives, Voita et al. 2019
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study, Fu et al. 2019
- Attention is not Explanation, Jain and Wallace 2019
- Is Attention Interpretable?, Serrano and Smith 2019
- Attention is not not Explanation, Wiegreffe and Pinter 2019
- Learning to Deceive with Attention-Based Explanations, Pruthi et al. 2020
- Combating Adversarial Misspellings with Robust Word Recognition, Danish et al. 2019
- Universal Adversarial Triggers for Attacking and Analyzing NLP, Wallace et al. 2019
- Weight Poisoning Attacks on Pre-trained Models, Kurita et al. 2020
- Understanding Knowledge Distillation in Non-autoregressive Machine Translation, Zhou et al. 2019
- Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, Tang et al. 2019. Also a related work from HuggingFace here, and work on quantization compression by RASA here
- Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, You et al. 2019 [also see this article]
- RoBERTa, A Robustly Optimized BERT Pretraining Approach, Liu et al. 2019
- Patient Knowledge Distillation for BERT Model Compression, Sun et al. 2019
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Lan et al. 2019
💡 Similar works are also compiled here: Pre-trained Language Model Papers
💡 Typically, these pre-training methods involve a self-supervised (also called semi-supervised/unsupervised in some works) learning stage followed by supervised fine-tuning. This is unlike the CV domain,
where pre-training is mainly supervised learning.
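💡 As a concrete illustration of the self-supervised stage, below is a small sketch of BERT-style masked-language-model input corruption (mask ~15% of tokens; of those, 80% become [MASK], 10% a random token, 10% unchanged), following the recipe described in Devlin et al. 2018. The token IDs and mask_id value are toy assumptions.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, seed=0):
    """BERT-style MLM corruption: select ~15% of positions as prediction targets;
    replace 80% of them with [MASK], 10% with a random token, and leave 10%
    unchanged. Returns (corrupted_ids, labels) where labels are -100 at
    positions that are not predicted (the usual 'ignore' index)."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:
            labels[i] = tok                               # predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return corrupted, labels

# toy usage with made-up token IDs (mask_id=103 mimics BERT's [MASK] id)
ids, labels = mask_tokens([7, 42, 19, 88, 5, 61, 23, 9], vocab_size=30522, mask_id=103)
print(ids, labels)
```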
- https://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture13-contextual-representations.pdf
- Semi-supervised Sequence Learning, Dai et al. 2015
- Unsupervised Pretraining for Sequence to Sequence Learning, Ramachandran et al. 2016
- context2vec: Learning Generic Context Embedding with Bidirectional LSTM, Melamud et al. 2016
- InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
- ULM-FiT, Universal Language Model Fine-tuning for Text Classification, Howard and Ruder 2018
- ELMo, Deep contextualized word representations, Peters et al. 2018 [also see previous works- TagLM and CoVe]
- GPT-1 aka OpenAI Transformer, Improving Language Understanding by Generative Pre-Training, Radford et al. 2018
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018 [SLIDES] [also see Illustrated BERT]
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang et al. 2019
- GPT-2, Language Models are Unsupervised Multitask Learners, Radford et al. 2019 [also see Illustrated GPT-2]
- ERNIE: Enhanced Language Representation with Informative Entities, Zhang et al. 2019
- XLNet: Generalized Autoregressive Pretraining for Language Understanding, Yang et al. 2019
- RoBERTa: A Robustly Optimized BERT Pretraining Approach, Liu et al. 2019
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, Sun et al. 2019
- CTRL: A Conditional Transformer Language Model for Controllable Generation, Keskar et al. 2019
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Lan et al. 2019
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Clark et al. 2019 [Google Blog]
💡 Some people went ahead and thought: "how about using supervised (± self-supervised) tasks for pretraining?!"
- InferSent, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, Conneau et al. 2017
- USE, Universal Sentence Encoder, Cer et al. 2018 [also see Multilingual USE]
- Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks, Phang et al. 2018
💡 see Interpretability and Ethics section for more papers
- E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT, Poerner et al. 2020
- A Primer in BERTology: What we know about how BERT works, Rogers et al. 2020
- Comparing BERT against traditional machine learning text classification, Carvajal et al. 2020
- Revisiting Few-sample BERT Fine-tuning, Zhang et al. 2020
- The Evolved Transformer, So et al. 2019
- R-Transformer: Recurrent Neural Network Enhanced Transformer, Wang et al. 2019
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al. 2019
- The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives, Voita et al. 2019
- Reformer: The Efficient Transformer, Kitaev et al. 2020
💡 dataset distillation :p
- Deep Active Learning for Named Entity Recognition, Shen et al. 2017
- Learning how to Active Learn: A Deep Reinforcement Learning Approach, Fang et al. 2017
- An Ensemble Deep Active Learning Method for Intent Classification, Zhang et al. 2019
- decaNLP, The Natural Language Decathlon: Multitask Learning as Question Answering, McCann et al. 2018
- HMTL, A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks, Victor et al. 2018
- GenSen, Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, Subramanian et al. 2018
- Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling, Wang et al. 2019
- GPT-2, Language Models are Unsupervised Multitask Learners, Radford et al. 2019 [also see Illustrated GPT-2]
- Unified Language Model Pre-training for Natural Language Understanding and Generation, Dong et al. 2019
- MASS: Masked Sequence to Sequence Pre-training for Language Generation, Song et al. 2019
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, Sun et al. 2019
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al. 2019 [code]
- Incorporating Copying Mechanism in Sequence-to-Sequence Learning, Jiatao Gu et al. 2016
- Quantifying Exposure Bias for Neural Language Generation, He et al. 2019
- CTRL: A Conditional Transformer Language Model for Controllable Generation, Keskar et al. 2019
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation, Dathathri et al. 2019
- Zero-shot User Intent Detection via Capsule Neural Networks, Xia et al. 2018
- Investigating Capsule Networks with Dynamic Routing for Text Classification, Zhao et al. 2018
- BERT for Joint Intent Classification and Slot Filling, Chen et al. 2019
- Few-Shot Generalization Across Dialogue Tasks, Vlasov et al. 2019 [RASA Research]
- Towards Open Intent Discovery for Conversational Text, Vedula et al. 2019
- What makes a good conversation? How controllable attributes affect human judgments, See et al. 2019 [also see this article]
- Sequence to Sequence Learning with Neural Networks, Sutskever et al. 2014
- Addressing the Rare Word Problem in Neural Machine Translation, Luong et al. 2014
- Neural Machine Translation of Rare Words with Subword Units, Sennrich et al. 2015
- Transformer, Attention Is All You Need, Vaswani et al. 2017
- Understanding Back-Translation at Scale, Edunov et al. 2018
- Achieving Human Parity on Automatic Chinese to English News Translation, Microsoft Research 2018 [Bites] [also see this and this]
💡 LMs realized as diverse learners; learning more than what you thought!!
- Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, Artetxe et al. 2018
- How multilingual is Multilingual BERT?, Pires et al. 2019
- Multilingual Universal Sentence Encoder (USE) for Semantic Retrieval, Yinfei Yang et al. 2019
- How Language-Neutral is Multilingual BERT?, Libovicky et al. 2020
- Universal Phone Recognition with a Multilingual Allophone System, Li et al. 2020
- http://ruder.io/cross-lingual-embeddings/index.html
- XLM, Cross-lingual Language Model Pretraining, Lample and Conneau 2019
- Cross-Lingual Ability of Multilingual BERT: An Empirical Study, Karthikeyan et al. 2019
- XQA: A Cross-lingual Open-domain Question Answering Dataset, Liu et al. 2019
- Representation Learning with Contrastive Predictive Coding, Oord et al. 2018
- M-BERT: Injecting Multimodal Information in the BERT Structure, Rahman et al. 2019
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers, Tan and Bansal 2019
- BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, Scialom et al. 2020
- A Deep Neural Network Framework for English Hindi Question Answering
- DrQA, Reading Wikipedia to Answer Open-Domain Questions, Chen et al. 2017
- GoldEn Retriever, Answering Complex Open-domain Questions Through Iterative Query Generation, Qi et al. 2019
- BREAK It Down: A Question Understanding Benchmark, Wolfson et al. 2020
- XQA: A Cross-lingual Open-domain Question Answering Dataset, Liu et al. 2019
Byte Pair Encoding (BPE)
is a data compression technique that iteratively replaces the most frequent pair of symbols (originally bytes) in a given dataset with a single unused symbol. In each iteration, the algorithm finds the most frequent (adjacent) pair of symbols, each of which can be a single character or a sequence of characters, and merges them to create a new symbol. All occurrences of the selected pair are then replaced with the new symbol before the next iteration. Eventually, frequent sequences of characters, up to whole words, are replaced with single symbols, and the process stops once the defined number of merge iterations is reached (50k is an example figure). During inference, if a word isn't part of BPE's pre-built dictionary, it is split into subwords that are. Code of BPE can be found here. See the Overall Idea blog-post, the BPE-specific blog-post and the BPE Code for more details.
import re

# Toy "vocabulary": each word is split into characters separated by spaces,
# with an end-of-word marker </w>; the extra strings "i n" and "<w> i n"
# are run through the same merges.
words0 = [" ".join(list(word) + ["</w>"]) for word in "in the rain in Ukraine".split()] + ["i n"] + ["<w> i n"]
print(words0)

# Merge 1: replace the symbol pair "i n" with the new symbol "in".
# The lookbehind/lookahead ensure only whole, whitespace-delimited symbols match.
eword1 = re.escape('i n')
p1 = re.compile(r'(?<!\S)' + eword1 + r'(?!\S)')
words1 = [p1.sub('in', word) for word in words0]
print(words1)

# Merge 2: "in </w>" -> "in</w>"
eword2 = re.escape('in </w>')
p2 = re.compile(r'(?<!\S)' + eword2 + r'(?!\S)')
words2 = [p2.sub('in</w>', word) for word in words1]
print(words2)

# Merge 3: "a in</w>" -> "ain</w>"
eword3 = re.escape('a in</w>')
p3 = re.compile(r'(?<!\S)' + eword3 + r'(?!\S)')
words3 = [p3.sub('ain</w>', word) for word in words2]
print(words3)
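The snippet above hard-codes three merges; the learned version iterates the same replace step, picking the most frequent adjacent pair each round. Below is a minimal sketch of that loop, following the pseudocode in Sennrich et al. 2015; the toy corpus and number of merges are illustrative, not the paper's settings.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs over a {word-as-space-separated-symbols: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every whole-symbol occurrence of the chosen pair with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

corpus = "in the rain in Ukraine".split()
vocab = Counter(' '.join(list(w) + ['</w>']) for w in corpus)

num_merges = 5   # toy value; real systems use e.g. ~30k-50k merges
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best, dict(vocab))
```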
Python's `re` library: re.search(), re.findall(), re.split(), re.sub(), re.escape(), re.compile(); also see 101, and Positive and Negative Lookahead/Lookbehind.
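A few illustrative one-liners for the `re` calls and lookarounds mentioned above (toy strings, just to show the behaviour):

```python
import re

text = "in the rain in Ukraine"

print(re.search(r"rain", text).span())   # (7, 11): first match only
print(re.findall(r"in", text))           # all non-overlapping matches of "in"
print(re.split(r"\s+", text))            # split on whitespace
print(re.sub(r"in\b", "IN", text))       # replace "in" only at word ends

# re.escape + re.compile with lookarounds: match "i n" only as a whole,
# whitespace-delimited symbol pair (the trick used in the BPE snippet above)
pair = re.compile(r"(?<!\S)" + re.escape("i n") + r"(?!\S)")
print(pair.sub("in", "<w> i n"))         # "<w> in"
print(pair.sub("in", "a i nb"))          # unchanged: "nb" breaks the lookahead
```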
- Models can be trained on SNLI in two different ways: (i) sentence encoding-based models that explicitly separate the encoding of the individual sentences, and (ii) joint methods that use the encodings of both sentences together (e.g., cross-features or attention from one sentence to the other); see the sketch after this list.
- How well do ranking algorithms (pointwise/pairwise/listwise learning paradigms) perform when the number of test classes at inference time grows massively? KG reasoning using translational/bilinear/deep-learning techniques is one important area under consideration.
- While the chosen neural architecture is important, the techniques used to train the objective (e.g., Word2Vec) and the techniques used during loss optimization (e.g., the OpenAI Transformer) play a significant role in achieving both fast and good convergence.
- Commonality between language modelling, machine translation and Word2vec: all of them have a huge vocabulary at the output, and there is a need to alleviate the cost of computing the huge softmax layer! See Ruder's page for a quick read.
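Regarding the two ways of training on SNLI mentioned above, here is a minimal PyTorch sketch contrasting (i) a sentence-encoding (siamese/bi-encoder) setup and (ii) a joint (cross-attention style) setup; the architectures are deliberately tiny, illustrative stand-ins rather than any paper's exact model.

```python
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 64, 3   # toy sizes; 3 NLI classes

class BiEncoderNLI(nn.Module):
    """(i) Encode premise and hypothesis separately, then combine the two
    sentence vectors (concat, |u-v|, u*v, InferSent-style)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.enc = nn.LSTM(DIM, DIM, batch_first=True)
        self.clf = nn.Linear(4 * DIM, CLASSES)

    def encode(self, ids):
        h, _ = self.enc(self.emb(ids))
        return h.max(dim=1).values          # max-pool over time

    def forward(self, premise, hypothesis):
        u, v = self.encode(premise), self.encode(hypothesis)
        return self.clf(torch.cat([u, v, (u - v).abs(), u * v], dim=-1))

class CrossEncoderNLI(nn.Module):
    """(ii) Feed the concatenated pair through one encoder so tokens of one
    sentence can attend to tokens of the other."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.att = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.clf = nn.Linear(DIM, CLASSES)

    def forward(self, premise, hypothesis):
        x = self.emb(torch.cat([premise, hypothesis], dim=1))
        h, _ = self.att(x, x, x)            # self-attention over the joint sequence
        return self.clf(h.mean(dim=1))

p = torch.randint(0, VOCAB, (2, 7))   # batch of 2 premises, length 7
h = torch.randint(0, VOCAB, (2, 5))   # batch of 2 hypotheses, length 5
print(BiEncoderNLI()(p, h).shape, CrossEncoderNLI()(p, h).shape)  # both (2, 3)
```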
- nlp code examples: nn4nlp-code and nlp-tutorial
- lazynlp for data crawling
- video-list
- fast.ai, d2l.ai, nn4nlp-graham
Each link is either a series of blog posts from an individual/organization, a conference-related link, or a MOOC.
- nlpprogress, paperswithcode, dair-ai, groundai, lyrn.ai
- BERT-related-papers, awesome-qa, awesome-NLP, awesome-sentence-embedding
- Ruder's blog posts; this and this take on the latest trends
- Jay Alammar, Lilian Weng (at OpenAI), Sebastian Raschka (at UW-Madison), Victor Sanh (at HuggingFace), Keita Kurita | ML & NLP Explained, distill.pub, Chris McCormick, Kavita Ganesan
- blog.feedspot.com