- NLU tasks
  - textual entailment, question answering, semantic similarity assessment, document classification
- Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
- Large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.

|  | pre-training | fine-tuning |
|---|---|---|
| characteristic | generative | discriminative |
| problem | a language model | each specific task |
| data | diverse corpus of unlabeled text | labeled data for learning these specific tasks |
| learning type | unsupervised | supervised |
- Make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
- The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
  - absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI)
- The training procedure consists of two stages:
  1. learning a high-capacity language model on a large corpus of text,
  2. a fine-tuning stage, where the model is adapted to a discriminative task with labeled data.
- Notation

| symbol | meaning |
|---|---|
| $\mathcal{U}$ | an unsupervised corpus of tokens, $\mathcal{U} = \{u_1, \dots, u_n\}$ |
| $\mathcal{C}$ | a labeled dataset; each instance consists of a sequence of input tokens $x^1, \dots, x^m$ along with a label $y$ |

|  | Unsupervised pre-training | Supervised fine-tuning |
|---|---|---|
| corpus | an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$ | a labeled dataset $\mathcal{C}$: input tokens $x^1, \dots, x^m$ with a label $y$ |
| objective function | $L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$ | $L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \dots, x^m)$ |
| model | multi-layer Transformer decoder [34] | pre-trained model plus an added linear output layer: $P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$ |
| input | contiguous sequences of text | convert structured inputs into an ordered sequence |
- Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$, use a standard language modeling objective to maximize the following likelihood:

$$ L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta) $$

- $k$: size of the context window
- $P$: conditional probability, modeled using a neural network with parameters $\Theta$
- These parameters are trained using stochastic gradient descent [51].
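A minimal PyTorch sketch of this objective (the function name and tensor shapes are assumptions, not from the paper): cross-entropy over next-token predictions is the negative of the log-likelihood $L_1$ being maximized.

```python
import torch.nn.functional as F

def lm_loss(logits, tokens):
    """Negative L1: logits (batch, seq, vocab), tokens (batch, seq)."""
    pred = logits[:, :-1, :]    # predictions for u_2 .. u_n from their left context
    target = tokens[:, 1:]      # the tokens being predicted
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```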
- Use a multi-layer Transformer decoder [34] for the language model, a variant of the transformer [62]:
  - a multi-headed self-attention operation over the input context tokens
  - followed by position-wise feedforward layers to produce an output distribution over target tokens:

$$ h_0 = U W_e + W_p $$

$$ h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] $$

$$ P(u) = \mathrm{softmax}(h_n W_e^T) $$

- $U = (u_{-k}, \dots, u_{-1})$: context vector of tokens
- $n$: the number of layers
- $W_e$: token embedding matrix
- $W_p$: position embedding matrix
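A minimal PyTorch sketch of these three equations; `nn.TransformerEncoderLayer` with a causal mask stands in for the paper's transformer block, and all names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    def __init__(self, vocab=40000, d=768, n_layers=12, heads=12, ctx=512):
        super().__init__()
        self.W_e = nn.Embedding(vocab, d)   # token embedding matrix W_e
        self.W_p = nn.Embedding(ctx, d)     # position embedding matrix W_p
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d, heads, 4 * d,
                                       activation="gelu", batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, U):                                    # U: (batch, seq) ids
        pos = torch.arange(U.size(1), device=U.device)
        h = self.W_e(U) + self.W_p(pos)                      # h_0 = U W_e + W_p
        mask = nn.Transformer.generate_square_subsequent_mask(U.size(1)).to(U.device)
        for block in self.blocks:                            # h_l = block(h_{l-1})
            h = block(h, src_mask=mask)                      # masked self-attention
        return h @ self.W_e.weight.T                         # softmax of this is P(u)
```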
- After training the model with the objective in Eq. 1, adapt the parameters to the supervised target task.
- Assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^1, \dots, x^m$ along with a label $y$.
- The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$ P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y) $$

- This gives the following objective to maximize:

$$ L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \dots, x^m) $$
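A sketch of the added output layer, with hypothetical names (`ClassificationHead`, `h_last`); only the linear layer $W_y$ is new relative to pre-training.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, d_model=768, n_classes=2, dropout=0.1):
        super().__init__()
        self.drop = nn.Dropout(dropout)                  # classifier dropout, rate 0.1
        self.W_y = nn.Linear(d_model, n_classes, bias=False)

    def forward(self, h_last):
        # h_last: final transformer block's activation at the last token, h_l^m
        return self.W_y(self.drop(h_last))               # softmax of this gives P(y|x)
```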
- Including language modeling as an auxiliary objective during fine-tuning helped learning by
  - (a) improving generalization of the supervised model, and
  - (b) accelerating convergence.
- Specifically, optimize the following objective (with weight $\lambda$):

$$ L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) $$
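A sketch of the combined fine-tuning loss (a hypothetical helper; minimizing cross-entropy maximizes the log-likelihoods above):

```python
import torch.nn.functional as F

def fine_tune_loss(class_logits, labels, lm_logits, tokens, lam=0.5):
    """-L3 = -L2 - lambda * L1; the paper sets lambda = 0.5."""
    l2 = F.cross_entropy(class_logits, labels)                # supervised term
    l1 = F.cross_entropy(                                     # auxiliary LM term
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1))
    return l2 + lam * l1
```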
- Task input
  - Some tasks (e.g. text classification) can directly fine-tune the model as described above.
  - Other tasks (e.g. question answering or textual entailment) have structured inputs, such as ordered sentence pairs, or triplets of document, question, and answers.
  - Previous work proposed learning task-specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components.
  - Instead, use a traversal-style approach [52]: convert structured inputs into an ordered sequence that the pre-trained model can process.
  - All transformations include adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$).
| task | input transformation | note |
|---|---|---|
| Textual entailment | concatenate the premise $p$ and hypothesis $h$ token sequences, with a delimiter token (\$) in between | |
| Similarity | modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations | there is no inherent ordering of the two sentences being compared |
| Question Answering & Commonsense Reasoning | concatenate the document context and question with each possible answer, adding a delimiter token in between, to get $[z; q; \$; a_k]$ | given a context document $z$, a question $q$, and a set of possible answers $\{a_k\}$ |
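A sketch of these traversal-style transformations over token lists (the token strings and helper names are illustrative):

```python
START, DELIM, END = "<s>", "$", "<e>"   # randomly initialized special tokens

def entailment_input(premise, hypothesis):
    # [<s>; premise; $; hypothesis; <e>]
    return [START] + premise + [DELIM] + hypothesis + [END]

def qa_inputs(context, question, answer_choices):
    # one sequence [z; q; $; a_k] per candidate answer; each sequence is
    # processed independently and the answers are normalized via a softmax
    return [[START] + context + question + [DELIM] + a + [END]
            for a in answer_choices]
```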
| step | task | dataset |
|---|---|---|
| training LM | language modeling | BooksCorpus dataset [71] |
| fine-tuning | Natural language inference | SNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25] |
| fine-tuning | Question answering | RACE [30], Story Cloze [40] |
| fine-tuning | Sentence similarity | MSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6] |
| fine-tuning | Classification | Stanford Sentiment Treebank-2 [54], CoLA [65] |
- BooksCorpus dataset [71]
  - contains over 7,000 unique unpublished books from a variety of genres (Adventure, Fantasy, Romance)
  - An alternative dataset, the 1B Word Benchmark, is approximately the same size but is shuffled at the sentence level, destroying long-range structure.
- model spec
  - a 12-layer decoder-only transformer
  - masked self-attention heads (768-dimensional states and 12 attention heads)
  - Adam optimization with a max learning rate of 2.5e-4
  - learning rate increased linearly from zero over the first 2,000 updates and annealed to 0 using a cosine schedule (see the sketch below)
  - 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens
  - a simple weight initialization of $N(0, 0.02)$
  - a bytepair encoding (BPE) vocabulary with 40,000 merges [53]
  - residual, embedding, and attention dropouts with a rate of 0.1 for regularization
  - a modified version of L2 regularization with $w = 0.01$ on all non-bias or gain weights
  - Gaussian Error Linear Unit (GELU) activation
  - learned position embeddings
- fine-tuning
  - reuse the hyperparameter settings from unsupervised pre-training
  - add dropout to the classifier with a rate of 0.1
  - for most tasks, a learning rate of 6.25e-5 and a batch size of 32
  - 3 epochs of training was sufficient for most cases
  - a linear learning rate decay schedule with warmup over 0.2% of training
  - $\lambda$ was set to 0.5
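A sketch of the pre-training learning rate schedule (the total step count here is an illustrative assumption; the paper specifies it via epochs):

```python
import math

def lr_schedule(step, max_lr=2.5e-4, warmup=2000, total=100_000):
    """Linear warmup over the first 2000 updates, then cosine annealing to 0."""
    if step < warmup:
        return max_lr * step / warmup                          # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```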
- use the ftfy library to clean the raw text in BooksCorpus and standardize some punctuation and whitespace
- use the spaCy tokenizer
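A minimal sketch of this preprocessing step (`preprocess` is a hypothetical helper):

```python
import ftfy
import spacy

nlp = spacy.blank("en")                # tokenizer only, no pipeline components

def preprocess(raw_text):
    cleaned = ftfy.fix_text(raw_text)  # repair mojibake, normalize punctuation/whitespace
    return [tok.text for tok in nlp(cleaned)]
```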
[Sources]
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
- Lecture by Prof. Pilsung Kang, Korea University