*Figure 1: RNN-based encoder-decoder architecture [[1]](https://arxiv.org/abs/1409.3215) [[2]](https://arxiv.org/abs/1409.0473). Left: without attention mechanism | Right: with attention mechanism.*

In 2017, Vaswani et al. published the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), which introduced a new model architecture called the `Transformer`. It still consists of an encoder and a decoder; however, instead of building these components from RNN/LSTM layers, it uses multi-head self-attention as the basic building block. This attention mechanism has been the foundation of the breakthroughs in NLP ever since, well beyond NMT tasks.
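To give a feel for this building block, here is a minimal sketch of scaled dot-product self-attention for a single head, in plain PyTorch. The tensor names and dimensions below are illustrative choices, not from the paper; the full Transformer runs several such heads in parallel and concatenates their outputs (multi-head attention).

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head.

    x: (batch, seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q  # queries (batch, seq_len, d_k)
    k = x @ w_k  # keys    (batch, seq_len, d_k)
    v = x @ w_v  # values  (batch, seq_len, d_k)
    d_k = q.size(-1)
    # Each token attends to every token in the same sequence.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                             # (batch, seq_len, d_k)

# Example: batch of 2 sequences, 5 tokens each, model dim 16, head dim 8.
x = torch.randn(2, 5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 5, 8])
```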
Combined with the ideas of pretraining and transfer learning (popularized, for example, by [ULMFiT](https://arxiv.org/abs/1801.06146)), a golden age of NLP started in 2018-2019 with the release of OpenAI's [GPT](https://openai.com/blog/language-unsupervised/) and [GPT-2](https://openai.com/blog/better-language-models/) models and Google's [BERT](https://arxiv.org/abs/1810.04805) model. It is now common to call all of these Transformer models; however, they do not use the encoder-decoder architecture of the original Transformer: BERT is encoder-only (originally aimed at tasks like text classification) and the GPT models are decoder-only (aimed at text auto-completion).
The above models and their variations focus on pretraining either the encoder or the decoder only. The [BART](https://arxiv.org/abs/1910.13461) model is one example of a standalone encoder-decoder Transformer model with a sequence-to-sequence pretraining method; it can be used directly for document summarization, question answering and machine translation tasks.[<sup>1</sup>](#fn1) The [T5](https://arxiv.org/abs/1910.10683) model converts all text-based NLP problems into a text-to-text format and uses the Transformer encoder-decoder to tackle all of them. During pretraining, these models are trained from scratch: their encoders and decoders are initialized with random weights.
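As an illustration of T5's text-to-text format, a single checkpoint handles different tasks simply by prepending a task prefix to the input. A minimal sketch with 🤗 transformers (the checkpoint choice and example texts are ours):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One model, several tasks: the task is selected by a text prefix.
for text in [
    "translate English to German: The house is wonderful.",
    "summarize: The Transformer replaces recurrence with multi-head "
    "self-attention, which allows far more parallelism during training.",
]:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```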
*Figure 3: The 3 pretraining paradigms for Transformer models [[4]](https://arxiv.org/abs/1810.04805) [[5]](https://openai.com/blog/language-unsupervised/) [[6]](https://arxiv.org/abs/1910.13461).*

In 2020, the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) studied the effectiveness of initializing sequence-to-sequence models with pretrained encoder/decoder checkpoints. It achieved new state-of-the-art results on machine translation, text summarization and other sequence generation tasks.
Following this idea, 🤗 [transformers](https://huggingface.co/docs/transformers/index) implements [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder), which allows users to easily combine almost any 🤗 pretrained encoder (BERT, RoBERTa, etc.) with a 🤗 pretrained decoder (GPT models, the decoder from BART or T5, etc.) and fine-tune the combination on downstream tasks. Instantiating an [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) is straightforward, and fine-tuning it on a sequence-to-sequence task usually gives decent results in just a few hours on a Google Cloud TPU.
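For example, warm-starting a BERT-to-GPT-2 model takes a single call (the checkpoint pairing below is just one possible choice):

```python
from transformers import EncoderDecoderModel

# Warm-start an encoder-decoder model from two pretrained checkpoints.
# The encoder-decoder cross-attention weights are the only part that is
# randomly initialized; they are learned during fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "gpt2"
)

# Save the combined model; it can be reloaded later with
# EncoderDecoderModel.from_pretrained("bert2gpt2").
model.save_pretrained("bert2gpt2")
```

Note that when the two checkpoints differ, the encoder and decoder keep their own tokenizers, so inputs and labels have to be tokenized separately.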