
Commit c451d9b

Author: ydshieh
Commit message: update md

1 parent 0721857 commit c451d9b

File tree: 1 file changed (+66, -32 lines)


vision_encoder_decoder_blog.md

Lines changed: 66 additions & 32 deletions
@@ -27,28 +27,58 @@ The encoder-decoder architecture was proposed in 2014, when several papers ([Cho

<a id='figure-1'></a>

-| <img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/rnn_encoder_decoder.JPG" alt="drawing" width="550"/> |
-|:--:|
-| Figure 1: RNN-based encoder-decoder architecture [<sup>[1]</sup>](https://arxiv.org/abs/1409.3215) [<sup>[2]</sup>](https://arxiv.org/abs/1409.0473)<br><br>Left: without attention mechanism &nbsp; \| &nbsp; Right: with attention mechanism|
+<div align="center">
+<table>
+<thead><tr>
+<th style="text-align:center"><img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/rnn_encoder_decoder.JPG" alt="drawing" width="550"/></th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">Figure 1: RNN-based encoder-decoder architecture <a href="https://arxiv.org/abs/1409.3215"><sup>[1]</sup></a> <a href="https://arxiv.org/abs/1409.0473"><sup>[2]</sup></a><br><br>Left: without attention mechanism &nbsp; | &nbsp; Right: with attention mechanism</td>
+</tr>
+</tbody>
+</table>
+</div>

In 2017, Vaswani et al. published the paper [Attention is all you need](https://arxiv.org/abs/1706.03762), which introduced a new model architecture called the `Transformer`. It still consists of an encoder and a decoder, but instead of building these components from RNNs/LSTMs, it uses multi-head self-attention as the basic building block. This attention mechanism has been the foundation of the breakthroughs in NLP ever since, well beyond NMT tasks.

<a id='figure-2'></a>

-| <img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/transformer.JPG" alt="drawing" width="250"/> |
-|:--:|
-| Figure 2: Transformer encoder-decoder architecture [<sup>[3]</sup>](https://arxiv.org/abs/1706.03762)|
+<div align="center">
+<table>
+<thead><tr>
+<th style="text-align:center"><img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/transformer.JPG" alt="drawing" width="250"/></th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">Figure 2: Transformer encoder-decoder architecture <a href="https://arxiv.org/abs/1706.03762"><sup>[3]</sup></a></td>
+</tr>
+</tbody>
+</table>
+</div>

Combined with the idea of pretraining and transfer learning (for example, from [ULMFiT](https://arxiv.org/abs/1801.06146)), a golden age of NLP started in 2018-2019 with the release of OpenAI's [GPT](https://openai.com/blog/language-unsupervised/) and [GPT-2](https://openai.com/blog/better-language-models/) models and Google's [BERT](https://arxiv.org/abs/1810.04805) model. It's now common to call them Transformer models, however they do not follow the encoder-decoder architecture of the original Transformer: BERT is encoder-only (originally for text classification) and the GPT models are decoder-only (for text auto-completion).

The above models and their variations focus on pretraining either the encoder or the decoder only. The [BART](https://arxiv.org/abs/1910.13461) model is one example of a standalone encoder-decoder Transformer model adopting a sequence-to-sequence pretraining method, which can be used directly for document summarization, question answering and machine translation tasks.[<sup>1</sup>](#fn1) The [T5](https://arxiv.org/abs/1910.10683) model converts all text-based NLP problems into a text-to-text format and uses the Transformer encoder-decoder to tackle all of them. During pretraining, these models are trained from scratch: their encoder and decoder components are initialized with random weights.

<a id='figure-3'></a>

-| <img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/bert-gpt-bart.JPG" alt="drawing" width="400"/> |
-|:--:|
-| Figure 3: The 3 pretraining paradigms for Transformer models [<sup>[4]</sup>](https://arxiv.org/abs/1810.04805) [<sup>[5]</sup>](https://openai.com/blog/language-unsupervised/) [<sup>[6]</sup>](https://arxiv.org/abs/1910.13461)|
-
+<div align="center">
+<table>
+<thead><tr>
+<th style="text-align:center"><img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/bert-gpt-bart.JPG" alt="drawing" width="400"/></th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">Figure 3: The 3 pretraining paradigms for Transformer models <a href="https://arxiv.org/abs/1810.04805"><sup>[4]</sup></a> <a href="https://openai.com/blog/language-unsupervised/"><sup>[5]</sup></a> <a href="https://arxiv.org/abs/1910.13461"><sup>[6]</sup></a></td>
+</tr>
+</tbody>
+</table>
+</div>
+
In 2020, the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) studied the effectiveness of initializing sequence-to-sequence models with pretrained encoder/decoder checkpoints. It obtained new state-of-the-art results on machine translation, text summarization, and other sequence generation tasks.

Following this idea, 🤗 [transformers](https://huggingface.co/docs/transformers/index) implements [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder), which allows users to easily combine almost any 🤗 pretrained encoder (BERT, RoBERTa, etc.) with a 🤗 pretrained decoder (GPT models, the decoder from BART or T5, etc.) and fine-tune the combination on downstream tasks. Instantiating an [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) is straightforward, and fine-tuning it on a sequence-to-sequence task usually yields decent results in just a few hours on a Google Cloud TPU.
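
As a minimal sketch of the paragraph above (the BERT2BERT pairing, checkpoint names, and toy sentences are illustrative choices, not taken from the post):

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Combine a pretrained BERT encoder with a pretrained BERT decoder; the decoder's
# cross-attention weights do not exist in the checkpoint and are newly initialized,
# so the combined model is meant to be fine-tuned on a sequence-to-sequence task.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Required so that `labels` can be shifted right to build the decoder inputs.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Toy forward pass; during fine-tuning the labels come from the target sequences.
inputs = tokenizer("The encoder reads this sentence.", return_tensors="pt")
labels = tokenizer("a short target sequence", return_tensors="pt").input_ids
outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)
print(float(outputs.loss))
```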
@@ -151,9 +181,19 @@ The obtained sequence of vectors plays the same role as token embeddings in [BER

<a id='figure-4'></a>

-| <img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/bert-vs-vit.JPG" alt="drawing" width="600"/> |
-|:--:|
-| Figure 4: BERT vs. ViT |
+<div align="center">
+<table>
+<thead><tr>
+<th style="text-align:center"><img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/bert-vs-vit.JPG" alt="drawing" width="600"/></th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">Figure 4: BERT vs. ViT</td>
+</tr>
+</tbody>
+</table>
+</div>

<sup>2</sup> This is just the concept. The actual implementation uses convolution layers to perform this computation efficiently.
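
Footnote 2 can be made concrete with a minimal sketch, assuming ViT-Base-like sizes (16x16 patches, hidden size 768, a 224x224 input); these numbers are illustrative, not quoted from the post:

```python
import torch
from torch import nn

# A Conv2d whose kernel size equals its stride cuts the image into non-overlapping
# patches and linearly projects each patch in a single operation.
patch_size, hidden_size = 16, 768
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

pixel_values = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_grid = patch_embed(pixel_values)               # (1, 768, 14, 14)
embeddings = patch_grid.flatten(2).transpose(1, 2)   # (1, 196, 768): one vector per patch
print(embeddings.shape)
```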

@@ -369,9 +409,19 @@ We have learned the encoder-decoder architecture in NLP and the vision Transform

<a id='figure-5'></a>

-| <img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/vision-enc-dec.JPG" alt="drawing" width="800"/> |
-|:--:|
-| Figure 5: Vision-Encoder-Decoder architecture |
+<div align="center">
+<table>
+<thead><tr>
+<th style="text-align:center"><img src="https://raw.githubusercontent.com/ydshieh/notebooks/master/images/vision-enc-dec.JPG" alt="drawing" width="800"/></th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td style="text-align:center">Figure 5: Vision-Encoder-Decoder architecture</td>
+</tr>
+</tbody>
+</table>
+</div>

### **Vision-Encoder-Decoder in 🤗 transformers**
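
As a quick preview of the section introduced by the heading above, here is a minimal sketch of pairing a pretrained vision encoder with a pretrained text decoder (the ViT and GPT-2 checkpoints are illustrative choices, not necessarily the ones used later in the post):

```python
from transformers import VisionEncoderDecoderModel

# Pair a pretrained ViT encoder with a pretrained GPT-2 decoder. The cross-attention
# layers added to the decoder are randomly initialized, so the combined model needs
# fine-tuning (e.g. on image captioning) before it generates useful text.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
print(type(model.encoder).__name__, "+", type(model.decoder).__name__)  # ViTModel + GPT2LMHeadModel
```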

@@ -567,14 +617,6 @@ display(df[:3].style.set_table_styles([{'selector': 'td', 'props': props}, {'sel



-<style type="text/css">
-#T_800ac_ td {
-border: 2px solid black;
-}
-#T_800ac_ th {
-border: 2px solid black;
-}
-</style>
<table id="T_800ac_" class="dataframe">
<thead>
<tr>
@@ -659,14 +701,6 @@ display(df[3:].style.set_table_styles([{'selector': 'td', 'props': props}, {'sel



-<style type="text/css">
-#T_5456e_ td {
-border: 2px solid black;
-}
-#T_5456e_ th {
-border: 2px solid black;
-}
-</style>
<table id="T_5456e_" class="dataframe">
<thead>
<tr>
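
For context, the `<style>` blocks removed in the two hunks above come from the pandas Styler call shown in their hunk headers. A minimal sketch of that call (the dataframe contents and the `props` string here are assumptions for illustration; pandas >= 1.3 is assumed for `to_html`):

```python
import pandas as pd

# Style a small results table with per-cell borders; the Styler's HTML output embeds
# a generated <style> block like the ones removed in the diff above.
df = pd.DataFrame({"image": ["cat.jpg"], "generated caption": ["a cat lying on a bed"]})
props = "border: 2px solid black;"
styled = df.style.set_table_styles(
    [{"selector": "td", "props": props}, {"selector": "th", "props": props}]
)

html = styled.to_html()   # in a notebook, display(styled) renders the same thing
print("<style" in html)   # True: the generated style block is part of the output
```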
