Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. #6594

Conversation

patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Aug 19, 2020

This PR adds the models from the following paper:

Paper: https://arxiv.org/pdf/1907.12461.pdf

The paper does a great job of showing how pretrained BERT & RoBERTa models can be leveraged for Seq2Seq tasks, yielding strong results on many of them. It fits very well with the current implementation of the EncoderDecoder framework.

This PR adds code to port all pretrained encoder-decoder models that can be found here: https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder

The ported checkpoints can be found here: https://huggingface.co/models?search=google%2Froberta
and here: https://huggingface.co/models?search=google%2Fbert2

An example of how a model can be used is here:
https://huggingface.co/google/roberta2roberta_L-24_bbc
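
For readers following along, a minimal sketch of how such a checkpoint might be loaded for summarization (roughly following the linked model card; the article text is a placeholder):

from transformers import AutoTokenizer, EncoderDecoderModel

# load the ported roberta2roberta checkpoint for extreme summarization (BBC XSum)
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_bbc")
model = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_bbc")

article = "Some BBC-style news article to summarize ..."  # placeholder text
input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))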

Big thanks to @shashiongithub for providing me with the tokenizer files and giving valuable insights on setting the correct generation parameters!

@patrickvonplaten patrickvonplaten changed the title [Seq2Seq] Port google EncoderDecoder pretrained checkpoint models into EncoderDecoder framework [WIP, Seq2Seq] Port google EncoderDecoder pretrained checkpoint models into EncoderDecoder framework Aug 19, 2020
@patrickvonplaten patrickvonplaten force-pushed the add_seq2seq_tf_hub_conversion_script branch from cf1f3fc to 89538ff Compare September 7, 2020 14:43
@LysandreJik
Member

If I understand correctly, the BERT model used here is slightly different because:

  • It doesn't use token type IDs
  • It ties its word embedding layer to its LM head
  • It has no pooling layer

Doesn't that just mean we could use an additional architecture instead of an entire model class? Something like the following, in modeling_bert.py:

@add_start_docstrings(
    """Bert Model with a `language modeling` head on top that acts as a decoder in a seq2seq setting.""", BERT_START_DOCSTRING
)
class CausalBertModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        if not config.is_decoder:
            logger.warning("If you want to use `CausalBertModel` as a standalone, add `is_decoder=True`.")

        self.bert = BertModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.init_weights()

And this module wouldn't accept token type IDs as an input.

I don't know what to do regarding the tokenizer though. This ^ approach could probably leverage @julien-c's #6995

@sshleifer
Contributor

Naming ideas: BertFor{Conditional}Generation, BertEncoder, BertDecoder.

Reasoning:

  • CausalBert doesn't make sense if the encoder wasn't trained with a causal mask.
  • I think in class naming it's more important to give someone a sense of how to use something than how that thing was trained, but that's not an opinion I hold strongly.

Anyway, your signatures look super clean, easy, and consistent!
Excited to try these out + happy to help check metrics.

@patrickvonplaten
Contributor Author

patrickvonplaten commented Sep 7, 2020

@LysandreJik,

There are a couple of problems with that:

  1. I also need a different BertEmbeddings, or I have to manually set self.token_type_embeddings to a zero matrix. Even if token_type_ids is set to None in BERT, self.token_type_embeddings is always used. This model simply does not have those embeddings (and should not have them, IMO). I could set the self.token_type_embeddings matrix to zero, but then people using this class for training would not realize that a token type embedding matrix is being trained when it shouldn't be. So either way, I think I will need a separate BertEmbeddings class.

  2. A bigger problem is the config class. Because I need both the new CausalBertForCausalLM and BertLMHeadModel in the AUTO_MODELS_FOR_CAUSAL_LM mapping (to leverage both models with the EncoderDecoder framework), the two models have to have different config classes. I guess we could also create a separate config class and overwrite the inherited config class from BertPretrainedModel, but then, IMO, it's cleaner to just create a new PretrainedModel class, in which case we can directly create a completely new model class.

So overall, it seems to me that a separate model class is the cleaner way to go - what do you think?

@sshleifer - very much agree here! I think the naming should be different... BertEncoder is already taken though. I could go for BertForGenerationEncoder, BertForGenerationDecoder, and BertForGenerationConfig - no need for BertForConditionalGeneration, as the `EncoderDecoderModel` will be used for this.

@sshleifer
Contributor

BertForGenerationEncoder and BertForGenerationDecoder and BertForGenerationConfig 👍

I do see Lysandre's point though and would be fine with you setting the token_type matrix to 0 if it's small (which I think it is).
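
For illustration, a minimal sketch of that zeroing workaround on a standard BertModel (illustrative only; the PR ultimately adds a separate class without token type embeddings instead):

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# zero out the token type embedding matrix and freeze it so it neither
# contributes to the forward pass nor gets updated during fine-tuning
with torch.no_grad():
    model.embeddings.token_type_embeddings.weight.zero_()
model.embeddings.token_type_embeddings.weight.requires_grad = False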

Collaborator

@sgugger sgugger left a comment

This looks great to me, the renaming aside. Since the names have been in a release already, I think we need proper deprecation warnings before removing those old names.

@@ -22,7 +22,7 @@
from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig
from .configuration_bart import BartConfig
from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_causal_bert import CausalBertConfig
Collaborator

This class has been in a release already. We can't remove it without proper deprecation warnings.

Contributor Author

This was a weird diff -> the config, tokenizer and model CausalBert... were never in the library - I added them yesterday to the PR. If you look at the changed files now you can see that no previous model names are removed :-)

Contributor Author

Will still need 1-2 days to finish the PR, including integration tests, model cards, etc... so no need to review yet :-)

Collaborator

Oh then in that case, no problem with renaming things :-)

@@ -418,9 +418,9 @@
TransfoXLPreTrainedModel,
load_tf_weights_in_transfo_xl,
)
from .modeling_causal_bert import (
CausalBertModel,
Collaborator

Same for those names.

@@ -144,7 +144,7 @@
from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_causal_bert import CausalBertTokenizer
from .tokenization_bert import BertForSeqGenerationTokenizer
Collaborator

Same here

@patrickvonplaten
Contributor Author

patrickvonplaten commented Sep 8, 2020

The summarization models seem to work and are uploaded here:

https://huggingface.co/models?search=google%2Froberta2roberta

@patrickvonplaten patrickvonplaten force-pushed the add_seq2seq_tf_hub_conversion_script branch from 1b1a2c8 to a6392eb Compare September 9, 2020 17:40
elif (
hasattr(self.config, "decoder")
and hasattr(self.config.decoder, "bos_token_id")
and self.config.decoder.bos_token_id is not None
Contributor Author

need one more check for this

Contributor

(out of scope)
I would be down for a helper method
determine_decoder_start_token_id to get this out of the main block.
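
A rough sketch of what such a helper could look like (hypothetical name and structure, mirroring the checks in the snippet above):

def _determine_decoder_start_token_id(self) -> int:
    # prefer an explicit decoder_start_token_id, then the model's bos_token_id,
    # then fall back to the decoder sub-config of an encoder-decoder model
    # (the case handled by the `elif` above)
    if getattr(self.config, "decoder_start_token_id", None) is not None:
        return self.config.decoder_start_token_id
    if getattr(self.config, "bos_token_id", None) is not None:
        return self.config.bos_token_id
    decoder_config = getattr(self.config, "decoder", None)
    if decoder_config is not None and getattr(decoder_config, "bos_token_id", None) is not None:
        return decoder_config.bos_token_id
    raise ValueError("`decoder_start_token_id` or `bos_token_id` has to be defined for encoder-decoder generation")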

@@ -22,7 +22,6 @@
import sentencepiece as spm

from .tokenization_utils import PreTrainedTokenizer
from .tokenization_xlnet import SPIECE_UNDERLINE
Contributor Author

small clean-up

@patrickvonplaten patrickvonplaten changed the title [WIP, Seq2Seq] Port google EncoderDecoder pretrained checkpoint models into EncoderDecoder framework Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. Sep 9, 2020
@patrickvonplaten
Contributor Author

UPDATE: the PR is ready for review @sshleifer @LysandreJik @sgugger.

Would be awesome if you could take a look.

@codecov

codecov bot commented Sep 9, 2020

Codecov Report

Merging #6594 into master will increase coverage by 2.11%.
The diff coverage is 75.76%.

@@            Coverage Diff             @@
##           master    #6594      +/-   ##
==========================================
+ Coverage   78.37%   80.49%   +2.11%     
==========================================
  Files         164      167       +3     
  Lines       31026    31314     +288     
==========================================
+ Hits        24318    25207     +889     
+ Misses       6708     6107     -601     
Impacted Files Coverage Δ
src/transformers/tokenization_t5.py 95.23% <ø> (-0.05%) ⬇️
src/transformers/tokenization_auto.py 91.52% <40.00%> (-4.78%) ⬇️
src/transformers/modeling_encoder_decoder.py 88.78% <50.00%> (-3.22%) ⬇️
src/transformers/modeling_bert_generation.py 69.19% <69.19%> (ø)
src/transformers/tokenization_bert_generation.py 94.64% <94.64%> (ø)
src/transformers/__init__.py 99.33% <100.00%> (+0.01%) ⬆️
src/transformers/configuration_auto.py 93.61% <100.00%> (+0.13%) ⬆️
src/transformers/configuration_bert_generation.py 100.00% <100.00%> (ø)
src/transformers/file_utils.py 82.41% <100.00%> (-0.26%) ⬇️
src/transformers/generation_utils.py 96.92% <100.00%> (-0.28%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15478c1...aa953cb.

Contributor

@sshleifer sshleifer left a comment

Nice!


*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.*

Tips:
Contributor

I would make sure a usage example is before Tips (or right after)

@@ -0,0 +1,44 @@
BertForSeqGeneration
Contributor

Don't see why Seq should be in the name. What other kind of generation might a confused person be thinking of?
Don't feel strongly.

Contributor Author

changed it to BertGeneration

}

@slow
def test_roberta2roberta_summarization(self):
Contributor

👍

Contributor

does generation with model.half() work?

Contributor Author

Maybe, not sure - will check.
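
A quick sketch of how one might check that (assumes a CUDA device and reuses the summarization checkpoint from above):

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_bbc")
# cast the model to fp16 and move it to the GPU before generating
model = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_bbc").half().to("cuda")

input_ids = tokenizer("Some article text ...", return_tensors="pt").input_ids.to("cuda")  # placeholder text
summary_ids = model.generate(input_ids)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))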

Collaborator

@sgugger sgugger left a comment

Looks great to me! I mostly have annoying nits about the docs, cause I'm an annoying person.

@patrickvonplaten
Contributor Author

Looks great to me! I mostly have annoying nits about the docs, cause I'm an annoying person.

Haha, no you are 100% right - sorry for being so sloppy with the docs! I should have learnt it by now ....

@patrickvonplaten
Contributor Author

@sshleifer @sgugger - thanks a lot for your suggestions. I went for the names BertGenerationEncoder and BertGenerationDecoder. I think that's the best trade-off: short and concise without being confusing.
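
A minimal sketch of how the two new classes could be combined via the EncoderDecoder framework (assuming the final class names from this PR; 101/102 are BERT's [CLS]/[SEP] ids, reused here as BOS/EOS):

from transformers import BertGenerationDecoder, BertGenerationEncoder, BertTokenizer, EncoderDecoderModel

# warm-start the encoder from a public BERT checkpoint
encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
# warm-start the decoder, adding cross-attention layers and a causal mask
decoder = BertGenerationDecoder.from_pretrained(
    "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")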

Member

@LysandreJik LysandreJik left a comment

Great, very cool!!

@patrickvonplaten patrickvonplaten merged commit 7fd1feb into huggingface:master Sep 10, 2020
@djstrong

djstrong commented Sep 24, 2020

Have the "share" models been implemented? In the paper, they achieve the best results on many tasks.

@patrickvonplaten
Contributor Author

Yes, you can find them under google/roberta2roberta

@djstrong

Thank you. How do I tie the weights in the code for training my own model?

@patrickvonplaten
Contributor Author

`tie_encoder_decoder=True` - the code in this model card should show you how to do it :-) https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16
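
For reference, a minimal sketch of the weight-tying setup (assuming RoBERTa checkpoints; the linked model card has the full fine-tuning code):

from transformers import EncoderDecoderModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# tie_encoder_decoder=True ties the decoder weights to the encoder weights,
# giving the parameter-shared ("share") setup from the paper
shared_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True
)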

Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (huggingface#6594)

* add conversion script
* improve conversion script
* make style
* add tryout files
* fix
* update
* add causal bert
* better names
* add tokenizer file as well
* finish causal_bert
* fix small bugs
* improve generate
* change naming
* renaming
* renaming
* renaming
* remove leftover files
* clean files
* add fix tokenizer
* finalize
* correct slow test
* update docs
* small fixes
* fix link
* adapt check repo
* apply sams and sylvains recommendations
* fix import
* implement Lysandres recommendations
* fix logger warn