
Merge bart generate into default generate #3140

Conversation

patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Mar 5, 2020

This PR is a first version of how the Bart generate() code can be merged into the "default" generation function. I think it's actually much more feasible than we originally thought.
Please note that this change also includes all the changes from PR #3135, so the code changes will be much less cluttered after #3135 is merged.

This version passes the general random language generation tests found in test_modeling_common.py and the easy integration test with the original fairseq model (renamed to test_cnn_summarization_same_as_fairseq_easy in test_modeling_bart)

There are a couple of things we should discuss:

  1. For both the Bart generate() and the default generate(), encoder-decoder models must have a BOS token and an EOS token.

  2. Two new parameters were added: min_length and no_repeat_ngram_size. I think these parameters should be added to the general function, as is done now.

  3. There was one hack which initializes the decoder_input_ids to the EOS token and then forces the model to generate the BOS token afterwards (see comment in code line). I changed it to simply start with the BOS token (which makes more sense), and it also passed the "easy" integration tests. This hack might be needed to pass the hard integration test though.

  4. Fairseq forces the last token of all beam hypotheses to be the EOS token (see comment in line). This is probably necessary to pass the integration tests. It's up for debate whether this is the correct way. I would prefer not to do it this way because it would then be impossible to generate unfinished sentences (sentences that end because they hit max_length). If one really wants all beam hypotheses to be finished, one could set max_length higher than usual and set the parameter self.early_stopping in the BeamHypotheses class to True. Up for debate how to handle this.

  5. In order to also pass the hard integration tests (which have a padded batch as input), we will have to add attention_masks to the generate() function. Here I see three possibilities:
    a) add the attention_mask as a parameter to the generate() function.
    b) automatically calculate the attention_mask from the input_ids if the model has a pad_token_id and there is a pad_token_id in the input_ids.
    c) not allow padded batches for the moment.
    I would prefer option b) because some models do have a set pad_token_id (such as Bart), so we should be able to allow padded generation (see the sketch below).
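
For option b), a minimal sketch of what deriving the mask inside generate() could look like (the pad_token_id value and token ids below are illustrative, not taken from this PR):

import torch

pad_token_id = 1                                  # illustrative; the model config defines the real value
input_ids = torch.tensor([[0, 9064, 1322, 2],
                          [0, 9064, 2, 1]])       # second sequence is right-padded

# only derive a mask from the pad token when padding actually occurs in the batch
if pad_token_id is not None and (input_ids == pad_token_id).any():
    attention_mask = input_ids.ne(pad_token_id).long()
else:
    attention_mask = torch.ones_like(input_ids)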

@sshleifer
Contributor

clarification:

bart.generate doesn't add EOS if max_length is hit, or require EOS to pass integration tests. It just "finalizes" a hypothesis when the model predicts EOS for a beam.

Example:

# assumes `tokenizer` and `model` are an already-loaded Bart tokenizer and summarization model
ARTICLE_TO_SUMMARIZE = "code doesnt generate EOS if max_length is hit"
inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], return_tensors='pt')

generated_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=4, max_length=5)
summary_text = tokenizer.decode(generated_ids[0])
print(generated_ids[0], summary_text)
# (tensor([   0, 2387,  964,   32, 3035]), '<s>My friends are cool')

Member

@thomwolf thomwolf left a comment

Ok good, just a fix on the config.

Regarding the question of the attention mask, here is my (more general) take on it:

In the library, we initially decided not to mix models and tokenizers and thus ask the users to compute the attention mask outside of the forward pass (it is currently provided by the tokenizers).

Looking back on it, I think it was a good idea to split the two building blocks from the point of view of allowing the user to better understand what's happening and organize their workflow. On the other hand, the very specific case of designing an attention mask has become a typical silent error for many people using the models: they pad the input in batches and then forget to feed the attention mask to the model's forward pass.

As we start having a common configuration file, the padding token index is now available in the model, and I think we could propose a general (non-breaking) evolution of this behavior for the users, to reduce this source of silent mistakes.

I would thus like to propose the following workflow for the forward pass of all models:

  • if an attention_mask is provided, we use it (like now)
  • if no attention_mask is provided but we have a padding_token_id in the configuration, we build a specific attention_mask for the input from the padding token (new behavior, could be breaking in some hacky usages of the lib where a padding token is used for something other than padding...)
  • if no attention_mask is provided and we don't have a padding_token_id in the configuration, we build a generic full-attention attention_mask (like we did before); a sketch follows below.
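
A minimal sketch of the proposed fallback order (a hypothetical helper for illustration, not the library's actual implementation):

import torch


def resolve_attention_mask(input_ids, attention_mask=None, pad_token_id=None):
    # 1) an explicitly provided mask always wins (current behavior)
    if attention_mask is not None:
        return attention_mask
    # 2) otherwise, if the configuration defines a padding token, derive the mask from it
    if pad_token_id is not None:
        return input_ids.ne(pad_token_id).long()
    # 3) otherwise, fall back to a full-attention mask (previous default behavior)
    return torch.ones_like(input_ids)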

What do you think, especially @LysandreJik and @julien-c?

@@ -806,37 +816,66 @@ def generate(
effective_batch_size * num_beams, input_ids_len
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)

# TODO (PVP): probably not the best way to check whether model is encoder decoder
is_encoder_decoder = (
Member

Good point, this is something to clean up. Let's add an is_encoder_decoder attribute in the configuration of Bart and T5, ok?
cc @sshleifer

We should also clean up the T5 implementation; this is_decoder flag should not be in the config but should be an argument to the __init__ method:

self.is_decoder = config.is_decoder
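
A hypothetical sketch of the suggested pattern (the class name is made up for illustration; this is not the actual T5 code):

import torch.nn as nn


class DecoderStack(nn.Module):
    def __init__(self, config, is_decoder=False):
        super().__init__()
        # the flag is passed explicitly at construction time
        # instead of being read from the shared config object
        self.is_decoder = is_decoder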

@patrickvonplaten patrickvonplaten force-pushed the merge_bart_generate_into_default_generate branch from 87a78f6 to c4da5cc Compare March 6, 2020 13:39
@julien-c
Member

julien-c commented Mar 6, 2020

I would thus like to propose the following workflow for the forward pass of all models:
[...]
What do you think especially @LysandreJik and @julien-c

Sounds good to me

@thomwolf
Member

thomwolf commented Mar 6, 2020

By the way, my workflow proposal actually implied that we should use the same workflow and inputs for the generate() method as well (I could have been more explicit).

Contributor

@sshleifer sshleifer left a comment

Would love to give it a deeper read before you merge; ping me!

eos_token_id,
# bos_token_id,
# eos_token_id, # Why eos_token_id here? bos_token_id seems to work as well ... to see if it works as well with hard summarization case
eos_token_id, # TODO (PVP): to check if this is the only solution -> quite hacky to do this
Contributor

The fairseq way is even crazier! Look for prefix_tokens in sequence_generator.py.

Contributor Author

@patrickvonplaten patrickvonplaten Mar 7, 2020

Ok, not sure whether all encoder-decoder models should start with eos_token_id though. Happy to discuss! @thomwolf @sshleifer

Member

Why should the encoder-decoder output sequence start with eos_token_id?
This seems both weird and probably not generalizable to models beyond Bart to me.
Can you explain this in more detail?

Contributor

(read the prefix tokens comment before this one)
My understanding is that since we know we are going to generate BOS first, we want the score associated with BOS to be P(BOS | EOS), because that seems reasonable. So at the first step, even though we will force BOS, we send EOS as input_ids to get the appropriate score.
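
A minimal sketch of the starting point being described (batch size and token ids are illustrative, not Bart's actual values):

import torch

batch_size = 2
eos_token_id = 2   # illustrative ids
bos_token_id = 0

# step 0: the decoder is fed EOS as its only input ...
decoder_input_ids = torch.full((batch_size, 1), eos_token_id, dtype=torch.long)
# ... so the distribution produced at this step is log P(next_token | EOS);
# BOS is then forced as the first generated token while keeping its true
# score P(BOS | EOS), which keeps the beam hypothesis scored accurately.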

@codecov-io

codecov-io commented Mar 7, 2020

Codecov Report

Merging #3140 into master will decrease coverage by 0.15%.
The diff coverage is 74.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3140      +/-   ##
==========================================
- Coverage   78.14%   77.99%   -0.16%     
==========================================
  Files          98       98              
  Lines       16668    16665       -3     
==========================================
- Hits        13026    12998      -28     
- Misses       3642     3667      +25
Impacted Files Coverage Δ
src/transformers/configuration_bart.py 100% <ø> (ø) ⬆️
src/transformers/configuration_utils.py 96.82% <100%> (+0.07%) ⬆️
src/transformers/modeling_bart.py 96.27% <100%> (+2.69%) ⬆️
src/transformers/configuration_t5.py 96.55% <100%> (ø) ⬆️
src/transformers/modeling_tf_utils.py 87.47% <46.96%> (-6.27%) ⬇️
src/transformers/modeling_utils.py 93.84% <90.32%> (-0.57%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6de642...bc9d5d9. Read the comment docs.

scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
if (
self.config.is_encoder_decoder and do_sample is False
): # TODO(PVP) to be refactored later - do we need this boolean flag here? Also Only add for beam_search or also for no_beam_search? The prepare scores fn is ugly here
Contributor Author

Very hacky here - don't like the if statement and the prepare_scores_for_generation fn too much. Better ideas? @sshleifer @thomwolf

Member

I agree and I don't understand why we need the prepare_scores_for_generation method (cf. my comment on it).
Let's talk about it @sshleifer @patrickvonplaten

@patrickvonplaten
Contributor Author

UPDATE: This version passes all integration tests now. There are two things which are quite hacky and which we should probably implement in a cleaner way:

  • the function prepare_scores_for_generation is a hacky function to make Bart pass the integration tests. See in code-line comment
  • initializing all encoder-decoder models with the EOS token is probably not the right thing to do. See in code-line comment.

@sshleifer @thomwolf @LysandreJik

@patrickvonplaten patrickvonplaten changed the title [Discussion] Merge bart generate into default generate Merge bart generate into default generate Mar 7, 2020
@patrickvonplaten patrickvonplaten force-pushed the merge_bart_generate_into_default_generate branch from a4ba33c to eb3ba73 Compare March 8, 2020 13:26
Member

@thomwolf thomwolf left a comment

Ok, it's great to see convergence here.

As mentioned in the comments, there are a couple of methods that I don't understand and which seem to be potentially very Bart-specific, so I'm not sure we want them in the base class (I don't really understand their use).

Let's discuss them before merging.

device=next(self.parameters()).device,
)
cur_len = 1
self.model.decoder.generation_mode = True
Member

I don't like this generation_mode in Bart; it is rather specific to the fairseq way of doing it and probably won't be used in other encoder-decoder models.

I'll remove it in my review of Bart if I can. In the meantime, for this model-specific stuff we can either (1) activate it in the model-specific prepare_inputs_for_generation or (2) run it behind an if hasattr(self.model, 'generation_mode') test.
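
For illustration, a minimal sketch of option (2); the attribute location follows the diff above, the helper itself is hypothetical:

def _maybe_enable_generation_mode(model):
    # only flip the flag when the wrapped model actually defines it (option 2);
    # Bart exposes it on model.decoder per the diff above
    decoder = getattr(model, "decoder", None)
    if decoder is not None and hasattr(decoder, "generation_mode"):
        decoder.generation_mode = True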

Contributor Author

Got it.


if num_beams > 1:
output = self._generate_beam_search(
input_ids,
cur_len,
max_length,
min_length,
Member

For these long lists of arguments, I think it's safer to have them all passed by keyword:

max_length=max_length,
min_length=min_length,

I know it's longer to write, but it's more robust to later changes in these methods' API.

Member

Though if you do it, be careful not to introduce bugs in the LHS/RHS equality; better to do it with multi-cursors if you can.

Contributor Author

Sounds good.

eos_token_id,
# bos_token_id,
# eos_token_id, # Why eos_token_id here? bos_token_id seems to work as well ... to see if it works as well with hard summarization case
eos_token_id, # TODO (PVP): to check if this is the only solution -> quite hacky to do this
Member

Why should the encoder-decoder output sequence start with eos_token_id?
This seems both weird and probably not generalizable to models beyond Bart to me.
Can you explain this in more detail?

}

def prepare_scores_for_generation(self, scores, cur_len, max_length):
Member

This method is really weird. Do we need that?
Can one of you explain this to me? @sshleifer @patrickvonplaten

Contributor

fairseq generation (for all models) has the following rules:

  1. Never generate pad or unk.
  2. If we have hit the max_length constraint, force the model to predict EOS.
    • when I say force, I mean set all other words' scores to -inf, but keep the true probability of the forced token so that the hypothesis is scored accurately.
  3. Optional tensor argument prefix_tokens: for each step, force the model to predict prefix_tokens[:, step].
    • for bart the prefix tokens are a (batch_size, 1) tensor filled with BOS
    • for language modeling, I believe the prefix tokens are the whole input_ids sequence.

(what they call src_tokens, we call input_ids)

That's what this function is doing: deciding what to force based on the step.
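
A minimal sketch of the forcing trick described above (a simplified stand-in for prepare_scores_for_generation; token ids and shapes are illustrative):

import torch


def force_token(scores, token_id):
    # set every other token's score to -inf but keep the forced token's true
    # log-probability so the beam hypothesis is still scored accurately
    forced = torch.full_like(scores, float("-inf"))
    forced[:, token_id] = scores[:, token_id]
    return forced


bos_token_id, eos_token_id, pad_token_id = 0, 2, 1     # illustrative ids
cur_len, max_length = 1, 20
scores = torch.randn(4, 50).log_softmax(dim=-1)        # (batch * beams, vocab) stand-in

scores[:, pad_token_id] = float("-inf")                # rule 1: never generate pad/unk
if cur_len == 1:
    scores = force_token(scores, bos_token_id)         # rule 3: force the prefix token (BOS)
elif cur_len == max_length - 1:
    scores = force_token(scores, eos_token_id)         # rule 2: force EOS at the length limit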

scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
if (
self.config.is_encoder_decoder and do_sample is False
): # TODO(PVP) to be refactored later - do we need this boolean flag here? Also Only add for beam_search or also for no_beam_search? The prepare scores fn is ugly here
Member

I agree and I don't understand why we need the prepare_scores_for_generation method (cf. my comment on it).
Let's talk about it @sshleifer @patrickvonplaten

Contributor

@sshleifer sshleifer left a comment

Super interesting/difficult PR. I left comments explaining some of the tricks.

High Level Points:

  • Nearly all of the tricks are from fairseq's sequence_generator.py which is used for nearly all conditional generation as far as I can tell.
  • I still think it is somewhat better to wait for this refactor until we have more seq2seq callers, but I don't feel strongly and am open to being outvoted.
  • Happy to discuss on any forum!

}

def prepare_scores_for_generation(self, scores, cur_len, max_length):
Contributor

fairseq generation (for all models) has the following rules:

  1. Never generate pad or unk.
  2. If we have hit the max_length constraint, force the model to predict EOS.
    • when I say force, I mean set all other words' scores to -inf, but keep the true probability of the forced token so that the hypothesis is scored accurately.
  3. Optional tensor argument prefix_tokens: for each step, force the model to predict prefix_tokens[:, step].
    • for bart the prefix tokens are a (batch_size, 1) tensor filled with BOS
    • for language modeling, I believe the prefix tokens are the whole input_ids sequence.

(what they call src_tokens, we call input_ids)

That's what this function is doing: deciding what to force based on the step.

eos_token_id,
# bos_token_id,
# eos_token_id, # Why eos_token_id here? bos_token_id seems to work as well ... to see if it works as well with hard summarization case
eos_token_id, # TODO (PVP): to check if this is the only solution -> quite hacky to do this
Contributor

(read the prefix tokens comment before this one)
My understanding is that since we know we are going to generate BOS first, we want the score associated with BOS to be P(BOS | EOS), because that seems reasonable. So at the first step, even though we will force BOS, we send EOS as input_ids to get the appropriate score.

@patrickvonplaten
Contributor Author

@thomwolf, @sshleifer Thanks for your reviews guys. I think we are on the same page except for two things:

  1. Force tokens to be generated. The BOS token is generated at step=0 and the EOS token is generated at step=max_length. This is necessary to reproduce the original fairseq results. Should we keep that as a model specific "prepare" function?
  2. Use the EOS token for the Bart models. This is also necessary to reproduce the original fairseq results. Should we keep that?

My opinion is:

  1. I would remove both statements that force a certain token to be predicted. Even without doing it, Bart produces (in my opinion) good summarization results.
  2. Same goes for this point. I would always use the BOS token as the starting decoder_input_ids for encoder-decoder models and require encoder-decoder models to have a BOS token to be able to do language generation.

Doing this would mean we have to change the integration tests, and we would no longer produce exactly the same results as fairseq.

@sshleifer you can probably estimate the consequences for Bart's summarization quality much better than me! I ran the summarization on the examples in test_modeling_bart.py and the output looked good to me, but I didn't measure ROUGE scores or anything...

What do you guys think?

@patrickvonplaten patrickvonplaten force-pushed the merge_bart_generate_into_default_generate branch 2 times, most recently from f173dd5 to 4f9d1f2 Compare March 9, 2020 16:00
@sshleifer
Contributor

I ran eval on the full CNN test set with these simplifications, and Rouge decreases from .21072 to .20285.

For context, here are the published Rouge-2 scores of a bunch of different models:

[image: table of published Rouge-2 scores for several models, not reproduced here]

Note: the published bart score is a bit higher (.2128) because there are even more tricks I didn't implement.

@patrickvonplaten patrickvonplaten force-pushed the merge_bart_generate_into_default_generate branch from 9b12642 to 1ba21f9 Compare March 11, 2020 10:08
@thomwolf
Member

Ok, let's merge this to be able to move forward with T5. cc @craffel
@julien-c @LysandreJik, we think the failing self-hosted tests are not related to this.
Can you have a look later maybe?

@thomwolf thomwolf merged commit db29ffc into huggingface:master Mar 11, 2020
@patrickvonplaten patrickvonplaten deleted the merge_bart_generate_into_default_generate branch March 11, 2020 13:15
@thesamuel

@patrickvonplaten You mentioned a way to perform batch inference with GPT2LMHeadModel using an attention mask here: #3021 (comment).

Does this PR make this possible by calling model.generate(input_ids=..., attention_mask=...)?

@patrickvonplaten
Contributor Author

@patrickvonplaten You mentioned a way to perform batch inference with GPT2LMHeadModel using an attention mask here: #3021 (comment).

Does this PR make this possible by calling model.generate(input_ids=..., attention_mask=...)?

Hi @thesamuel,

Not yet completely. It's one step towards making generation possible as shown in #3021, but there are still two things that are not yet handled in the generate fn:

  1. The position embeddings have to be updated, which generate() does not do yet.
  2. And this is the hard one: if a padded batch is given as input, the next token should not be sampled from the position of the last token but from the last non-padded token, and this can be quite hacky.

We are currently thinking about how to implement this!
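
For the second point, a minimal sketch of picking the logits at the last non-padded position (shapes are illustrative; this is not what generate() does yet):

import torch

# logits: (batch, seq_len, vocab) from the model; attention_mask: (batch, seq_len)
logits = torch.randn(2, 5, 100)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])   # second sequence is right-padded

last_positions = attention_mask.sum(dim=1) - 1     # index of the last real token per sequence
next_token_logits = logits[torch.arange(logits.size(0)), last_positions, :]  # (batch, vocab)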

@thesamuel

@patrickvonplaten Got it, thanks!

@sshleifer
Contributor

sshleifer commented Mar 12, 2020

(Investigating) This PR may introduce a BOS bug that reduces rouge to 15.068 from 21.28

sshleifer added a commit that referenced this pull request Mar 12, 2020
@sshleifer
Contributor

A simple bug caused by do_sample (which for some reason defaults to True). Anyway, I'm rerunning ROUGE, but it will likely be back at a reasonable level.
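
For reference, a hedged example of making the decoding strategy explicit so the default cannot silently switch it (it assumes model, input_ids and attention_mask are prepared as in the earlier summarization example; max_length is arbitrary):

generated_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    num_beams=4,
    max_length=140,
    do_sample=False,   # sampling explicitly turned off to avoid the silent default
)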

@patrickvonplaten
Contributor Author

Possible follow-up steps are explained in PR #3225.
