Merge bart generate into default generate #3140
Conversation
clarification:
Example: ARTICLE_TO_SUMMARIZE = "code doesnt generate EOS if max_length is hit"
inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], return_tensors='pt')
generated_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=4, max_length=5)
summary_text = tokenizer.decode(generated_ids[0])
print(generated_ids[0], summary_text)
# (tensor([ 0, 2387, 964, 32, 3035]), '<s>My friends are cool')
Force-pushed from 1963a94 to 87a78f6
Ok good, just a fix on the config.
Regarding the question of the attention mask, here is my (more general) take on it:
In the library, we initially decided not to mix models and tokenizers, and thus ask the users to compute the attention mask outside of the forward pass (it is currently provided by the tokenizers).
Looking back on it, I think it was a good idea to split the two building blocks from the point of view of allowing the user to better understand what's happening and organize their workflow. On the other hand, the very specific case of designing an attention mask has become a typical silent error for many people using the models: they pad the input in batches and then forget to feed the attention mask to the model forward pass.
As we now have a common configuration file, the padding token index is available in the model, and I think we could propose a general (non-breaking) evolution of this behavior for the users, to reduce this source of silent mistakes.
I would thus like to propose the following workflow for the forward pass of all models (a minimal sketch follows below):
- if an attention_mask is provided, we use it (like now)
- if no attention_mask is provided but we have a padding_token_id in the configuration, we build a specific attention_mask for the input from the padding token (new behavior, could be breaking in some hacky usages of the lib where a padding token is used for something other than padding...)
- if no attention_mask is provided and we don't have a padding_token_id in the configuration, we build a generic full-attention attention_mask (like we did before).
What do you think, especially @LysandreJik and @julien-c?
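A minimal sketch of this fallback logic, assuming a standalone helper whose name and signature are purely illustrative (not actual library code):

```python
import torch

def resolve_attention_mask(input_ids, attention_mask=None, pad_token_id=None):
    # 1) an explicit attention_mask always wins (current behavior)
    if attention_mask is not None:
        return attention_mask
    # 2) no mask given, but the config defines a padding token:
    #    derive the mask from the padded positions
    if pad_token_id is not None:
        return (input_ids != pad_token_id).long()
    # 3) no mask and no padding token: fall back to full attention
    return torch.ones_like(input_ids)
```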
src/transformers/modeling_utils.py
Outdated
@@ -806,37 +816,66 @@ def generate(
    effective_batch_size * num_beams, input_ids_len
)  # shape: (batch_size * num_return_sequences * num_beams, cur_len)

# TODO (PVP): probably not the best way to check whether model is encoder decoder
is_encoder_decoder = (
Good point, this is something to clean up. Let's add an is_encoder_decoder attribute in the configuration of Bart and T5, ok?
cc @sshleifer
We should also clean up the T5 implementation: this is_decoder flag should not be in the config but should be an argument to the __init__ method:
transformers/src/transformers/modeling_t5.py
Line 193 in 7336896
self.is_decoder = config.is_decoder
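A rough sketch of what this change could look like, using simplified stand-in classes rather than the real T5 modules:

```python
class T5StackConfigFlag:
    """Current pattern: decoder-ness is read from the shared config."""
    def __init__(self, config):
        self.is_decoder = config.is_decoder

class T5StackExplicitFlag:
    """Suggested pattern: the caller passes the flag to __init__ explicitly."""
    def __init__(self, config, is_decoder=False):
        self.config = config
        self.is_decoder = is_decoder
```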
Force-pushed from 87a78f6 to c4da5cc
Sounds good to me.
By the way, my workflow proposition actually implied that we should use the same workflow and inputs for the …
Would love to give it a deeper read before you merge, ping me!
src/transformers/modeling_utils.py
Outdated
eos_token_id,
# bos_token_id,
# eos_token_id, # Why eos_token_id here? bos_token_id seems to work as well ... to see if it works as well with hard summarization case
eos_token_id,  # TODO (PVP): to check if this is the only solution -> quite hacky to do this
The fairseq way is even crazier! Look for prefix_tokens in sequence_generator.py.
Ok, not sure whether all encoder-decoder models should start with eos_token_id though. Happy to discuss! @thomwolf @sshleifer
Why should the encoder-decoder output sequence start with eos_token_id? This seems both weird and probably not generalizable to models beyond Bart to me.
Can you explain this better for me?
(read the prefix tokens comment before this one)
My understanding is that since we know we are going to generate BOS first, we want the score associated with BOS to be P(BOS | EOS), because that seems reasonable. So the first step, even though we will force BOS, we send EOS as input_ids to get the appropriate score.
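A small self-contained sketch of this scoring trick; the batch size, vocabulary size, and token ids below are made up for illustration:

```python
import torch
import torch.nn.functional as F

bos_token_id, vocab_size = 0, 8
# Pretend these are the decoder logits for the step whose input token was EOS.
next_token_logits = torch.randn(2, vocab_size)        # (batch_size, vocab_size)
log_probs = F.log_softmax(next_token_logits, dim=-1)

# Force BOS: rule out every other token with -inf, but keep the true
# log-probability of BOS so the hypothesis is still scored with P(BOS | EOS).
forced_scores = torch.full_like(log_probs, float("-inf"))
forced_scores[:, bos_token_id] = log_probs[:, bos_token_id]
```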
Codecov Report
@@            Coverage Diff             @@
##           master    #3140      +/-   ##
==========================================
- Coverage   78.14%   77.99%   -0.16%
==========================================
  Files          98       98
  Lines       16668    16665       -3
==========================================
- Hits        13026    12998      -28
- Misses       3642     3667      +25
Continue to review full report at Codecov.
src/transformers/modeling_utils.py
Outdated
scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)
if (
    self.config.is_encoder_decoder and do_sample is False
):  # TODO(PVP) to be refactored later - do we need this boolean flag here? Also only add for beam_search or also for no_beam_search? The prepare scores fn is ugly here
Very hacky here - don't like the if statement and the prepare_scores_for_generation fn too much. Better ideas? @sshleifer @thomwolf
I agree and I don't understand why we need the prepare_scores_for_generation method (cf. my comment on it).
Let's talk about it @sshleifer @patrickvonplaten
UPDATE: This version passes all integration tests now. There are two things which I find quite hacky and which we should probably implement in a cleaner way:
Force-pushed from a4ba33c to eb3ba73
Ok, it's great to see a convergence here.
As mentioned in the comments, there are a couple of methods that I don't understand and which seem to be potentially very Bart-specific, so I'm not sure we want them in the base class (I don't really understand their use).
Let's discuss them before merging.
src/transformers/modeling_utils.py
Outdated
    device=next(self.parameters()).device,
)
cur_len = 1
self.model.decoder.generation_mode = True
I don't like this generation_mode in Bart, which is rather specific to the fairseq way of doing it and probably won't be used in other encoder-decoder models.
I'll remove it in my review of Bart if I can. In the meantime, for this model-specific stuff we can either (1) activate it in the model-specific prepare_inputs_for_generation or (2) run it behind an if hasattr(self.model, 'generation_mode') test (see the toy sketch below).
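A toy sketch of option (2), the hasattr guard; the class names here are placeholders, only the attribute name comes from the discussion:

```python
class ToyDecoder:
    generation_mode = False  # model-specific flag, only some decoders define it

class ToyModel:
    def __init__(self):
        self.decoder = ToyDecoder()

model = ToyModel()
# Only flip the flag if the decoder actually exposes it, so models without
# generation_mode are left untouched.
if hasattr(model.decoder, "generation_mode"):
    model.decoder.generation_mode = True
```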
Got it.
src/transformers/modeling_utils.py
Outdated
if num_beams > 1:
    output = self._generate_beam_search(
        input_ids,
        cur_len,
        max_length,
        min_length,
For these long lists of arguments, I think it's safer to have them all passed by keyword:
max_length=max_length,
min_length=min_length,
I know it's longer to write, but it's more robust to changes in these methods' APIs later on.
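A small illustration of the difference, using a stand-in function rather than the real _generate_beam_search signature:

```python
def generate_beam_search(input_ids, cur_len, max_length, min_length):
    # stand-in body; the real method runs beam search
    return input_ids

# Positional call: silently breaks if the parameter order ever changes.
output = generate_beam_search([0, 1, 2], 1, 20, 5)

# Keyword call: longer to write, but robust to later changes in the method's API.
output = generate_beam_search(input_ids=[0, 1, 2], cur_len=1, max_length=20, min_length=5)
```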
Though if you do it, be careful not to introduce bugs in the LHS/RHS equality; better to do it with multi-cursors if you can.
sounds good
}

def prepare_scores_for_generation(self, scores, cur_len, max_length):
This method is really weird. Do we need that?
Can one of you explain this to me? @sshleifer @patrickvonplaten
fairseq generation (for all models) has the following rules:
- Never generate pad or unk.
- If we have hit the max_length constraint, force the model to predict EOS.
  - When I say force, I mean set all other words' scores to -inf, but keep the true probability of the forced token so that the hypothesis is scored accurately.
- Optional tensor argument prefix_tokens: for each step i, force the model to predict prefix_tokens[:, step].
  - For bart, the prefix tokens are a (batch_size, 1) tensor filled with BOS.
  - For language_modeling, I believe the prefix tokens are the whole input_ids sequence (what they call src_tokens, we call input_ids).
That's what this function is doing, deciding what to force based on the step.
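A condensed sketch of these rules; the function and argument names are illustrative and not the fairseq or transformers implementation:

```python
import torch

def constrain_scores(scores, cur_len, max_length, eos_token_id, pad_token_id,
                     unk_token_id, prefix_tokens=None):
    # scores: (batch_size * num_beams, vocab_size) log-probabilities
    scores[:, pad_token_id] = float("-inf")  # never generate pad
    scores[:, unk_token_id] = float("-inf")  # never generate unk

    if prefix_tokens is not None and cur_len < prefix_tokens.shape[1]:
        # Force the prefix token for this step, keeping its true score.
        forced = prefix_tokens[:, cur_len].unsqueeze(-1)
        kept = scores.gather(1, forced)
        scores[:] = float("-inf")
        scores.scatter_(1, forced, kept)

    if cur_len == max_length - 1:
        # At the length limit, force EOS but keep its true score so the
        # hypothesis is still ranked by its real log-probability.
        eos_scores = scores[:, eos_token_id].clone()
        scores[:] = float("-inf")
        scores[:, eos_token_id] = eos_scores
    return scores
```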
Super interesting/difficult PR. I left comments explaining some of the tricks.
High-level points:
- Nearly all of the tricks are from fairseq's sequence_generator.py, which is used for nearly all conditional generation as far as I can tell.
- I still think it is somewhat better to wait for this refactor until we have more seq2seq callers, but I don't feel strongly and am open to being outvoted.
- Happy to discuss on any forum!
@thomwolf, @sshleifer Thanks for your reviews guys. I think we are on the same page except for two things:
My opinion is:
Doing this would mean we would have to change the integration tests and we won't produce 1:1 the same results as fairseq anymore. @sshleifer you can probably estimate the consequences for the Bart summarization quality much better than me! From the examples in the … What do you guys think?
Force-pushed from f173dd5 to 4f9d1f2
I ran eval on the full CNN test set with these simplifications, and Rouge decreases from .21072 to .20285. For context, here are the published Rouge-2 scores of a bunch of different models:
Note: the published bart score is a bit higher (.2128) because there are even more tricks I didn't implement.
Force-pushed from 9b12642 to 1ba21f9
Ok, let's merge this to be able to move forward with T5. cc @craffel
@patrickvonplaten You mentioned a way to perform batch inference with … Does this PR make this possible by calling …?
Hi @thesamuel, not yet completely. It's one step towards making this generation possible, as shown in #3021, but there are still two things that are not considered yet in the generate fn:
We are currently thinking about how to implement this!
@patrickvonplaten Got it, thanks!
(Investigating) This PR may introduce a BOS bug that reduces Rouge to 15.068 from 21.28.
Simple bug caused by …
Possible follow-up steps are explained in PR #3225.
This PR is a first version of how the bart generation() code can be merged into the "default" generation function. I think it's actually much more feasible than we originally thought.
Please note that this change also includes all the changes from PR #3135, so the code changes will be much less cluttered once #3135 is merged.
This version passes the general random language generation tests found in test_modeling_common.py and the easy integration test with the original fairseq model (renamed to test_cnn_summarization_same_as_fairseq_easy in test_modeling_bart).
There are a couple of things we should discuss:
1. For both Bart generate() and the default generate(), encoder-decoder models must have a BOS token and an EOS token.
2. Two new parameters were added: min_length and no_repeat_ngram_size. I think these parameters should be added generally, as it is done now.
3. There was one hack which initializes the decoder_input_ids to the EOS token and then forces the model to generate the BOS token afterwards (see comment in code line). I changed it to simply start with the BOS token (which makes more sense) and it also passed the "easy integration tests". This hack might be needed to pass the hard integration test though.
4. Fairseq forces the last token of all beam hypotheses to be the EOS token (see comment in line). This is probably necessary to pass the integration tests. It's up for debate whether this is the correct way. I would prefer not to do it this way because it will then be impossible to generate unfinished sentences (sentences that end because they hit max_length). If one really wants all beam hypotheses to be finished, one could set max_length higher than usual and set the parameter self.early_stopping in the BeamHypotheses class to True. Up for debate how to handle this.
5. In order to also pass the hard integration tests (which have a padded batch as input), we will have to add attention_masks to the generate() function. Here I see three possibilities:
   a) add the attention_mask as a parameter to the generation() function;
   b) automatically calculate the attention_mask from the input_ids if the model has a pad_token_id and there is a pad_token_id in the input_ids;
   c) not allow padded batches for the moment.
   I would prefer option b) because some models do have a set pad_token_id (such as Bart), so we should be able to allow padded generation; a usage sketch of what option b) would enable follows below.
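To make option b) concrete, here is a hypothetical usage sketch for a padded batch. The tokenizer/model class names and the checkpoint id are assumptions rather than something defined in this PR; the point is only that no attention_mask would need to be built or passed by hand:

```python
from transformers import BartTokenizer, BartForConditionalGeneration  # assumed class names

tokenizer = BartTokenizer.from_pretrained("bart-large-cnn")  # assumed checkpoint id
model = BartForConditionalGeneration.from_pretrained("bart-large-cnn")

articles = [
    "short article.",
    "a considerably longer article that forces padding of the first one.",
]
batch = tokenizer.batch_encode_plus(articles, pad_to_max_length=True, return_tensors="pt")

# Under option b), generate() would derive the attention mask internally from
# config.pad_token_id and the padded positions in input_ids.
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```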