Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feed forward chunking #6024

Merged
merged 14 commits into from
Aug 11, 2020
Merged

Conversation

Pradhy729
Copy link
Contributor

Official PR for #5928

@Pradhy729
Copy link
Contributor Author

@patrickvonplaten - here's an initial implementation I have. My first step is to get the model to work with chunked feed forward - and it works! I still need to run the benchmark test to find out the benefits in terms of memory.

However, I see a problem. The new architecture causes some of the nn.Module weights and bias parameter-names to change - which would be a problem with loading existing pretrained weights from checkpoints.
For example:
bert.encoder.layer.0.intermediate.dense.weight --> becomes bert.encoder.layer.0.feed_forward.dense.dense.weight

See the failing tests for more details. Any thoughts/ideas on how to get around this?

@@ -159,6 +159,7 @@ def __init__(self, **kwargs):
self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
self.bad_words_ids = kwargs.pop("bad_words_ids", None)
self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
self.chunk_size_feed_forward = kwargs.pop("chunk_size_feed_forward", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be disabled by default -> so would be nice to set it to 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to make sure we don't break backward compatibility...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in modeling_bert.py, refer to #6972. thanks.

@@ -408,8 +432,7 @@ def forward(
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:] # add cross attentions if we output attention weights

intermediate_output = self.intermediate(attention_output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To solve the problem, my suggestions would be to wrap these two calls in a function forward_chunk which is part of this class (def forward_chunk(self, ....)) and call apply_chunking_to_forward(self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output,) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I quite follow what you mean here. Which two calls do you want to wrap?
Did you mean to have a forward_chunk function in the BertLayer class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I fixed it based on your input - looks ok to me now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that's exactly what I meant :-)

@Pradhy729 Pradhy729 force-pushed the feed_forward_chunking branch from ab11e31 to 5621e1f Compare August 7, 2020 04:02
Pradhy729 and others added 5 commits August 6, 2020 21:24
fix the shuffle agrument usage and the default (huggingface#6307)
This is an initial implementation to test applying feed forward chunking for BERT.
Will need additional modifications based on output and benchmark results.
@Pradhy729 Pradhy729 force-pushed the feed_forward_chunking branch from 44aea51 to 5655ddf Compare August 7, 2020 04:25
@codecov
Copy link

codecov bot commented Aug 7, 2020

Codecov Report

Merging #6024 into master will decrease coverage by 0.31%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6024      +/-   ##
==========================================
- Coverage   79.44%   79.12%   -0.32%     
==========================================
  Files         148      148              
  Lines       27193    27198       +5     
==========================================
- Hits        21604    21521      -83     
- Misses       5589     5677      +88     
Impacted Files Coverage Δ
src/transformers/configuration_reformer.py 100.00% <ø> (ø)
src/transformers/configuration_utils.py 96.57% <100.00%> (+0.02%) ⬆️
src/transformers/modeling_bert.py 88.49% <100.00%> (+0.09%) ⬆️
src/transformers/modeling_tf_bert.py 69.31% <0.00%> (-26.18%) ⬇️
src/transformers/tokenization_roberta.py 76.71% <0.00%> (-21.92%) ⬇️
src/transformers/tokenization_utils_base.py 86.43% <0.00%> (-7.42%) ⬇️
src/transformers/tokenization_transfo_xl.py 38.73% <0.00%> (-3.76%) ⬇️
src/transformers/tokenization_utils_fast.py 92.14% <0.00%> (-2.15%) ⬇️
src/transformers/tokenization_openai.py 82.57% <0.00%> (-1.52%) ⬇️
src/transformers/tokenization_bert.py 91.07% <0.00%> (-0.45%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 175cd45...406d621. Read the comment docs.

@patrickvonplaten patrickvonplaten changed the title [WIP] Feed forward chunking Feed forward chunking Aug 8, 2020
@hamad8395

This comment was marked as spam.

@patrickvonplaten
Copy link
Contributor

Hey @Pradhy729, thanks a lot for continuing the PR. I made a couple of changes: fix the docs and added tests for all models, whereas only Reformer and Bert tests are on for now.

Would be great if @LysandreJik @sgugger @thomwolf @sshleifer can review.

This PR shows how feed_forward_chunking can be employed for all models. Feed forward chunking is explained here: https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers in combination with some benchmarking. It can give good memory improvements for certain model architectures.
For Bert a test is added showing that the model gives equal results. This function can easily be added to other models, the same way it was done for BERT. There is no real drawback in implementing this IMO.

To-Do after review is positive:

  1. Add feed forward chunking to more models. @Pradhy729, feel free to add it to as many models as you want. The rest can also be added in a new PR or we open a "first good issue" for it.
  2. Add feed forward chunking for language modeling loss. Chunking of feed forward layers in the attention block is often not really helpful to save memory - only if the model has very few attention heads. On the other hand, a real bottleneck is often the last word embedding layer. When training the loss does not have to be calculated in one huge batch (over time dim), but can be chunked the same way it is done here for Feed forward layers. This is not even really implemented in Reformer yet and would definitely require a new PR.

@Pradhy729
Copy link
Contributor Author

Great! Thanks @patrickvonplaten
Will wait for reviewers and start working on the others.

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very useful work, thanks for tackling this!

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, love how it only requires a few lines of code now that apply_chunking_to_forward is created.

Comment on lines +421 to +423
layer_output = apply_chunking_to_forward(
self.chunk_size_feed_forward, self.seq_len_dim, self.feed_forward_chunk, attention_output
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much a nitpick here, for future PRs probably, but this looks a lot like the gradient checkpointing method from PyTorch. This method takes the callable (the forward) method as first positional argument and I think it makes sense to have it this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do this globally in the new PR where I add the chunking for other models. Let me know if you have concerns with that.

@LysandreJik LysandreJik merged commit b25cec1 into huggingface:master Aug 11, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants