TF Longformer #5764
Conversation
Yes, it is still an issue with the AutoGraph thing 😢 I suggest commenting them out for now.
I'm currently thinking about how to properly rework all the boolean handling in TF, as this is the main issue.
Ok... will leave it for now since the tests are only in `SLOW` mode.
Confirmed that this PR will not slow down the Longformer PyTorch version. Running the benchmark on master gives the same performance as in #5811.
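The exact command wasn't preserved in this copy; a minimal sketch using the library's benchmark utilities (the checkpoint name, batch sizes, and sequence lengths are assumptions, not the original settings):

```python
# Hypothetical reconstruction of the PyTorch benchmark run.
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["allenai/longformer-base-4096"],
    batch_sizes=[1],
    sequence_lengths=[512, 1024, 2048],  # illustrative lengths
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()  # prints inference time and memory per configuration
```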
Benchmarking the model in TF shows a slow-down of ca. 1.5x vs. PyTorch, which is reasonable IMO. At the moment, running the model in XLA on GPU fails => should take a closer look in a next PR.
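The TF command and its output were likewise not preserved; an analogous sketch with the TensorFlow benchmark utilities (same assumed checkpoint and settings):

```python
# Hypothetical TF counterpart of the benchmark above.
from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments

args = TensorFlowBenchmarkArguments(
    models=["allenai/longformer-base-4096"],
    batch_sizes=[1],
    sequence_lengths=[512, 1024, 2048],
)
benchmark = TensorFlowBenchmark(args)
results = benchmark.run()
```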
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #5764      +/-   ##
==========================================
+ Coverage   79.33%   79.38%   +0.05%
==========================================
  Files         148      149       +1
  Lines       27196    27670     +474
==========================================
+ Hits        21577    21967     +390
- Misses       5619     5703      +84
```
Continue to review full report at Codecov.
Where do I send the pizza & beer to get this merged?
Great, looks good to me! Looking forward to the post-PR upgrades as well!
```diff
-        global_query_vectors_only_global /= math.sqrt(self.head_dim)
+        global_query_vectors_only_global /= torch.sqrt(
+            torch.tensor(
+                self.head_dim,
+                device=global_query_vectors_only_global.device,
+                dtype=global_query_vectors_only_global.dtype,
+            )
+        )
```
Out of curiosity, is the creation of a tensor + a PyTorch operation faster than the previous math operation on a tensor?
Thanks for noting that, I actually didn't time it and thought it would be nice to get rid of the math library.
From these tests on a GPU: https://colab.research.google.com/drive/1q830_JmCLyxmlQCTBxe3Rj2G3pZuniCY?usp=sharing it seems that in PyTorch it's actually slower to first create a tensor and use `torch.sqrt`. In TF it seems to be equally fast on GPU and a bit faster on CPU (in `tf.function` mode); in eager mode `math.sqrt` is faster.
So for PyTorch I will revert the change, and I guess I have to update my beliefs a bit about "doing everything in tensors".
Is it obvious to you @LysandreJik that using `math` is faster? I would have assumed that under the hood a tensor is created anyway... do you have an idea why `math` is faster?
Also cc @mfuntowicz, maybe you have a good opinion on that :-)
No, it isn't obvious to me at all! I would have thought that dividing a tensor by a Python value would create a tensor for the Python value, resulting in the same number of operations. I think it's cleaner to only use torch-like functions too, but from your benchmarking it seems that it's not always the case!
Thanks for diving into it, really useful!
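For reference, a minimal timing sketch of the two variants (tensor shape and iteration count are illustrative, not taken from the linked notebook; assumes a CUDA device):

```python
import math
import time

import torch

x = torch.randn(1024, 64, device="cuda")
head_dim = 64

def timeit(fn, n=1000):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Variant 1: divide by a Python float computed with math.sqrt
t_math = timeit(lambda: x / math.sqrt(head_dim))
# Variant 2: create a tensor on the fly and divide by torch.sqrt of it
t_torch = timeit(
    lambda: x / torch.sqrt(torch.tensor(head_dim, device=x.device, dtype=x.dtype))
)
print(f"math.sqrt: {t_math:.4f}s  torch.sqrt: {t_torch:.4f}s")
```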
```diff
-        if global_attention_mask is None:
+        if global_attention_mask is None and input_ids is not None:
```
(Curious) The `global_attention_mask` doesn't need to be computed when passing in `inputs_embeds`?
Good catch! In this case it cannot really be set automatically, since it's hard to trace back where the `sep_token_id` was in `input_embeds` -> will change it and display a warning in this case.
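A sketch of what that change might look like (hypothetical; the function name and the "global attention on everything before the first sep" rule are illustrative stand-ins, not the PR's actual code):

```python
import logging
from typing import Optional

import torch

logger = logging.getLogger(__name__)

def infer_global_attention_mask(
    input_ids: Optional[torch.Tensor],
    global_attention_mask: Optional[torch.Tensor],
    sep_token_id: int,
) -> Optional[torch.Tensor]:
    if global_attention_mask is None and input_ids is not None:
        # The sep token can be located in input_ids: mark tokens before it as global.
        before_first_sep = torch.cumsum((input_ids == sep_token_id).long(), dim=1) == 0
        global_attention_mask = before_first_sep.long()
    elif global_attention_mask is None:
        # With only input_embeds there is no way to locate the sep token, so warn instead.
        logger.warning(
            "Cannot infer global_attention_mask from input_embeds; "
            "pass it explicitly if global attention is desired."
        )
    return global_attention_mask
```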
Let me know if you'd prefer that I do the `return_dict` part. I think it would be good to have it in the PR before merging, but I can push it on this branch.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
I'm curious whether you think Longformer should be added to the `language_generation_model`?
@sgugger - I added the `return_dict` part.
Thanks a lot for the changes, @patrickvonplaten!
We will add it to the EncoderDecoderModel framework, where it can be used with …
LGTM. Left some nits. Nice tests!
```diff
-    This class overrides :class:`~transformers.RobertaModel` to provide the ability to process
-    long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer
-    <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention
+    This class copied code from :class:`~transformers.RobertaModel` and overwrote standard self-attention with longformer self-attention to provide the ability to process
```
(nittiest of all nits) Present-tense verbs for docstrings: "copies", "overwrites". The class is a thing that exists in the present, at least from the reader's perspective.
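Applied to the line in question, the suggestion would read roughly as follows (illustrative wording only):

```python
"""
This class copies code from :class:`~transformers.RobertaModel` and overwrites standard
self-attention with longformer self-attention to provide the ability to process long sequences.
"""
```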
This is great. Thanks, @patrickvonplaten.
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
This PR adds Longformer.
In a first step, it is made sure that the code is clean and that all tests pass.
ToDo List:
ToDo after PR is merged:
For Review
This PR adds `TFLongformer` and the two most important classes with heads, `TFLongformerForMaskedLM` and `TFLongformerForQuestionAnswering`. Many tests are added to verify that TFLongformer gives identical results to PT Longformer, and a colab notebook (see below) is attached to show performance on a real task.

Below you can find a benchmark showing that TFLongformer is about 1.5x slower than PT on GPU. For now this is acceptable IMO, but in a future PR I want to take a deeper look at how the TF code can be optimized and also solve a problem that currently exists with TF XLA.
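As a quick illustration of the new classes (a minimal sketch; the checkpoint name is taken from the model hub link below, and the example input is made up):

```python
from transformers import LongformerTokenizer, TFLongformerForMaskedLM

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = TFLongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Inputs are padded internally to a multiple of the local attention window.
inputs = tokenizer("Paris is the <mask> of France.", return_tensors="tf")
outputs = model(inputs)
logits = outputs[0]  # (batch_size, sequence_length, vocab_size)
```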
I spent a lot of time trying to solve this issue: #5815 for TFLongformer and didn't manage to find a good solution. The corresponding tests are in `SLOW` mode, so they won't fail on this PR. Since we are currently thinking about a better solution than using `cast_bool_to_primitive` to solve the known TF graph boolean error, I think I will leave this small bug in TFLongformer for now (it's quite an edge case IMO anyway).

Docs are added and checked, comments are added, performance on TriviaQA is verified in a TF colab: https://colab.research.google.com/drive/1UmU3T1nPmJ2LgXQtPcaEVtXUnoBJ1onF?usp=sharing and TF weights were added to all longformer models here: https://huggingface.co/models?search=longformer.
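For the TF XLA problem mentioned above, a minimal way to exercise the failing path might look like this (hypothetical repro, not from the PR; `experimental_compile` was the XLA flag in the TF 2.x versions of that time):

```python
import tensorflow as tf
from transformers import LongformerTokenizer, TFLongformerForMaskedLM

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = TFLongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
inputs = tokenizer("A long document ...", return_tensors="tf")

# Compile the forward pass with XLA; this is the path reported to fail on GPU.
@tf.function(experimental_compile=True)
def xla_forward(input_ids, attention_mask):
    return model(input_ids, attention_mask=attention_mask)

outputs = xla_forward(inputs["input_ids"], inputs["attention_mask"])
```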
Would be happy about a review @jplu @ibeltagy @LysandreJik @sshleifer @sgugger @julien-c @thomwolf