Fix TF Causal LM models' returned logits #15256
Conversation
The documentation is not available anymore as the PR was closed or merged.
Could @Rocketknight1 or @gante give this a look?
To summarize: in causal LM models, logits and labels are offset by 1 before the loss computation. In PT, the pre-shift logits are returned in the output, but in TF the post-shift logits are returned instead. This seems like a huge bug - I don't understand how the several TF-PT equivalence tests didn't fail before, and how things like …
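For illustration, here is a minimal sketch of the shift logic being described (hypothetical code and names, not the actual transformers implementation): the loss is computed on the shifted tensors, but the unshifted logits are what should be returned, matching the PyTorch behaviour.

import tensorflow as tf

def causal_lm_forward(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    shifted_logits = logits[:, :-1, :]  # predictions for positions 0..n-2
    shifted_labels = labels[:, 1:]      # the target is the next token
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    loss = loss_fn(shifted_labels, shifted_logits)
    # The bug: the TF models returned `shifted_logits` here, whereas the
    # PyTorch models return the full, unshifted `logits`.
    return loss, logits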
Ah, actually, the tf-pt equivalence tests don't test with `labels`.
@ydshieh that makes sense! And yes, in most use cases when people are passing `labels` …
Looking solid! Let us know if you need a hand with failing tests.
tests/test_modeling_tf_common.py
tf_nans = np.copy(np.isnan(tf_loss))
pt_nans = np.copy(np.isnan(pt_loss))
Curiosity: any particular reason to use `np.copy()`?
It's copied from other places inside the same method. I don't see the necessity though.
(Checked: `tf_nans`/`pt_nans` are not just views, so `np.copy` is not necessary here, I think.)
It is probably there because we modify the hidden states afterwards, which could affect the results of `np.isnan` (depending on whether that call returns just a view or newly copied data):
tf_nans = np.copy(np.isnan(tf_hidden_states))
pt_nans = np.copy(np.isnan(pt_hidden_states))
# Zero out NaN positions from either framework in both outputs,
# so the subsequent comparison ignores them symmetrically.
pt_hidden_states[tf_nans] = 0
tf_hidden_states[tf_nans] = 0
pt_hidden_states[pt_nans] = 0
tf_hidden_states[pt_nans] = 0
Good point!
I believe it is returning a copy. If we run the commands below, we can see that the output doesn't change when we modify the original input array with an in-place operation.
>>> import numpy as np
>>> a = np.asarray([np.nan, 1, 2, 3, np.nan])
>>> b = np.isnan(a)
>>> b
array([ True, False, False, False,  True])
>>> a.sort()
>>> a
array([ 1.,  2.,  3., nan, nan])
>>> b
array([ True, False, False, False,  True])
That being said, it should be safe to remove the copy. If removing it causes errors, we can document this quirk in a comment.
I agree, thanks for testing, @gante! If you and @Rocketknight1 are OK with it, I can change it in this PR (it is not this PR's original intention, but I'm fine with including it).
I removed `np.copy`, and the test is fine :)
tests/test_modeling_tf_common.py
# `TFFunnelForTokenClassification` (and potentially other TF token classification models) gives
# large differences (up to 0.1x). PR #15294 addresses this issue.
# There is also an inconsistency between PT/TF `XLNetLMHeadModel`.
# Before these issues are fixed & merged, set a higher threshold here to pass the test.
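For reference, a minimal sketch (assumed, not the exact test code; the function and parameter names are hypothetical) of the kind of max-difference check this threshold feeds into:

import numpy as np

def assert_close(tf_output, pt_output, threshold=1e-5):
    # Compare the largest element-wise deviation against the tolerance.
    max_diff = np.amax(np.abs(np.asarray(tf_output) - np.asarray(pt_output)))
    assert max_diff < threshold, f"PT/TF outputs differ by up to {max_diff}"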
For this reason, I think we should merge this PR last (in your batch of PRs). That way, we can set a low comparison threshold :)
The failing tests seem unrelated to this PR. But we can wait a bit - I will rebase in the next 2 days to see if everything is OK.
Force-pushed from b4afeb8 to d3c28dd
I am marking this PR as a draft again, because there are other potential PT/TF inconsistencies that went uncaught. Here is the failing output in the test: …
Having random tests might actually uncover interesting corner cases, but failures risk being overlooked if we can't get the random seed back 🤔 @sgugger, what's our go-to policy with respect to tests -- as deterministic as possible, or is randomness embraced?
It would be best to keep those random indeed.
We can discuss this in another PR. Here the issues I can identify are deterministic -- one comes from a sequence whose attention mask is all zeros (although it was randomly generated in this case). For that particular case, we can just add an extra test case (with such an attention mask), as sketched below. See #15326 if you are interested -- but it is not TF related.
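A sketch of such an extra test case (hypothetical names, not the actual test code): the idea is simply to include one sequence whose attention mask is entirely zero.

import numpy as np

batch_size, seq_len, vocab_size = 2, 8, 100
input_ids = np.random.randint(0, vocab_size, size=(batch_size, seq_len))
attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
attention_mask[0, :] = 0  # deterministic corner case: sequence 0 attends to nothing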
Wow, great catch, @ydshieh!
Force-pushed from a36ed62 to 6a967f1
Hi, I reverted the change in …
As always, thank you @ydshieh, this was a great contribution. Will you be working on the improved equivalence test? (If not, I might pick it up :) )
I already have one version (the one used for hunting these bugs :-) ). I would definitely open a PR to include it once the other issues are fixed (so the test won't fail).
Basically, the new version checks recursively: for a tuple, it checks each element inside it. A common issue is that the tuples returned by PT/TF have different lengths; another is that the tensors have different shapes (a sketch of the idea follows below). If you want to go ahead, I am fine with it though :-)
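A minimal sketch of such a recursive checker (hypothetical; not the version that was later contributed, and all names here are assumptions):

import numpy as np

def check_equivalence(tf_out, pt_out, path="output", atol=1e-5):
    # Recursively compare nested PT/TF outputs, converted to numpy.
    if isinstance(tf_out, (tuple, list)):
        # A common failure mode: the returned tuples have different lengths.
        assert len(tf_out) == len(pt_out), f"{path}: lengths differ"
        for i, (t, p) in enumerate(zip(tf_out, pt_out)):
            check_equivalence(t, p, path=f"{path}[{i}]", atol=atol)
    else:
        t, p = np.asarray(tf_out), np.asarray(pt_out)
        # Another common failure mode: the tensors have different shapes.
        assert t.shape == p.shape, f"{path}: shapes {t.shape} vs {p.shape}"
        max_diff = np.amax(np.abs(t - p))
        assert max_diff < atol, f"{path}: max difference {max_diff}"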
No, by all means, we are very grateful that you are working on it ❤️
* Fix TF Causal LM models' returned logits
* Fix expected shape in the tests

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
Fixes TF causal LM models' returned logits.
Details
In TF causal LM models, the returned logits are the shifted (cut) ones, while PyTorch causal LM models return the original, unshifted logits. This PR fixes this inconsistency and also tests the case where `labels` is passed in the PT/TF equivalence test.