
Fix TF Causal LM models' returned logits #15256

Merged
merged 17 commits into master from fix_tf_causal_lm_returned_logits
Feb 1, 2022

Conversation

ydshieh
Collaborator

@ydshieh ydshieh commented Jan 20, 2022

What does this PR do?

Fix TF causal LM models' returned logits

Details

In TF causal LM models, the returned logits are the truncated ones, because the same variable is overwritten before the loss computation:

        if inputs["labels"] is not None:
            # shift labels to the left and cut last logit token
            **logits = logits[:, :-1]**
            labels = inputs["labels"][:, 1:]
            loss = self.hf_compute_loss(labels=labels, logits=logits)

        return TFCausalLMOutputWithPast(
            loss=loss,
            **logits=logits,**

For PyTorch causal LM models, on the other hand, the original full-length logits are returned:

        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()  # <-- the shift happens on a separate variable
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        return CausalLMOutputWithPast(
            loss=loss,
            logits=lm_logits,  # <-- the original, full-length logits are returned

This PR fixes this inconsistency and adds test cases where `labels` is passed to the PT/TF equivalence test.
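A minimal sketch of the corrected behaviour (illustrative only -- the function and variable names below are assumptions, not the actual code from this PR): the loss is still computed on shifted copies, while the full-length logits are what gets returned, matching PyTorch.

    import tensorflow as tf

    def causal_lm_loss_and_logits(logits, labels):
        # compute the loss on shifted copies, without overwriting `logits`
        shifted_logits = logits[:, :-1, :]   # drop the last position
        shifted_labels = labels[:, 1:]       # drop the first label
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        loss = loss_fn(shifted_labels, shifted_logits)
        return loss, logits                  # return the full-length logits, as in PyTorch

    # toy shapes: batch of 2, sequence length 5, vocab size 7
    logits = tf.random.normal((2, 5, 7))
    labels = tf.random.uniform((2, 5), maxval=7, dtype=tf.int32)
    loss, returned_logits = causal_lm_loss_and_logits(logits, labels)
    print(returned_logits.shape)  # (2, 5, 7), not cut to (2, 4, 7)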

@HuggingFaceDocBuilder

HuggingFaceDocBuilder commented Jan 20, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh changed the title from "[WIP] Fix TF Causal LM models' returned logits" to "Fix TF Causal LM models' returned logits" on Jan 22, 2022
@ydshieh ydshieh marked this pull request as ready for review January 22, 2022 09:44
@LysandreJik
Member

Could @Rocketknight1 or @gante give this a look?

@Rocketknight1
Member

To summarize: In causal LM models, logits and labels are offset by 1 before loss computations. In PT, the pre-shift logits are returned in the output, but in TF the post-shift logits are returned instead. This seems like a huge bug - I don't understand how several tf-pt equivalence tests didn't fail before, and how things like generate() didn't completely fail as a result too. I need to investigate the code more deeply to understand how this ever worked.

@Rocketknight1
Member

Ah, actually, generate() would not pass labels and so the shift would never happen.

@ydshieh
Collaborator Author

ydshieh commented Jan 24, 2022

> tf-pt equivalence tests

The tf-pt equivalence tests don't test with labels, as far as I know. This PR tries to add that too.

@Rocketknight1
Member

@ydshieh that makes sense! And yes, in most use-cases when people are passing labels, they're usually interested in the loss output and not the logits output. So that probably explains how this was invisible for so long. Great work!

Member

@gante gante left a comment


Looking solid! Let us know if you need a hand with failing tests.

Comment on lines 432 to 433
tf_nans = np.copy(np.isnan(tf_loss))
pt_nans = np.copy(np.isnan(pt_loss))
Member


Out of curiosity: any particular reason to use np.copy()?

Collaborator Author


It's copied from other places inside the same method. I don't see the necessity though.

Collaborator Author

@ydshieh ydshieh Jan 24, 2022


(Checked: `tf_nans`/`pt_nans` are not just views, so `np.copy` is not necessary here, I think.)

It is probably there because we modify the hidden states afterwards, which could affect the results of `np.isnan` if that call returned a view rather than a new array:

            tf_nans = np.copy(np.isnan(tf_hidden_states))
            pt_nans = np.copy(np.isnan(pt_hidden_states))

            pt_hidden_states[tf_nans] = 0
            tf_hidden_states[tf_nans] = 0
            pt_hidden_states[pt_nans] = 0
            tf_hidden_states[pt_nans] = 0

Member


Good point!

I believe it returns a copy. If we run the commands below, we can see that the output doesn't change when we modify the original array in place.

>>> import numpy as np
>>> a = np.asarray([np.nan, 1, 2, 3, np.nan])
>>> b = np.isnan(a)
>>> b
array([True, False, False, False,  True])
>>> a.sort()
>>> a
array([ 1.,  2.,  3., nan, nan])
>>> b
array([True, False, False, False,  True])

That being said, it should be safe to remove the copy. If we get errors from removing the copy, then we can add this quirk as a comment.

Collaborator Author


I agree - thanks for testing, @gante! If you and @Rocketknight1 are OK with it, I can change this in this PR (it is outside this PR's original scope, but I'm fine with including it).

Collaborator Author


I removed np.copy, and the test is fine :)

Comment on lines 440 to 443
# `TFFunnelForTokenClassification` (and potentially other TF token classification models) give
# large difference (up to 0.1x). PR #15294 addresses this issue.
# There is also an inconsistency between PT/TF `XLNetLMHeadModel`.
# Before these issues are fixed & merged, set a higher threshold here to pass the test.
Member


For this reason, I think we should merge this PR last (in your batch of PRs). That way, we can set a low comparison threshold :)

Collaborator Author


The failing tests seem unrelated to this PR, but we can wait a bit - I will rebase in the next two days to see if everything is OK.

@ydshieh ydshieh force-pushed the fix_tf_causal_lm_returned_logits branch from b4afeb8 to d3c28dd on January 25, 2022 07:25
@ydshieh
Collaborator Author

ydshieh commented Jan 25, 2022

I am making this PR a draft again, because there are other potential PT/TF inconsistencies that are still uncaught.
(The test inputs are randomly generated, so we don't get the same outputs on each run.)

Here is the failing output from the test:

                # Some models require extra condition to return loss. For example, `BertForPreTraining` requires both
                # `labels` and `next_sentence_label`.
                # Moreover, some PT models return loss while the corresponding TF/Flax models don't.
                if tf_loss is not None and pt_loss is not None:
    
                    tf_loss = tf.math.reduce_mean(tf_loss).numpy()
                    pt_loss = pt_loss.numpy()
    
                    tf_nans = np.copy(np.isnan(tf_loss))
                    pt_nans = np.copy(np.isnan(pt_loss))
                    # the 2 losses need to be both nan or both not nan
                    # (`TapasForQuestionAnswering` gives nan loss here)
>                   self.assertEqual(tf_nans, pt_nans)
E                   AssertionError: array(False) != array(True)

@ydshieh ydshieh marked this pull request as draft January 25, 2022 08:54
@gante
Member

gante commented Jan 25, 2022

> I am making this PR a draft again, because there are other potential PT/TF inconsistencies that are still uncaught. (The test inputs are randomly generated, so we don't get the same outputs on each run.)

Having random tests might actually uncover interesting corner cases, but failures risk being overlooked if we can't get the random seed back 🤔 @sgugger , what's our go-to policy with respect to tests -- as deterministic as possible, or is randomness embraced?

@sgugger
Collaborator

sgugger commented Jan 25, 2022

It would be best to keep those random indeed.

@ydshieh
Collaborator Author

ydshieh commented Jan 25, 2022

> It would be best to keep those random indeed.

We can discuss this in another PR. The issues I can identify here are more deterministic -- one of them comes from a sequence whose attention mask is all zeros (even though it was randomly generated in this case).

So for that particular case, we can just add an extra, deterministic test case with such an attention mask, for example as sketched below.
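A rough, hypothetical sketch of such a deterministic corner case (the names below are illustrative, not the actual test code):

    import numpy as np

    # Alongside the randomly generated inputs, always include one sequence
    # whose attention mask is entirely zero, so the nan-loss corner case is
    # exercised on every run.
    batch_size, seq_len, vocab_size = 4, 8, 100
    rng = np.random.default_rng()
    input_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
    attention_mask[0, :] = 0  # the first sequence attends to nothing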

See #15326 if you are interested - but it is not TF related.

@patrickvonplaten
Contributor

Wow, great catch @ydshieh!

@ydshieh ydshieh force-pushed the fix_tf_causal_lm_returned_logits branch from a36ed62 to 6a967f1 on January 31, 2022 21:20
@ydshieh ydshieh marked this pull request as ready for review January 31, 2022 21:21
@ydshieh
Collaborator Author

ydshieh commented Jan 31, 2022

Hi, I reverted the change in test_pt_tf_model_equivalence - it will take some time to fix the issues I found, and I don't want this PR to be blocked by the new version of the test. If that's OK with you, we can merge this PR for now. Thank you!

@gante gante merged commit dc05dd5 into huggingface:master Feb 1, 2022
@gante
Member

gante commented Feb 1, 2022

As always, thank you @ydshieh, this was a great contribution. Will you be working on the improved equivalence test? (If not, I might pick it up :) )

@ydshieh
Collaborator Author

ydshieh commented Feb 1, 2022

> As always, thank you @ydshieh, this was a great contribution. Will you be working on the improved equivalence test? (If not, I might pick it up :) )

I already have a version (the one I used for hunting these bugs :-) ). I will definitely open a PR to include it once the other issues are fixed (so the test won't fail).

@ydshieh
Collaborator Author

ydshieh commented Feb 1, 2022

Basically, the new version checks recursively:

tuple --> check each element inside it
tensor --> check the shape and the maximum absolute difference of the tensor elements

A common issue is that the tuples returned by PT/TF have different lengths. Another is that the tensors have different shapes.
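A rough sketch of the idea (the names and tolerance below are assumptions, not the test that was later merged):

    import numpy as np

    def check_pt_tf_outputs(tf_output, pt_output, tol=1e-5):
        # recursive PT/TF output comparison, as described above
        if isinstance(tf_output, (tuple, list)):
            # tuples must have the same length, then each element is compared
            assert len(tf_output) == len(pt_output), "different number of outputs"
            for tf_item, pt_item in zip(tf_output, pt_output):
                check_pt_tf_outputs(tf_item, pt_item, tol=tol)
        else:
            tf_array = np.asarray(tf_output)
            pt_array = pt_output.detach().numpy() if hasattr(pt_output, "detach") else np.asarray(pt_output)
            # tensors must have the same shape, then the max absolute difference is checked
            assert tf_array.shape == pt_array.shape, "different shapes"
            assert np.max(np.abs(tf_array - pt_array)) <= tol, "values differ beyond tolerance"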

If you want to go ahead, I am fine with it though :-)

@gante
Member

gante commented Feb 1, 2022

No, by all means, we are very grateful that you are working on it ❤️

gante pushed a commit to gante/transformers that referenced this pull request Feb 1, 2022
* Fix TF Causal LM models' returned logits

* Fix expected shape in the tests

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
@ydshieh ydshieh deleted the fix_tf_causal_lm_returned_logits branch May 5, 2022 10:34