
Fix TF Causal LM models' returned logits #15256

Merged
merged 17 commits into master from fix_tf_causal_lm_returned_logits
Feb 1, 2022

Conversation

ydshieh
Collaborator

@ydshieh ydshieh commented Jan 20, 2022

What does this PR do?

Fix TF causal LM models' returned logits

Details

In TF causal LM models, the returned logits are the truncated ones, because the same variable is overwritten before the loss computation:

        if inputs["labels"] is not None:
            # shift labels to the left and cut last logit token
            **logits = logits[:, :-1]**
            labels = inputs["labels"][:, 1:]
            loss = self.hf_compute_loss(labels=labels, logits=logits)

        return TFCausalLMOutputWithPast(
            loss=loss,
            **logits=logits,**

For PyTorch causal LM models, on the other hand, the original full-length logits are returned:

        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()  # <-- the shift happens on a separate variable
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        return CausalLMOutputWithPast(
            loss=loss,
            logits=lm_logits,  # <-- the original, full-length logits are returned

This PR fixes this inconsistency and adds test cases where `labels` is passed to the PT/TF equivalence test.
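A minimal sketch of the corrected behaviour (illustrative only -- the function and variable names below are assumptions, not the actual code from this PR): the loss is still computed on shifted copies, while the full-length logits are what gets returned, matching PyTorch.

    import tensorflow as tf

    def causal_lm_loss_and_logits(logits, labels):
        # compute the loss on shifted copies, without overwriting `logits`
        shifted_logits = logits[:, :-1, :]   # drop the last position
        shifted_labels = labels[:, 1:]       # drop the first label
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        loss = loss_fn(shifted_labels, shifted_logits)
        return loss, logits                  # return the full-length logits, as in PyTorch

    # toy shapes: batch of 2, sequence length 5, vocab size 7
    logits = tf.random.normal((2, 5, 7))
    labels = tf.random.uniform((2, 5), maxval=7, dtype=tf.int32)
    loss, returned_logits = causal_lm_loss_and_logits(logits, labels)
    print(returned_logits.shape)  # (2, 5, 7), not cut to (2, 4, 7)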

@HuggingFaceDocBuilder

HuggingFaceDocBuilder commented Jan 20, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh changed the title from "[WIP] Fix TF Causal LM models' returned logits" to "Fix TF Causal LM models' returned logits" on Jan 22, 2022
@ydshieh ydshieh marked this pull request as ready for review January 22, 2022 09:44
@LysandreJik
Member

Could @Rocketknight1 or @gante give this a look?

@Rocketknight1
Member

To summarize: In causal LM models, logits and labels are offset by 1 before loss computations. In PT, the pre-shift logits are returned in the output, but in TF the post-shift logits are returned instead. This seems like a huge bug - I don't understand how several tf-pt equivalence tests didn't fail before, and how things like generate() didn't completely fail as a result too. I need to investigate the code more deeply to understand how this ever worked.

@Rocketknight1
Member

Ah, actually, generate() would not pass labels and so the shift would never happen.

@ydshieh
Collaborator Author

ydshieh commented Jan 24, 2022

> tf-pt equivalence tests

The tf-pt equivalence tests don't test with labels, as far as I know. This PR tries to add that too.

@Rocketknight1
Member

@ydshieh that makes sense! And yes, in most use-cases when people are passing labels, they're usually interested in the loss output and not the logits output. So that probably explains how this was invisible for so long. Great work!

Member

@gante gante left a comment


Looking solid! Let us know if you need a hand with failing tests.

Comment on lines 432 to 433
tf_nans = np.copy(np.isnan(tf_loss))
pt_nans = np.copy(np.isnan(pt_loss))
Member


Out of curiosity: any particular reason to use np.copy()?

Collaborator Author


It's copied from other places inside the same method. I don't see the necessity though.

Collaborator Author

@ydshieh ydshieh Jan 24, 2022


(Checked: `tf_nans`/`pt_nans` are not just views, so `np.copy` is not necessary here, I think.)

It is probably there because we modify the hidden states afterwards, which could affect the results of `np.isnan` if that call returned a view rather than a new array:

            tf_nans = np.copy(np.isnan(tf_hidden_states))
            pt_nans = np.copy(np.isnan(pt_hidden_states))

            pt_hidden_states[tf_nans] = 0
            tf_hidden_states[tf_nans] = 0
            pt_hidden_states[pt_nans] = 0
            tf_hidden_states[pt_nans] = 0

Member


Good point!

I believe it returns a copy. If we run the commands below, we can see that the output doesn't change when we modify the original array in place.

>>> import numpy as np
>>> a = np.asarray([np.nan, 1, 2, 3, np.nan])
>>> b = np.isnan(a)
>>> b
array([True, False, False, False,  True])
>>> a.sort()
>>> a
array([ 1.,  2.,  3., nan, nan])
>>> b
array([True, False, False, False,  True])

That being said, it should be safe to remove the copy. If we get errors from removing the copy, then we can add this quirk as a comment.

Collaborator Author


I agree - thanks for testing, @gante! If you and @Rocketknight1 are OK with it, I can change this in this PR (it is outside this PR's original scope, but I'm fine with including it).

Collaborator Author


I removed np.copy, and the test is fine :)

Comment on lines 440 to 443
# `TFFunnelForTokenClassification` (and potentially other TF token classification models) give
# large difference (up to 0.1x). PR #15294 addresses this issue.
# There is also an inconsistency between PT/TF `XLNetLMHeadModel`.
# Before these issues are fixed & merged, set a higher threshold here to pass the test.
Member


For this reason, I think we should merge this PR last (in your batch of PRs). That way, we can set a low comparison threshold :)

Collaborator Author


The failing tests seem unrelated to this PR, but we can wait a bit - I will rebase in the next two days to see if everything is OK.

@ydshieh ydshieh force-pushed the fix_tf_causal_lm_returned_logits branch from b4afeb8 to d3c28dd on January 25, 2022 07:25
@ydshieh
Collaborator Author

ydshieh commented Jan 25, 2022

I am making this PR a draft again, because there are other potential PT/TF inconsistencies that are still uncaught.
(The test inputs are randomly generated, so we don't get the same outputs on each run.)

Here is the failing output from the test:

                # Some models require extra condition to return loss. For example, `BertForPreTraining` requires both
                # `labels` and `next_sentence_label`.
                # Moreover, some PT models return loss while the corresponding TF/Flax models don't.
                if tf_loss is not None and pt_loss is not None:
    
                    tf_loss = tf.math.reduce_mean(tf_loss).numpy()
                    pt_loss = pt_loss.numpy()
    
                    tf_nans = np.copy(np.isnan(tf_loss))
                    pt_nans = np.copy(np.isnan(pt_loss))
                    # the 2 losses need to be both nan or both not nan
                    # (`TapasForQuestionAnswering` gives nan loss here)
>                   self.assertEqual(tf_nans, pt_nans)
E                   AssertionError: array(False) != array(True)

@ydshieh ydshieh marked this pull request as draft January 25, 2022 08:54
@gante
Member

gante commented Jan 25, 2022

> I am making this PR a draft again, because there are other potential PT/TF inconsistencies that are still uncaught. (The test inputs are randomly generated, so we don't get the same outputs on each run.)

Having random tests might actually uncover interesting corner cases, but failures risk being overlooked if we can't get the random seed back 🤔 @sgugger , what's our go-to policy with respect to tests -- as deterministic as possible, or is randomness embraced?

@sgugger
Collaborator

sgugger commented Jan 25, 2022

It would be best to keep those random indeed.

@ydshieh
Collaborator Author

ydshieh commented Jan 25, 2022

> It would be best to keep those random indeed.

We can discuss this in another PR. The issues I can identify here are more deterministic -- one of them comes from a sequence whose attention mask is all zeros (even though it was randomly generated in this case).

So for that particular case, we can just add an extra, deterministic test case with such an attention mask, for example as sketched below.
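A rough, hypothetical sketch of such a deterministic corner case (the names below are illustrative, not the actual test code):

    import numpy as np

    # Alongside the randomly generated inputs, always include one sequence
    # whose attention mask is entirely zero, so the nan-loss corner case is
    # exercised on every run.
    batch_size, seq_len, vocab_size = 4, 8, 100
    rng = np.random.default_rng()
    input_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
    attention_mask[0, :] = 0  # the first sequence attends to nothing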

See #15326 if you are interested - but it is not TF related.

@patrickvonplaten
Contributor

Wow, great catch @ydshieh!

@ydshieh ydshieh force-pushed the fix_tf_causal_lm_returned_logits branch from a36ed62 to 6a967f1 on January 31, 2022 21:20
@ydshieh ydshieh marked this pull request as ready for review January 31, 2022 21:21
@ydshieh
Collaborator Author

ydshieh commented Jan 31, 2022

Hi, I reverted the change in test_pt_tf_model_equivalence - it will take some time to fix the issues I found, and I don't want this PR to be blocked by the new version of the test. If that's OK with you, we can merge this PR for now. Thank you!

@gante gante merged commit dc05dd5 into huggingface:master Feb 1, 2022
@gante
Member

gante commented Feb 1, 2022

As always, thank you @ydshieh, this was a great contribution. Will you be working on the improved equivalence test? (If not, I might pick it up :) )

@ydshieh
Collaborator Author

ydshieh commented Feb 1, 2022

> As always, thank you @ydshieh, this was a great contribution. Will you be working on the improved equivalence test? (If not, I might pick it up :) )

I already have a version (the one I used for hunting these bugs :-) ). I will definitely open a PR to include it once the other issues are fixed (so the test won't fail).

@ydshieh
Collaborator Author

ydshieh commented Feb 1, 2022

Basically, the new version checks recursively:

tuple --> check each element inside it
tensor --> check the shape and the maximum absolute difference of the tensor elements

A common issue is that the tuples returned by PT/TF have different lengths. Another is that the tensors have different shapes.
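A rough sketch of the idea (the names and tolerance below are assumptions, not the test that was later merged):

    import numpy as np

    def check_pt_tf_outputs(tf_output, pt_output, tol=1e-5):
        # recursive PT/TF output comparison, as described above
        if isinstance(tf_output, (tuple, list)):
            # tuples must have the same length, then each element is compared
            assert len(tf_output) == len(pt_output), "different number of outputs"
            for tf_item, pt_item in zip(tf_output, pt_output):
                check_pt_tf_outputs(tf_item, pt_item, tol=tol)
        else:
            tf_array = np.asarray(tf_output)
            pt_array = pt_output.detach().numpy() if hasattr(pt_output, "detach") else np.asarray(pt_output)
            # tensors must have the same shape, then the max absolute difference is checked
            assert tf_array.shape == pt_array.shape, "different shapes"
            assert np.max(np.abs(tf_array - pt_array)) <= tol, "values differ beyond tolerance"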

If you want to go ahead, I am fine with it though :-)

@gante
Member

gante commented Feb 1, 2022

No, by all means, we are very grateful that you are working on it ❤️

gante pushed a commit to gante/transformers that referenced this pull request Feb 1, 2022
* Fix TF Causal LM models' returned logits

* Fix expected shape in the tests

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
@ydshieh ydshieh deleted the fix_tf_causal_lm_returned_logits branch May 5, 2022 10:34