
Make TF pt-tf equivalence test more aggressive #15839

Merged · 19 commits merged into huggingface:master from the aggressive_pt_tf_equiv_test branch on Mar 14, 2022

Conversation

@ydshieh (Collaborator) commented on Feb 26, 2022:

What does this PR do?

Make TF pt-tf equivalence test more aggressive.

After the series of fixes done so far, I think it is a good time to include this more aggressive testing on the master branch.
(Otherwise, newly added models might have undetected issues. For example, the recent TFConvNextModel had an issue where hidden_states were not transposed to match the PyTorch version; I tested it on my local branch and informed the author so it could be fixed.)
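For illustration only (this is not code from the PR, and the shapes are made up): TF vision models typically keep activations in channels-last (NHWC) layout, while the PyTorch reference exposes channels-first (NCHW) hidden states, so the TF hidden states need a transpose before the two can be compared element-wise.

import tensorflow as tf

# Hypothetical TF hidden state in channels-last layout: (batch, height, width, channels)
nhwc_hidden_state = tf.random.normal((2, 7, 7, 768))
# Transpose to channels-first to match the PyTorch layout: (batch, channels, height, width)
nchw_hidden_state = tf.transpose(nhwc_hidden_state, perm=(0, 3, 1, 2))
print(nchw_hidden_state.shape)  # (2, 768, 7, 7)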

There are still 3 categories of PT/TF inconsistency to address, but they are less urgent in my opinion; see below.
Currently, the test makes a few exceptions to skip these 3 cases (in order to keep the test green), and I have added TODO comments in the code.

TF: @Rocketknight1 @gante
Test: @LysandreJik @sgugger

TODO in separate PRs

@HuggingFaceDocBuilder: The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ydshieh ydshieh marked this pull request as ready for review February 28, 2022 11:29
@ydshieh ydshieh changed the title [WIP] Make TF pt-tf equivalence test more aggressive Make TF pt-tf equivalence test more aggressive Feb 28, 2022
@gante (Member) left a comment:

thorough tests <3 and +1 for leaving the plan as comments, makes it easier to review.

@Rocketknight1 (Member) left a comment:

Overall, this looks good. If I understand correctly, it makes the following major changes:

  1. Much smaller tolerances for differences between PT + TF outputs

    Yes

  2. Verifies that all output keys are the same across both models.

    Yes

  3. Crossloading is done in-memory instead of saving and loading a checkpoint:

    Yes (it was done this way previously)

Is that correct? Are there any other important parts that I missed?

It's correct, and no, you didn't miss anything. One additional remark: we now test 2 cases, labels passed to the model and labels not passed to the model.
(Previously, we only tested the case without passing labels.)
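For illustration only (not the test suite's actual code): a minimal, self-contained sketch of comparing PT and TF logits both with and without labels, using a tiny randomly initialized BERT classification model and a temporary checkpoint for cross-loading. The model choice, config values, and input shapes here are assumptions made for the sketch.

import tempfile

import numpy as np
import tensorflow as tf
import torch

from transformers import BertConfig, BertForSequenceClassification, TFBertForSequenceClassification

config = BertConfig(
    vocab_size=99,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=37,
    num_labels=3,
)
pt_model = BertForSequenceClassification(config).eval()

# Cross-load the randomly initialized PT weights into the TF model via a temporary checkpoint.
with tempfile.TemporaryDirectory() as tmp_dir:
    pt_model.save_pretrained(tmp_dir)
    tf_model = TFBertForSequenceClassification.from_pretrained(tmp_dir, from_pt=True)

input_ids = np.random.randint(0, config.vocab_size, size=(2, 7))
labels = np.random.randint(0, config.num_labels, size=(2,))

for with_labels in (False, True):
    pt_inputs = {"input_ids": torch.tensor(input_ids, dtype=torch.long)}
    tf_inputs = {"input_ids": tf.constant(input_ids, dtype=tf.int32)}
    if with_labels:
        pt_inputs["labels"] = torch.tensor(labels, dtype=torch.long)
        tf_inputs["labels"] = tf.constant(labels, dtype=tf.int32)
    with torch.no_grad():
        pt_logits = pt_model(**pt_inputs).logits.numpy()
    tf_logits = tf_model(tf_inputs).logits.numpy()
    # The PR asserts a max absolute difference of at most 1e-5; here we just report it.
    max_diff = np.amax(np.abs(tf_logits - pt_logits))
    print(f"with_labels={with_labels}: max logits diff = {max_diff}")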

tfo[pt_nans] = 0

max_diff = np.amax(np.abs(tfo - pto))
self.assertLessEqual(max_diff, 1e-5)
Member:

Will we need some kind of relative tolerance here? 1e-5 is a small allowable difference for potentially large values!

@ydshieh (Collaborator, Author):

Probably, yes. Currently I am confident that 1e-5 works for our current testing configs/models:
the weights are initialized with std 0.02, and with this setting the output values won't get too large.
(Lysandre told me that once we move to GPU testing and fp16 precision testing, we might need to deal with larger errors.)

Member:

Yes, would love to know if this passes on GPU as it's typically tougher on small differences.

The point regarding fp16 was specifically about bf16: models trained with bf16 typically have much larger logits, so using an absolute difference is not ideal; a relative difference should be used instead.

Here we do the init ourselves, so it doesn't apply.
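For illustration, a minimal sketch (not the PR's code) of how a relative or mixed tolerance could be checked if the absolute 1e-5 ever becomes too strict for large logits; the helper name and epsilon value are assumptions.

import numpy as np

def max_relative_diff(tf_array: np.ndarray, pt_array: np.ndarray, eps: float = 1e-9) -> float:
    # Scale each absolute difference by the magnitude of the reference (PT) value.
    return float(np.amax(np.abs(tf_array - pt_array) / (np.abs(pt_array) + eps)))

# Alternatively, numpy's combined check |a - b| <= atol + rtol * |b| can be used directly:
# np.testing.assert_allclose(tf_array, pt_array, rtol=1e-4, atol=1e-5)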

@LysandreJik (Member) left a comment:

Thank you for your time spent making this better! Two questions:

  • Does it run on GPU?
  • How long does it take to run? Our test suite is already taking a significant time to run, so aiming for common tests that run as fast as possible is important.

There's quite a bit of model-specific logic which I'm not particularly enthusiastic about (pre-training models + convnext), but I understand why it's rigorous to do it like that here.

@ydshieh (Collaborator, Author) commented on Feb 28, 2022:

Thank you for your time spent making this better! Two questions:

* Does it run on GPU?

For now, I haven't tested it on GPU. I can run it on the office's GPU machine this week (a good chance to learn how to connect to those machines!).

* How long does it take to run? Our test suite is already taking a significant time to run, so aiming for common tests that run as fast as possible is important.

Let me measure the timing of this test on the current master and on this PR; I will report the results.

There's quite a bit of model-specific logic which I'm not particularly enthusiastic about (pre-training models + convnext), but I understand why it's rigorous to do it like that here.

(Yeah, once we fix all the inconsistencies, we can remove all these exceptional conditions.)

@sgugger (Collaborator) left a comment:

Thanks for writing those new tests!

@ydshieh (Collaborator, Author) commented on Mar 1, 2022:

Good news! Testing on a single GPU with the small tolerance 1e-5 still works! All models pass test_pt_tf_model_equivalence.
(I had forgotten to install the CUDA driver, and I'm glad I double-checked and fixed this silly mistake :-) )

I will address a few of the style review suggestions. After that, I think it's ready to merge?

@LysandreJik (Member) commented on Mar 1, 2022:

It's weird that it took more time when you expected it to take less, no? Can you try running the test suite with --durations=0 to see all the tests and how long each of them took to run?
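For reference, one illustrative way to get per-test durations (the paths and test-name filter below are assumptions, not a command from this thread) is pytest's --durations flag, invoked here via the Python entry point:

import pytest

# Report the duration of every collected test (0 = no limit), filtered to the PT/TF equivalence tests.
pytest.main([
    "tests",
    "-k", "pt_tf_model_equivalence",
    "--durations=0",
])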

@ydshieh ydshieh marked this pull request as draft March 3, 2022 15:41
@ydshieh ydshieh changed the title Make TF pt-tf equivalence test more aggressive [WIP] Make TF pt-tf equivalence test more aggressive Mar 3, 2022
@ydshieh ydshieh force-pushed the aggressive_pt_tf_equiv_test branch from 31fdfdc to 32fb03d Compare March 4, 2022 17:08
@ydshieh ydshieh changed the title [WIP] Make TF pt-tf equivalence test more aggressive Make TF pt-tf equivalence test more aggressive Mar 4, 2022
@ydshieh ydshieh marked this pull request as ready for review March 4, 2022 17:47
@ydshieh (Collaborator, Author) commented on Mar 5, 2022:

After a thorough verification and a few fixes, this PR is ready (again) from my side.

I would love @sgugger (and @LysandreJik when he is back) to check it again, and @gante & @Rocketknight1 if they want (no particular TF-related change compared to the last commit).

The following summary might save some re-review time:

  • Main fixes (since the last review):

    • In the last commit, the following line forgot to use to_tuple(), and the previous version of check_outputs only dealt with tuples or tensors, so half of the test cases passed just because no check was performed at all (my bad ...).

      check_outputs(tfo.to_tuple(), pto.to_tuple(), model_class, names=tf_keys)

      This is fixed now, and check_outputs now raises an error if a result is not tested (i.e. if its type is not in [tuple, list, tensor]); a sketch of this recursive check is shown right after this list.

    • In the last commit, I accidentally deleted the following test case (checking the results after saving/loading checkpoints, which exists on the current master).

      # Check we can load pt model in tf and vice-versa with checkpoint => model functions

      This is added back now.

    • Make the test also run on GPU (put models/inputs on the correct device)

  • Style changes

    • Since I had to add back the accidentally deleted test cases, I added check_pt_tf_models to avoid duplicating a large code block.
    • tfo -> tf_outputs and pto -> pt_outputs
    • No need to use np.copy here, as discussed in a comment.
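For illustration, a minimal sketch of such a recursive comparison (a hypothetical helper, simplified relative to the PR's actual check_outputs) that raises on unexpected output types instead of silently skipping them:

import numpy as np
import tensorflow as tf


def check_outputs(tf_outputs, pt_outputs, tol=1e-5):
    # Recurse into tuples/lists so every nested tensor is actually compared.
    if isinstance(tf_outputs, (tuple, list)):
        assert len(tf_outputs) == len(pt_outputs)
        for tf_item, pt_item in zip(tf_outputs, pt_outputs):
            check_outputs(tf_item, pt_item, tol=tol)
    elif isinstance(tf_outputs, tf.Tensor):
        tf_array = tf_outputs.numpy()
        pt_array = pt_outputs.detach().cpu().numpy()
        diff = np.abs(tf_array - pt_array)
        # Ignore positions where both frameworks produce NaN; a NaN on only one
        # side keeps the difference as NaN and therefore fails the assertion.
        diff[np.isnan(tf_array) & np.isnan(pt_array)] = 0
        assert np.amax(diff) <= tol, f"max difference {np.amax(diff)} > {tol}"
    else:
        # Raise instead of passing silently, so no result type goes untested.
        raise ValueError(f"Unsupported output type: {type(tf_outputs)}")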

I ran this new version of the test, and with the small tolerance 1e-5, it passes both on CPU and GPU 🎉!

To be extra sure, I also ran it 100 times on CPU/GPU (very aggressive 🔥 🔥)! All models passed this test, except for:

  • TFSpeech2TextModel: this is addressed in Set scale_embedding to False in some TF tests #15952
  • (TFHubertModel): there was 1 run among 300 on CPU where I got a diff > 1e-5. I ran it 10,000 more times and still got only 1 such occurrence. (I guess it's OK to keep 1e-5 in this case.)

Regarding the running time, I will measure it and post the results in the next comment.

@ydshieh (Collaborator, Author) commented on Mar 5, 2022:

In terms of running time:

  • Circle CI

  • GCP VM (with -n 1 + CPU)

    • current : 178.75s
    • this PR : 200.81s

These suggest an increase of roughly 10-12% ((200.81 - 178.75) / 178.75 ≈ 12.3% on the GCP CPU run).

@sgugger (Collaborator) left a comment:
Still good to me! Left two nits.

Comment on lines 552 to 562 of tests/test_modeling_tf_common.py:
# (this will fail for `TFWav2Vec2Model`)
# attention_mask = tf.concat(
# [
# tf.zeros_like(attention_mask[:1], dtype=tf.int32),
# tf.cast(attention_mask[1:], dtype=tf.int32)
# ],
# axis=0
# )
Collaborator:

Should this be cleaned up then?

@ydshieh (Collaborator, Author) commented on Mar 7, 2022:

This block should be activated in the future:

  1. We want to clean up (as a next step) the large negative values like 1e-9 etc. used for attention masks.
  2. Currently, this block would fail due to those values differing across models/frameworks.
  3. Once 1. is done, we should enable this block to make sure there is no regression and that new models behave the same way as existing ones.
  4. (If you prefer, I can remove this block in this PR and add it back once we are ready for it.)

(I probably phrased this the wrong way before: this block is intended to be kept (and eventually enabled) rather than removed, sorry.)

Collaborator:

If the plan is to bring it back, by all means leave it! But make it clear in the comment :-)

@ydshieh (Collaborator, Author):

Yes, Sir!

Done as follows:

# make sure no all 0s attention masks - to avoid failure at this moment.
# TODO: remove this line once the TODO below is implemented.
attention_mask = tf.ones_like(attention_mask, dtype=tf.int32)
# Here we make the first sequence with all 0s as attention mask.
# Currently, this will fail for `TFWav2Vec2Model`. This is caused by the different large negative
# values, like `1e-4`, `1e-9`, `1e-30` and `-inf` for attention mask across models/frameworks.
# TODO: enable this block once the large negative values thing is cleaned up.
# (see https://github.com/huggingface/transformers/issues/14859)

(I also changed the remaining tfo/pto occurrences to tf_output/pt_output.)

@ydshieh (Collaborator, Author) commented on Mar 8, 2022:

I think I made a mistake: I first thought TF 2.8 doesn't work with CUDA 11.4 (installed on my GCP VM machine), but actually the K80 GPU doesn't work well with Ubuntu 20.04 (regarding the drivers and CUDA), so the test falls back to CPU instead. I will fix this and re-run the tests.

@ydshieh (Collaborator, Author) commented on Mar 11, 2022:

After switching to a Tesla T4 GPU (the same as used for the CI GPU testing), I confirmed that this aggressive PT/TF equivalence test passes with 1e-5 when running on GPU!

I ran this test 1000 times (on GPU) for each TF model.

I think it's ready, if @LysandreJik is happy with the slightly(?) increased running time.

(I saved all the differences. I can share the values if you would like to see them!)

@LysandreJik (Member) left a comment:

Sounds good to me, thank you for working on it @ydshieh!

@ydshieh ydshieh force-pushed the aggressive_pt_tf_equiv_test branch from a6c7157 to 2e43334 Compare March 14, 2022 11:11
@ydshieh ydshieh merged commit 923c35b into huggingface:master Mar 14, 2022
@ydshieh ydshieh deleted the aggressive_pt_tf_equiv_test branch May 5, 2022 10:29