Make TF pt-tf equivalence test more aggressive #15839
Conversation
Thorough tests <3, and +1 for leaving the plan as comments - it makes this easier to review.
Overall, this looks good. If I understand correctly, it makes the following major changes:
- Much smaller tolerances for differences between PT + TF outputs
Yes
- Verifies that all output keys are the same across both models.
Yes
- Crossloading is done in-memory instead of saving and loading a checkpoint.
Yes (it was done this way previously)
Is that correct? Are there any other important parts that I missed?
It's correct, and no, you didn't miss anything. Just one remark: we test 2 cases: `labels` passed to the model / `labels` not passed to the model. (Previously, we only tested the case without passing `labels`.)
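For readers following along, a rough sketch of what these checks amount to; `check_outputs` and the commented-out model calls are placeholders, not the actual helpers from the common test:

```python
import numpy as np

def check_outputs(pt_outputs, tf_outputs, tol=1e-5):
    # Both frameworks must expose exactly the same output keys ...
    assert set(pt_outputs.keys()) == set(tf_outputs.keys())
    # ... and every shared output must agree within the tight absolute tolerance.
    for key in pt_outputs.keys():
        pt_value = pt_outputs[key].detach().numpy()
        tf_value = tf_outputs[key].numpy()
        assert np.amax(np.abs(pt_value - tf_value)) <= tol, key

# Case 1: `labels` not passed (the only case tested previously).
# check_outputs(pt_model(**pt_inputs), tf_model(tf_inputs), tol=1e-5)

# Case 2: `labels` passed, so the loss is part of the compared outputs as well.
# check_outputs(pt_model(**pt_inputs, labels=labels), tf_model(tf_inputs, labels=labels), tol=1e-5)
```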
tfo[pt_nans] = 0

max_diff = np.amax(np.abs(tfo - pto))
self.assertLessEqual(max_diff, 1e-5)
Will we need some kind of relative tolerance here? 1e-5 is a small allowable difference for potentially large values!
Probably yes. Currently I am confident that `1e-5` works in our current testing configs/models. The weights are initialized with a std of 0.02, and with this setting the output values won't get too large. (Lysandre told me that once we go for GPU testing and fp16 precision testing, we might need to deal with larger errors.)
Yes, I would love to know whether this passes on GPU, as it's typically tougher on small differences.
The point regarding fp16 was specifically about bf16: models trained with bf16 typically have much larger logits, so an absolute difference is not ideal and a relative difference should be used instead.
Here we do the init ourselves, so it doesn't apply.
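To make the absolute-vs-relative point concrete, a small synthetic illustration (the 0.02 std mirrors the test initialization mentioned above; the 1e-4 relative threshold is only an example, not a value used in the test):

```python
import numpy as np

rng = np.random.default_rng(0)

# Outputs from a small-std init stay small, so an absolute 1e-5 tolerance is meaningful here.
pt_output = rng.normal(scale=0.02, size=(2, 8)).astype(np.float32)
tf_output = pt_output + rng.normal(scale=1e-7, size=pt_output.shape).astype(np.float32)

abs_diff = np.amax(np.abs(pt_output - tf_output))
rel_diff = abs_diff / max(np.amax(np.abs(pt_output)), np.amax(np.abs(tf_output)))

assert abs_diff <= 1e-5   # what the test currently asserts
assert rel_diff <= 1e-4   # a possible relative criterion for large-valued (e.g. bf16) logits
```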
Thank you for your time spent making this better! Two questions:
- Does it run on GPU?
- How long does it take to run? Our test suite is already taking a significant time to run, so aiming for common tests that run as fast as possible is important.
There's quite a bit of model-specific logic that I'm not particularly enthusiastic about (pre-training models + ConvNext), but I understand why it's rigorous to do it like that here.
For now, I haven't tested it with GPU. I can run it on the office's GPU machine this week (a good chance to learn how to connect to those machines!)
Let me measure the timing of this test with the current master and with this PR. Will report it.
(Yeah, once we fix all the inconsistencies, we can remove all these exceptional conditions.)
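One possible way to measure that timing (a sketch only; the test file and the `-k` expression assume the common test method is named `test_pt_tf_model_equivalence`, which may not match the exact selection used here):

```python
import time

import pytest

# Time the PT/TF equivalence test for a single model; run once on master and once on this PR.
start = time.perf_counter()
exit_code = pytest.main(["tests/test_modeling_tf_bert.py", "-k", "pt_tf_model_equivalence", "-q"])
print(f"exit code: {exit_code}, elapsed: {time.perf_counter() - start:.1f}s")
```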
Thanks for writing those new tests!
Good news! Testing on single GPU with the small tolerance passes. I will address a few style review suggestions. After that, I think it's ready to merge?
It's weird that it took more time when you expected it to take less, no? Can you try running the test suite with
After a thorough verification and a few fixes, this PR is ready (again) from my side. I would love @sgugger (and @LysandreJik when he is back) to check it again, and @gante & @Rocketknight1 if they want (no particular TF-related change compared to the last commit). The following summary might save some review time (again):
I ran this new version of the test, and it passes with the small tolerance. In order to be super sure, I also ran it on CPU/GPU 100 times (very aggressive 🔥 🔥)! All models passed this test, except for:
Regarding the running time, I will measure it and post the results in the next comment.
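As a sketch of what running it many times and saving all the differences could look like - `build_outputs_for_seed` is a hypothetical helper standing in for the per-model setup, not part of the actual test:

```python
import numpy as np

def worst_observed_diff(build_outputs_for_seed, n_runs=100):
    # `build_outputs_for_seed` is assumed to build both models, run one forward
    # pass, and return the two output arrays to compare.
    diffs = []
    for seed in range(n_runs):
        pt_output, tf_output = build_outputs_for_seed(seed)
        diffs.append(float(np.amax(np.abs(pt_output - tf_output))))
    # Keeping `diffs` around makes it possible to share the full distribution later,
    # not just the pass/fail outcome against the 1e-5 threshold.
    return max(diffs), diffs
```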
Still good to me! Left two nits.
tests/test_modeling_tf_common.py
# (this will fail for `TFWav2Vec2Model`)
# attention_mask = tf.concat(
#     [
#         tf.zeros_like(attention_mask[:1], dtype=tf.int32),
#         tf.cast(attention_mask[1:], dtype=tf.int32),
#     ],
#     axis=0,
# )
Should this be cleaned up then?
This block should be activated in the future:
- We want to clean up (as next steps) the large negative values like `1e-9` etc. for the attention mask - currently, this block will fail due to the different such values.
- Once 1.) is done, we should enable this block to make sure there is no regression and that the new models work in the same manner as the existing ones.
- (If you prefer, I can remove this block in this PR, and add it back once we are ready for it.)

(I probably said it the wrong way previously - this block is intended to be kept (and enabled) rather than removed, sorry.)
If the plan is to bring it back, by all means leave it! But make it clear in the comment :-)
Yes, Sir!
Done as follows:
transformers/tests/test_modeling_tf_common.py
Lines 548 to 555 in a6c7157
# make sure no all 0s attention masks - to avoid failure at this moment.
# TODO: remove this line once the TODO below is implemented.
attention_mask = tf.ones_like(attention_mask, dtype=tf.int32)
# Here we make the first sequence with all 0s as attention mask.
# Currently, this will fail for `TFWav2Vec2Model`. This is caused by the different large negative
# values, like `1e-4`, `1e-9`, `1e-30` and `-inf` for attention mask across models/frameworks.
# TODO: enable this block once the large negative values thing is cleaned up.
# (see https://github.com/huggingface/transformers/issues/14859)
(also changed the remaining `tfo`/`pto` to `tf_output`/`pt_output`)
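A standalone sketch of why a fully-masked sequence is sensitive to the choice of the large negative fill value (illustration only, not code from the test):

```python
import numpy as np

def masked_softmax(scores, fill_value):
    masked = scores + fill_value  # every position masked, as with an all-zeros attention mask
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((1, 4), dtype=np.float32)

# A finite fill value such as -1e9 still yields a well-defined (uniform) softmax ...
print(masked_softmax(scores, np.float32(-1e9)))
# ... while -inf yields NaNs, so models/frameworks using different fill values can disagree.
print(masked_softmax(scores, np.float32(-np.inf)))
```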
I think I made a mistake that
After using a Tesla T4 GPU (the same as for the CI GPU testing), I confirmed that this aggressive PT/TF equivalence test passes with the small tolerance (1e-5). I ran this test 1000 times (on GPU) for each TF model. I think it's ready if @LysandreJik is happy with the slightly(?) increased running time. (I saved all the differences. I can share the values if you would like to see them!)
Sounds good to me, thank you for working on it @ydshieh!
What does this PR do?
Make TF pt-tf equivalence test more aggressive.
After the series of fixes done so far, I think it is a good time to include this more aggressive testing on the `master` branch. (Otherwise, newly added models might have undetected issues. For example, the recent `TFConvNextModel` would have had the issue of `hidden_states` not being transposed to match the PyTorch version - I tested it on my local branch and informed the author to fix it; a sketch of this kind of mismatch follows below.) There are still 3 categories of PT/TF inconsistency to address, but they are less urgent in my opinion. See below.
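As an illustration of the kind of mismatch mentioned above, a minimal sketch of a channels-last (TF-style) hidden state compared against a channels-first (PyTorch-style) one; the layout, shapes and tolerance here are assumptions for the sketch, not taken from the actual ConvNext implementations:

```python
import numpy as np

# Hypothetical vision-model hidden states: PyTorch in NCHW, TF in NHWC.
pt_hidden = np.random.rand(2, 96, 56, 56).astype(np.float32)   # (batch, channels, height, width)
tf_hidden_nhwc = np.transpose(pt_hidden, (0, 2, 3, 1))          # (batch, height, width, channels)

# Without transposing, the shapes (let alone the values) do not line up ...
assert pt_hidden.shape != tf_hidden_nhwc.shape

# ... so the TF output has to be transposed back to NCHW before the comparison.
tf_hidden = np.transpose(tf_hidden_nhwc, (0, 3, 1, 2))
assert np.amax(np.abs(pt_hidden - tf_hidden)) <= 1e-5
```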
Currently, the test makes a few exceptions so as not to test these 3 cases (in order to keep the test green) - I added `TODO` comments in the code.

TF: @Rocketknight1 @gante
Test: @LysandreJik @sgugger
TODO in separate PRs