fix: Fixed CodeGenTokenizationTest::test_truncation failing test #32850

Sai-Suraj-27 · 2024-08-16T14:44:59Z

What does this PR do?

Fixed the following failing test

transformers/tests/models/codegen/test_tokenization_codegen.py

Line 253 in a27182b

def test_truncation(self):

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@amyeroberts @ArthurZucker

amyeroberts

Thanks for handling @Sai-Suraj-27!

I'm assuming that this used to pass - it would be good if you were able to track down the offending commit which changed this behaviour

Sai-Suraj-27 · 2024-08-16T16:26:08Z

Thanks for the review @amyeroberts. The whole test function was added in this PR (#17443). From git blame, I see that the test was not modified after this PR.

Edit: Sorry, I forgot to make the commit message such that it triggers the slow tests.

amyeroberts · 2024-08-19T10:58:01Z

Ah, yes, sorry I wasn't clear. I don't mean necessarily the commit when the function was added / modified but the commit when the tests started failing.

The way to do this is:

* Check out the commit where the function / model was first added: `git checkout d6b6fb9`
* Run the tests to see if they pass. If they don't, then they've always been wrong and we can just fix it like this. If they do pass, then we'll need to run `git bisect`.
* Still on [d6b6fb9](https://github.com/huggingface/transformers/commit/d6b6fb9963e094216daa30ebf61224ca1c46921e), we init the bisect and indicate we're on a good commit: `git bisect start` then `git bisect good`.
* We then want to mark a known bad commit, so you can just check out main: `git checkout main` then `git bisect bad`.
* From there, git will bisect the commit history to track down the offending commit. It will check out a commit in your local clone between the known good and bad commits. You should then run the tests. If they pass, mark this commit as good with `git bisect good`; if they fail, mark it with `git bisect bad`. You'll end up on the commit where this issue first occurred within a few iterations. The only issue will be if there's more than one offending commit between the initially marked good and bad ones.
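
For intuition, the search those steps perform can be sketched as a binary search over an ordered commit list. This is a minimal illustration, not git's implementation: `first_bad_commit`, `run_test`, the placeholder commit names `c1` to `c4`, and the breaking index are all hypothetical, and the sketch assumes a linear history with exactly one point where the test flips from passing to failing.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first failing commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the test flips from good to bad only once.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known-good, hi: oldest known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # equivalent to marking the commit with `git bisect good`
        else:
            hi = mid  # equivalent to marking the commit with `git bisect bad`
    return commits[hi]


# Hypothetical linear history; only "d6b6fb9" and "main" come from the thread.
history = ["d6b6fb9", "c1", "c2", "c3", "c4", "main"]
breaking_index = 3  # assumption: the test starts failing at "c3"
tested = []


def run_test(commit: str) -> bool:
    """Stand-in for checking out `commit` and running the test suite."""
    tested.append(commit)
    return history.index(commit) < breaking_index


print(first_bad_commit(history, run_test))  # c3
print(tested)  # ['c2', 'c3']
```

Only two of the four unknown commits get tested here, which is why bisecting converges in a few iterations. In practice, `git bisect run <command>` can automate the marking step by using the command's exit code.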

Let me know if this wasn't clear or if I can help at all

HuggingFaceDocBuilderDev · 2024-08-19T11:09:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sai-Suraj-27 · 2024-08-19T15:09:34Z

Thanks for such a detailed explanation, @amyeroberts. I used git bisect to find the first bad commit that broke the test. It was initially passing here in the PR.

So this was the PR that broke the test. I also confirmed this by running the test on the commit right before this PR was merged, which is this one, and the test passed there.

amyeroberts · 2024-08-19T15:17:57Z

Thanks for investigating @Sai-Suraj-27 and for finding the critical commit!

OK, the change looks OK, let's just confirm with @ArthurZucker that this change in tokenization is expected / allowed following #26570

ArthurZucker

thanks for the fix

ArthurZucker · 2024-08-27T07:20:41Z

tests/models/codegen/test_tokenization_codegen.py


        input_ids = tokenizer.encode(text)
        truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
        decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern)
-        self.assertEqual(decoded_text, expected_trucated_text)
+        self.assertEqual(decoded_text, expected_truncated_text)


TBH this looks good to me! Given that the truncation pattern does not include \n as an individual character, it's a bug fix!
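
To illustrate the point about `\n`, here is a minimal sketch of truncating before the earliest pattern match. It is not the tokenizer's actual implementation: `truncate_before` is a hypothetical helper, the sample text is made up, and applying `re.MULTILINE` to every pattern is an assumption. Only the pattern list is taken from the test above.

```python
import re
from typing import Sequence


def truncate_before(text: str, patterns: Sequence[str]) -> str:
    """Hypothetical helper: cut `text` at the earliest match of any pattern."""
    starts = []
    for pattern in patterns:
        # MULTILINE lets "^" anchor at the start of every line (assumption).
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            starts.append(match.start())
    # No match at all: return the text unchanged.
    return text[: min(starts)] if starts else text


# Pattern list copied from the test; the sample text is illustrative only.
patterns = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
sample = "x = 1\ny = 2\n\n\n# trailing comment"

print(repr(truncate_before(sample, patterns)))  # 'x = 1\ny = 2'
```

The single `\n` between the two assignments survives because only a run of three newlines is in the pattern list, so under these assumptions a lone newline is never a truncation point.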

Fixed failing CodeGenTokenizationTest::test_truncation.

2c695ca

amyeroberts reviewed Aug 16, 2024

Sai-Suraj-27 added 2 commits August 16, 2024 22:10

[run_slow] Codegen

e38e148

[run_slow] codegen

82b8041

ArthurZucker approved these changes Aug 27, 2024

ArthurZucker merged commit 3bf6dd8 into huggingface:main Aug 27, 2024
21 checks passed
