fix: Fixed CodeGenTokenizationTest::test_truncation failing test #32850

Merged
3 commits merged into huggingface:main on Aug 27, 2024

Conversation

Sai-Suraj-27
Contributor

What does this PR do?

Fixed the failing test CodeGenTokenizationTest::test_truncation.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@amyeroberts @ArthurZucker

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for handling @Sai-Suraj-27!

I'm assuming that this used to pass - it would be good if you were able to track down the offending commit which changed this behaviour

@Sai-Suraj-27
Contributor Author

Sai-Suraj-27 commented Aug 16, 2024

Thanks for handling @Sai-Suraj-27!

I'm assuming that this used to pass - it would be good if you were able to track down the offending commit which changed this behaviour

Thanks for the review @amyeroberts. The whole test function was added in this PR (#17443). From git blame, I see that the test was not modified after this PR.

Edit: Sorry, I forgot to make the commit message such that it triggers the slow tests.

@amyeroberts
Collaborator

Thanks for the review @amyeroberts. The whole test function was added in this PR (#17443). From git blame, I see that the test was not modified after this PR.

Ah, yes, sorry I wasn't clear. I don't mean necessarily the commit when the function was added / modified but the commit when the tests started failing.

The way to do this is:

  • Check out the commit where the function / model was first added: git checkout d6b6fb9
  • Run the tests to see if they pass. If they don't, then they've always been wrong and we can just fix them like this. If they do pass, then we'll need to run git bisect.
  • Still on d6b6fb9, start the bisect and mark the current commit as good: git bisect start, then git bisect good.
  • Next, mark a known bad commit. You can just check out main with git checkout main, then run git bisect bad.
  • From there, git will bisect the commit history to track down the offending commit. It will check out a commit between the known good and bad commits in your local clone. You should then run the tests. If they pass, mark the commit as good with git bisect good; if they fail, mark it with git bisect bad. Within a few iterations you'll end up on the commit where the issue first occurred. The only complication is if there's more than one offending commit between the initially marked good and bad ones.
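
To make the narrowing-down step concrete, here is a minimal Python sketch of the idea behind the bisect loop. It is not how git implements it: it assumes a linear history listed oldest to newest, a known-good first commit, a known-bad last commit, and a single good-to-bad transition. The names `first_bad_commit` and `is_good` are hypothetical, and `is_good` stands in for "check out this commit and run the test".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first commit for which is_good() is False.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    there is exactly one good-to-bad transition in between.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # equivalent of marking the midpoint good
        else:
            hi = mid  # equivalent of marking the midpoint bad
    return commits[hi]


# Hypothetical history of eight commits where the test breaks at "c5".
history = [f"c{i}" for i in range(8)]
culprit = first_bad_commit(history, lambda c: int(c[1:]) < 5)
```

Each iteration halves the candidate range, which is why only a few test runs are needed. With more than one good-to-bad transition the result is only one of the offending commits, which is the caveat in the last step. Git can also automate the marking with git bisect run followed by a command whose exit status indicates pass or fail.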

Let me know if this wasn't clear or if I can help at all

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Sai-Suraj-27
Contributor Author

The way to do this is:

* Check out the commit where the function / model was first added: `git checkout d6b6fb9`

* Run the tests to see if they pass. If they don't, then they've always been wrong and we can just fix them like this. If they do pass, then we'll need to run git bisect.

* Still on [d6b6fb9](https://github.com/huggingface/transformers/commit/d6b6fb9963e094216daa30ebf61224ca1c46921e), start the bisect and mark the current commit as good: `git bisect start`, then `git bisect good`.

* Next, mark a known bad commit. You can just check out main with `git checkout main`, then run `git bisect bad`.

* From there, git will bisect the commit history to track down the offending commit. It will check out a commit between the known good and bad commits in your local clone. You should then run the tests. If they pass, mark the commit as good with `git bisect good`; if they fail, mark it with `git bisect bad`. Within a few iterations you'll end up on the commit where the issue first occurred. The only complication is if there's more than one offending commit between the initially marked good and bad ones.

Let me know if this wasn't clear or if I can help at all

Thanks for such a detailed explanation @amyeroberts. I used git bisect to find the first bad commit that broke the test. It was initially passing here in the PR.

So, this was the PR that broke the test. I also confirmed this by running the test on the commit right before this PR was merged, which is this one, and the test was passing there.

@amyeroberts
Collaborator

Thanks for investigating @Sai-Suraj-27 and for finding the critical commit!

OK, the change looks OK, let's just confirm with @ArthurZucker that this change in tokenization is expected / allowed following #26570

Collaborator

@ArthurZucker ArthurZucker left a comment

thanks for the fix


  input_ids = tokenizer.encode(text)
  truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]
  decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern)
- self.assertEqual(decoded_text, expected_trucated_text)
+ self.assertEqual(decoded_text, expected_truncated_text)

Collaborator

TBH this looks good to me! Given that the truncation pattern does not include \n as an individual character, it's a bug fix!
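
As an illustration of that point, here is a minimal sketch of pattern-based truncation. It is not the tokenizer's actual implementation: `truncate_before` is a hypothetical helper, and compiling the patterns with `re.MULTILINE` is an assumption made so that `^` anchors at the start of each line. It shows that a single newline does not match any of the patterns from the test, while three consecutive newlines do.

```python
import re


def truncate_before(text, patterns):
    """Cut `text` at the earliest position where any pattern matches.

    Hypothetical helper for illustration only.
    """
    # Assumption: MULTILINE, so "^" matches at the start of every line.
    compiled = [re.compile(p, re.MULTILINE) for p in patterns]
    starts = [m.start() for m in (c.search(text) for c in compiled) if m]
    return text[: min(starts)] if starts else text


truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"]

# The single "\n" after "a = 1" is kept; the cut happens at the "\n\n\n" run.
text = "a = 1\nb = 2\n\n\n# comment"
result = truncate_before(text, truncation_pattern)

# No pattern matches here, so the text is returned unchanged.
unchanged = truncate_before("x\ny", truncation_pattern)
```

Under these assumptions `result` keeps both assignment lines and drops everything from the blank-line run onward, so text containing only single newlines is never cut by the `"\n\n\n"` pattern.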

@ArthurZucker ArthurZucker merged commit 3bf6dd8 into huggingface:main Aug 27, 2024
21 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
…gingface#32850)

* Fixed failing CodeGenTokenizationTest::test_truncation.

* [run_slow] Codegen

* [run_slow] codegen
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
…gingface#32850)

* Fixed failing CodeGenTokenizationTest::test_truncation.

* [run_slow] Codegen

* [run_slow] codegen