-
Notifications
You must be signed in to change notification settings - Fork 26.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fixed CodeGenTokenizationTest::test_truncation failing test #32850
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for handling @Sai-Suraj-27!
I'm assuming that this used to pass - it would be good if you were able to track down the offending commit which changed this behaviour
Thanks for the review @amyeroberts. The whole test function was added in this PR (#17443). From git blame, I see that the test was not modified after this PR. Edit: Sorry, I forgot to make the commit message such that it triggers the slow tests. |
Ah, yes, sorry I wasn't clear. I don't mean necessarily the commit when the function was added / modified but the commit when the tests started failing. The way to do this is:
Let me know if this wasn't clear or if I can help at all |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Thanks for a such a detailed explanation @amyeroberts. I used
|
Thanks for investigating @Sai-Suraj-27 and for finding the critical commit! OK, the change looks OK, let's just confirm with @ArthurZucker that this change in tokenization is expected / allowed following #26570 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the fix
|
||
input_ids = tokenizer.encode(text) | ||
truncation_pattern = ["^#", re.escape("<|endoftext|>"), "^'''", '^"""', "\n\n\n"] | ||
decoded_text = tokenizer.decode(input_ids, truncate_before_pattern=truncation_pattern) | ||
self.assertEqual(decoded_text, expected_trucated_text) | ||
self.assertEqual(decoded_text, expected_truncated_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH this looks good to me! Given that the truncation pattern does not include \n
as an individual character, it's a bug fix!
…gingface#32850) * Fixed failing CodeGenTokenizationTest::test_truncation. * [run_slow] Codegen * [run_slow] codegen
…gingface#32850) * Fixed failing CodeGenTokenizationTest::test_truncation. * [run_slow] Codegen * [run_slow] codegen
…gingface#32850) * Fixed failing CodeGenTokenizationTest::test_truncation. * [run_slow] Codegen * [run_slow] codegen
What does this PR do?
Fixed the following failing test
transformers/tests/models/codegen/test_tokenization_codegen.py
Line 253 in a27182b
Fixes # (issue)
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@amyeroberts @ArthurZucker