
Conversation

@Aznix07 (Contributor) commented on Dec 16, 2025

What does this PR do?

This PR fixes a regression where the default value of clean_up_tokenization_spaces was changed from True to False in v5.0.0rc1, breaking backward compatibility with v4.x.

Problem:

In transformers v5.0.0rc1, tokenizers no longer clean up spaces before punctuation by default:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
text_input = "beginning , and"    
ids = tokenizer(text_input).input_ids
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# v4.57.3:   "beginning, and"  (spaces cleaned)
# v5.0.0rc1: "beginning , and" (spaces NOT cleaned)

Solution:

Changed the default value in tokenization_utils_base.py (line 1420):

- self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", False)
+ self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)
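
For reference, the cleanup this default re-enables is the long-standing clean_up_tokenization helper in tokenization_utils_base.py. A rough sketch of its v4.x behavior (from memory, not a verbatim copy of the source):

def clean_up_tokenization(out_string: str) -> str:
    # Collapse the extra spaces decode() leaves before punctuation
    # and around common English contractions (approximate v4.x behavior).
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )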

Safety:
This change only affects post-processing of decoded text. It does NOT change (see the quick check below the list):

  • Token IDs sent to the model (verified unchanged)
  • Model predictions
  • Training behavior
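
To back up the first bullet, a quick check under v4.x decode semantics, where the flag can also be passed per call:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
text_input = "beginning , and"

# Encoding never consults the flag, so the model sees identical
# token IDs under either default.
ids = tokenizer(text_input).input_ids

# Only the decoded string differs; expected outputs per the example above.
print(tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))   # "beginning, and"
print(tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))  # "beginning , and"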

Fixes #42898

Before submitting

  • Did you read the [contributor guideline]?
  • Was this discussed/approved via a GitHub issue or the forum?

Who can review?

@ArthurZucker @itazap

@Aznix07 Aznix07 force-pushed the fix-clean-up-tokenization-spaces branch from e93e894 to aeba93b on December 16, 2025 at 13:18
@github-actions commented

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42900&sha=aeba93

@Aznix07 (Contributor, Author) commented on Dec 16, 2025

I tried fixing this on main by changing the default clean_up_tokenization_spaces to True, but tokenization tests (e.g. Wav2Vec2, CLVP) expect the default to remain False in v5 and fail accordingly.

From the issue description, the real problem seems to be in the new v5 tokenization backends (TokenizersBackend / PythonBackend), where decode(..., clean_up_tokenization_spaces=True) doesn’t call a default clean_up_tokenization implementation.

Could you confirm:

  • Should the fix target the v5 branch instead of main?
  • Is the expected v5 behavior that clean_up_tokenization_spaces defaults to False globally, but decode(..., clean_up_tokenization_spaces=True) must always apply the cleanup logic?

I’d be happy to move my work to the v5 branch and implement this properly once I know which files/classes to target; a sketch of the decode behavior I have in mind is below.
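
For concreteness, a minimal sketch of the decode path I would expect under that second interpretation. The _decode_from_backend name is hypothetical, standing in for whatever TokenizersBackend / PythonBackend actually call:

def decode(self, token_ids, skip_special_tokens=False,
           clean_up_tokenization_spaces=None, **kwargs):
    # Hypothetical backend call; placeholder for the v5 backend decoding step.
    text = self._decode_from_backend(token_ids, skip_special_tokens=skip_special_tokens)

    # Fall back to the instance default (False in v5) only when the caller
    # passes nothing explicitly.
    if clean_up_tokenization_spaces is None:
        clean_up_tokenization_spaces = self.clean_up_tokenization_spaces

    # An explicit True must always run the cleanup, regardless of the default.
    if clean_up_tokenization_spaces:
        text = self.clean_up_tokenization(text)
    return text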
