
Conversation

@Aznix07 (Contributor) commented on Dec 16, 2025

What does this PR do?

This PR fixes a regression where the default value of clean_up_tokenization_spaces was changed from True to False in v5.0.0rc1, breaking backward compatibility with v4.x.

Problem:

In transformers v5.0.0rc1, tokenizers no longer clean up spaces before punctuation by default:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
text_input = "beginning , and"    
ids = tokenizer(text_input).input_ids
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# v4.57.3:   "beginning, and"  (spaces cleaned)
# v5.0.0rc1: "beginning , and" (spaces NOT cleaned)

Solution:

Changed the default value in tokenization_utils_base.py (line 1420):

- self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", False)
+ self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)
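
For reference, the cleanup this default re-enables is the long-standing clean_up_tokenization helper in tokenization_utils_base.py. A rough sketch of its v4.x behavior (from memory, not a verbatim copy of the source):

def clean_up_tokenization(out_string: str) -> str:
    # Collapse the extra spaces decode() leaves before punctuation
    # and around common English contractions (approximate v4.x behavior).
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )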

Safety:
This change only affects post-processing of decoded text. It does NOT change (see the quick check below the list):

  • Token IDs sent to the model (verified unchanged)
  • Model predictions
  • Training behavior
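
To back up the first bullet, a quick check under v4.x decode semantics, where the flag can also be passed per call:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
text_input = "beginning , and"

# Encoding never consults the flag, so the model sees identical
# token IDs under either default.
ids = tokenizer(text_input).input_ids

# Only the decoded string differs; expected outputs per the example above.
print(tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))   # "beginning, and"
print(tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))  # "beginning , and"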

Fixes #42898

Before submitting

  • Did you read the [contributor guideline]?
  • Was this discussed/approved via a GitHub issue or the forum?

Who can review?

@ArthurZucker @itazap

@Aznix07 Aznix07 force-pushed the fix-clean-up-tokenization-spaces branch from e93e894 to aeba93b on December 16, 2025 at 13:18
@github-actions commented

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42900&sha=aeba93

@Aznix07 (Contributor, Author) commented on Dec 16, 2025

I tried fixing this on main by changing the default clean_up_tokenization_spaces to True, but tokenization tests (e.g. Wav2Vec2, CLVP) expect the default to remain False in v5 and fail accordingly.

From the issue description, the real problem seems to be in the new v5 tokenization backends (TokenizersBackend / PythonBackend), where decode(..., clean_up_tokenization_spaces=True) doesn’t call a default clean_up_tokenization implementation.

Could you confirm:

  • Should the fix target the v5 branch instead of main?
  • Is the expected v5 behavior that clean_up_tokenization_spaces defaults to False globally, but decode(..., clean_up_tokenization_spaces=True) must always apply the cleanup logic?

I’d be happy to move my work to the v5 branch and implement this properly once I know which files/classes to target; a sketch of the decode behavior I have in mind is below.
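
For concreteness, a minimal sketch of the decode path I would expect under that second interpretation. The _decode_from_backend name is hypothetical, standing in for whatever TokenizersBackend / PythonBackend actually call:

def decode(self, token_ids, skip_special_tokens=False,
           clean_up_tokenization_spaces=None, **kwargs):
    # Hypothetical backend call; placeholder for the v5 backend decoding step.
    text = self._decode_from_backend(token_ids, skip_special_tokens=skip_special_tokens)

    # Fall back to the instance default (False in v5) only when the caller
    # passes nothing explicitly.
    if clean_up_tokenization_spaces is None:
        clean_up_tokenization_spaces = self.clean_up_tokenization_spaces

    # An explicit True must always run the cleanup, regardless of the default.
    if clean_up_tokenization_spaces:
        text = self.clean_up_tokenization(text)
    return text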
