clean_up_tokenization_spaces behavior changes in v5 #42898

@apaniukov

Description

System Info

  • transformers version: 4.57.3/5.0.0rc1
  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code sample:

from transformers import AutoTokenizer, __version__ as hf_version

checkpoint = "google/gemma-2-9b-it"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

text_input = "beginning , and"
ids = hf_tokenizer(text_input, return_tensors="np").input_ids
hf_detokenized = hf_tokenizer.decode(ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"{hf_version}: `{hf_detokenized}`")

Outputs:

4.57.3:   `beginning, and`
5.0.0rc1: `beginning , and`

The TokenizersBackend class (formerly PreTrainedTokenizerFast) has no default clean_up_tokenization implementation, which previously lived in PreTrainedTokenizerBase. This changes behavior for tokenizers that don't provide their own method, such as GemmaTokenizer.

The PythonBackend class provides the default behavior inside its ._decode method, but no default clean_up_tokenization implementation either, which might also be a problem.
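For reference, the v4 default cleanup (the static clean_up_tokenization method on PreTrainedTokenizerBase; reproduced from memory here, so treat it as a sketch) simply collapses the space before common punctuation and English contractions, which is what turns `beginning , and` into `beginning, and`:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the v4 default cleanup: removes the space before
    punctuation and common English contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("beginning , and"))  # beginning, and
```

In v4 this ran at the end of decode whenever clean_up_tokenization_spaces resolved to True, regardless of which backend produced the raw string.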

Expected behavior

  1. The clean_up_tokenization_spaces flag in the .decode method works as it did in v4
  2. No change in detokenization behavior for tokenizers that have self.clean_up_tokenization_spaces=True
