clean_up_tokenization_spaces behavior changes in v5 #42898

@apaniukov

Description

System Info

  • transformers version: 4.57.3/5.0.0rc1
  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code sample:

from transformers import AutoTokenizer, __version__ as hf_version

checkpoint = "google/gemma-2-9b-it"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

text_input = "beginning , and"
ids = hf_tokenizer(text_input, return_tensors="np").input_ids
hf_detokenized = hf_tokenizer.decode(ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"{hf_version}: `{hf_detokenized}`")

Outputs:

4.57.3:   `beginning, and`
5.0.0rc1: `beginning , and`

The TokenizersBackend class (formerly PreTrainedTokenizerFast) has no default clean_up_tokenization implementation, which previously lived in PreTrainedTokenizerBase. This changes behavior for tokenizers that don't provide their own method, such as GemmaTokenizer.

The PythonBackend class provides the default behavior inside its ._decode method, but no default clean_up_tokenization implementation either, which might also be a problem.
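For reference, the v4 default cleanup (the static clean_up_tokenization method on PreTrainedTokenizerBase; reproduced from memory here, so treat it as a sketch) simply collapses the space before common punctuation and English contractions, which is what turns `beginning , and` into `beginning, and`:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the v4 default cleanup: removes the space before
    punctuation and common English contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("beginning , and"))  # beginning, and
```

In v4 this ran at the end of decode whenever clean_up_tokenization_spaces resolved to True, regardless of which backend produced the raw string.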

Expected behavior

  1. The clean_up_tokenization_spaces flag in the .decode method works as it did in v4
  2. No change in detokenization behavior for tokenizers that have self.clean_up_tokenization_spaces=True
