System Info
- transformers version: 4.57.3 / 5.0.0rc1
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
Who can help?
@ArthurZucker and @itazap
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Code sample:
from transformers import AutoTokenizer, __version__ as hf_version
checkpoint = "google/gemma-2-9b-it"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
text_input = "beginning , and"
ids = hf_tokenizer(text_input, return_tensors="np").input_ids
hf_detokenized = hf_tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]
print(f"{hf_version}: `{hf_detokenized}`")
Outputs:
4.57.3: `beginning, and`
5.0.0rc1: `beginning , and`
The TokenizersBackend class (formerly PreTrainedTokenizerFast) has no default clean_up_tokenization method implementation, which previously lived in PreTrainedTokenizerBase. This changes behavior for tokenizers that don't provide their own method, such as GemmaTokenizer.
The PythonBackend class provides default behavior in the ._decode method, but no default clean_up_tokenization implementation either, which might also be a problem.
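For reference, a sketch of the default cleanup that PreTrainedTokenizerBase shipped in v4. This is reconstructed from memory of the v4 source, so the exact replacement list is an assumption rather than a verbatim copy:

# Sketch of the v4-era PreTrainedTokenizerBase.clean_up_tokenization default
# (assumption: reconstructed from memory, not copied verbatim).
# It collapses the extra spaces the decoder leaves around punctuation and contractions.
def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )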
Expected behavior
- clean_up_tokenization_spaces flag in the .decode method works as in v4
- No changes in detokenization behavior for tokenizers that have self.clean_up_tokenization_spaces=True (a possible user-side workaround is sketched below)
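Until the default is restored, a minimal user-side workaround is to re-apply the cleanup after decoding. A sketch under that assumption; the helper name decode_clean is hypothetical, not a library API:

from transformers import AutoTokenizer

def decode_clean(tokenizer, ids):
    # Hypothetical helper: decode without relying on the backend's (missing) cleanup,
    text = tokenizer.decode(ids, skip_special_tokens=True)
    # then collapse the spaces around punctuation manually, as the v4 default did.
    for src, dst in ((" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")):
        text = text.replace(src, dst)
    return text

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
ids = tokenizer("beginning , and").input_ids
print(decode_clean(tokenizer, ids))  # expected: beginning, and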