System Info
- `transformers` version: 4.36.2
- Platform: Linux-6.2.0-25-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Load a non-fast tokenizer for mBART
- Add an additional special token to it
- Encode and then decode an input containing the previously added special token
```python
from transformers import MBart50Tokenizer

tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)
print(tk.decode(tk("This is my example sentence with a special <token> token")["input_ids"]))
# >>> 'en_XXThis is my example sentence with a special <token> token</s>'
```
This differs from the fast tokenizer's decoding scheme, which correctly decodes the input with a space after `en_XX`. I believe this is due to the implementation of `legacy_added_tokens` in `transformers/src/transformers/tokenization_utils.py` (lines 1002 to 1022 at commit 3cefac1), and more specifically the second part of the set definition for `legacy_added_tokens`, which appears to account for special tokens that have been added manually after loading. When I disable the special handling for `legacy_added_tokens`, the decoded output is correct, so I was primarily wondering why this handling was added and whether removing it would potentially break other tokenizers.
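To make the question a bit more concrete, here is a small probe I used to see which state the legacy handling seems to key on for the manually added token versus a pre-existing language code. It only uses public tokenizer attributes (`get_added_vocab`, `all_special_tokens`, `convert_tokens_to_ids`, `vocab_size`); my reading of the condition in the linked lines is an assumption, not a description of the actual logic:

```python
# Sketch only: inspect the attributes the legacy handling appears to depend on.
from transformers import MBart50Tokenizer

tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)

for token in ('<token>', 'en_XX'):
    print(
        token,
        token in tk.get_added_vocab(),    # added on top of the base vocab?
        token in tk.all_special_tokens,   # registered as a special token?
        tk.convert_tokens_to_ids(token),  # resulting id
        tk.vocab_size,                    # size of the base vocabulary
    )
```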
Expected behavior
```python
from transformers import MBart50TokenizerFast

fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
fast_tk.add_tokens('<token>', special_tokens=True)
print(fast_tk.decode(fast_tk("This is my example sentence with a special <token> token")["input_ids"]))
# >>> 'en_XX This is my example sentence with a special <token> token</s>'
```
The slow tokenizer's decoding should match the fast tokenizer's output, or at least I would assume so.
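For convenience, a minimal side-by-side check of the two snippets above (same calls as in the reproduction, nothing new assumed beyond running both tokenizer classes in one loop):

```python
from transformers import MBart50Tokenizer, MBart50TokenizerFast

text = "This is my example sentence with a special <token> token"

for cls in (MBart50Tokenizer, MBart50TokenizerFast):
    tok = cls.from_pretrained('facebook/mbart-large-50')
    tok.add_tokens('<token>', special_tokens=True)
    # On 4.36.2 the slow tokenizer drops the space after 'en_XX',
    # while the fast tokenizer keeps it.
    print(cls.__name__, repr(tok.decode(tok(text)["input_ids"])))
```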