Description
When adding a new token to T5TokenizerFast and/or T5Tokenizer, the two tokenizers produce different results, which is unexpected.
E.g. running the following code:
from transformers import AutoTokenizer, AddedToken
tok = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
tok_fast = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
tok.add_tokens("$$$")
tok_fast.add_tokens(AddedToken("$$$", lstrip=False))
prompt = "Hello what is going on $$$ no ? We should"
print("Slow")
print(tok.decode(tok(prompt).input_ids))
print("Fast")
print(tok_fast.decode(tok_fast(prompt).input_ids))
yields a different result for each tokenizer:
Slow
Hello what is going on $$$ no? We should</s>
Fast
Hello what is going on$$$ no? We should</s>
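The only difference between the two decodings is the space in front of the added token: the slow tokenizer keeps it, the fast tokenizer strips it. A quick character-level diff of the two reported output strings (just the strings above, no tokenizers involved) makes that explicit:

```python
import difflib

# Decoded outputs exactly as reported above.
slow_out = "Hello what is going on $$$ no? We should</s>"
fast_out = "Hello what is going on$$$ no? We should</s>"

# Keep only the diff lines where the two strings actually differ.
delta = [d for d in difflib.ndiff(slow_out, fast_out) if not d.startswith(" ")]
print(delta)  # a single deletion: the space before "$$$"
```

This matches the `lstrip` semantics of `AddedToken`: the fast tokenizer behaves as if whitespace to the left of the added token is stripped, while the slow tokenizer preserves it.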
Environment info
- transformers version: 4.18.0.dev0
- Platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.34
- Python version: 3.9.7
- Huggingface_hub version: 0.4.0.dev0
- PyTorch version (GPU?): 1.10.2+cu102 (True)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.4.0 (cpu)
- Jax version: 0.3.1
- JaxLib version: 0.3.0