T5Tokenizer Fast and Slow give different results with AddedTokens #16334

Closed
@patrickvonplaten

Description

When the same new token is added to T5Tokenizer (slow) and T5TokenizerFast, the two tokenizers produce different results, which is unexpected.

For example, running the following code:

from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
tok_fast = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

tok.add_tokens("$$$")
tok_fast.add_tokens(AddedToken("$$$", lstrip=False))

prompt = "Hello what is going on $$$ no ? We should"

print("Slow")
print(tok.decode(tok(prompt).input_ids))

print("Fast")
print(tok_fast.decode(tok_fast(prompt).input_ids))

yields different results for each tokenizer (the fast tokenizer drops the space before $$$):

Slow
Hello what is going on $$$ no? We should</s>
Fast
Hello what is going on$$$ no? We should</s>
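
To make the discrepancy easier to spot, here is a minimal sketch that locates the first character at which the two decoded strings above diverge. It is plain Python with no dependency on transformers. The helper name first_divergence is hypothetical, and the two strings are copied verbatim from the output above rather than regenerated.

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if the strings are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # No mismatch in the shared prefix: either equal, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Decoded outputs copied from the issue report above.
slow_out = "Hello what is going on $$$ no? We should</s>"
fast_out = "Hello what is going on$$$ no? We should</s>"

idx = first_divergence(slow_out, fast_out)
print(idx, repr(slow_out[idx]), repr(fast_out[idx]))  # 22 ' ' '$'
```

The strings diverge at index 22, right after "on": the slow tokenizer keeps the space before the added token, while the fast tokenizer does not.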

Environment info

  • transformers version: 4.18.0.dev0
  • Platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • Huggingface_hub version: 0.4.0.dev0
  • PyTorch version (GPU?): 1.10.2+cu102 (True)
  • Tensorflow version (GPU?): 2.8.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.4.0 (cpu)
  • Jax version: 0.3.1
  • JaxLib version: 0.3.0
