Description
System Info
- `transformers` version: 4.29.2
- Platform: Linux-5.19.0-1027-aws-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
Who can help?
@ArthurZucker @younesbelkada I am trying to use special tokens with the LlamaTokenizer in Transformers 4.31.0, and with certain inputs the tokenizer returns token id 0, which corresponds to the unknown token. For example, I have added the special token "<REPR_END>". If I pass it through the tokenizer, I get [1, 32003], which is correct. Likewise, if I pass the word "inform" through the tokenizer, I get [1, 1871], which is also correct.
However, if I pass "<REPR_END>inform" through the tokenizer, I get [1, 32003, 0], which does not make sense. With the exact same input in Transformers 4.29.2, I get [1, 32003, 1871], which is correct.
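For context, my understanding is that slow tokenizers first split the input around any added special tokens and then encode each remaining plain-text segment independently, so the regression appears to affect how the segment immediately after the special token ("inform") is encoded. A minimal, hypothetical sketch of that splitting step (illustrative only, not the actual transformers internals):

```python
import re

def split_on_added_tokens(text, added_tokens):
    # Split the input around added special tokens so each token is kept
    # as its own segment; the remaining plain-text segments would then
    # be tokenized independently. Purely illustrative helper name.
    pattern = "(" + "|".join(re.escape(t) for t in added_tokens) + ")"
    return [seg for seg in re.split(pattern, text) if seg]

segments = split_on_added_tokens("<REPR_END>inform", ["<REPR_END>"])
print(segments)  # ['<REPR_END>', 'inform']
```

Under this model, "inform" is encoded as a standalone segment, so whatever changed between 4.29.2 and 4.31.0 in how such a segment is encoded (e.g. prefix-space handling) would explain the unk id.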
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf", use_auth_token=...)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_tokens(['<TARGET_BEGIN>', '<TARGET_END>', '<REPR_BEGIN>', '<REPR_END>'], special_tokens=True)
print(tokenizer("<REPR_END>inform"))
```
Expected behavior
I expect the output [1, 32003, 1871], but instead I get [1, 32003, 0].