Llama Tokenizer Unexpectedly Producing Unknown Token #25176

@rehaanahmad2013

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.19.0-1027-aws-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

Who can help?

@ArthurZucker @younesbelkada

I am trying to use special tokens with the LlamaTokenizer in Transformers 4.31.0. With certain inputs, the tokenizer returns token id 0, which corresponds to the unknown token. For example, I have added the special token "<REPR_END>"; if I pass it through the tokenizer on its own, I get [1, 32003], which is good. Likewise, if I pass the word "inform" through the tokenizer, I get [1, 1871], which is also good.

However, if I pass "<REPR_END>inform" through the tokenizer, I get [1, 32003, 0], which does not make sense: the last id is the unknown token instead of 1871. With this exact same input in Transformers 4.29.2, I get [1, 32003, 1871], which is correct.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers.models.llama.tokenization_llama import LlamaTokenizer

# Load the slow Llama tokenizer (the use_auth_token value is omitted here).
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf", use_auth_token=...)
tokenizer.pad_token = tokenizer.eos_token
# Register the four markers as special tokens.
tokenizer.add_tokens(['<TARGET_BEGIN>', '<TARGET_END>', '<REPR_BEGIN>', '<REPR_END>'], special_tokens=True)

print(tokenizer("<REPR_END>inform"))

Expected behavior

I expect the output [1, 32003, 1871], but I instead get [1, 32003, 0].
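
To make the failure easier to spot in longer inputs, here is a minimal sketch of a check that reports where the unknown-token id shows up in a list of ids. The helper name `find_unknown_positions` is hypothetical (it is not part of transformers), and the default of 0 for the unknown-token id is taken from the behavior described above rather than read from the tokenizer.

```python
from typing import List, Sequence


def find_unknown_positions(input_ids: Sequence[int], unk_token_id: int = 0) -> List[int]:
    """Return the indices in input_ids that hold the unknown-token id."""
    return [i for i, tok in enumerate(input_ids) if tok == unk_token_id]


# Ids copied from the report above; 0 is the unknown-token id described there.
observed = [1, 32003, 0]      # output reported on Transformers 4.31.0
expected = [1, 32003, 1871]   # output reported on Transformers 4.29.2

print(find_unknown_positions(observed))  # [2]
print(find_unknown_positions(expected))  # []
```

With a loaded tokenizer, the same check could be run as `find_unknown_positions(tokenizer("<REPR_END>inform")["input_ids"], tokenizer.unk_token_id)`.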

Labels

Core: Tokenization (Internals of the library; Tokenization.)
