Llama Tokenizer Unexpectedly Producing Unknown Token

### System Info

- `transformers` version: 4.29.2
- Platform: Linux-5.19.0-1027-aws-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)

### Who can help?

@ArthurZucker @younesbelkada I am trying to use special tokens with the LlamaTokenizer in Transformers 4.31.0, and with certain inputs the tokenizer returns a token id of 0, which corresponds to the unknown token. For example, I have added the special token "<REPR_END>", and if I pass it through the tokenizer I get [1, 32003], which is good. Likewise, if I pass the word "inform" through the tokenizer, I get [1, 1871], which is also good.

However, if I pass "<REPR_END>inform" through the tokenizer, I get [1, 32003, 0] which does not make sense. If I try this exact same input in Transformers 4.29.2, I get [1, 32003, 1871] which is correct.

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction
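```python
from transformers.models.llama.tokenization_llama import LlamaTokenizer

# Load the slow Llama tokenizer (the original snippet called AutoTokenizer without importing it)
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf", use_auth_token=...)
tokenizer.pad_token = tokenizer.eos_token
# Register the four custom markers as special tokens
tokenizer.add_tokens(['<TARGET_BEGIN>', '<TARGET_END>', '<REPR_BEGIN>', '<REPR_END>'], special_tokens=True)

# A special token immediately followed by ordinary text triggers the problem
print(tokenizer("<REPR_END>inform"))
```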

### Expected behavior

I expect to get the output [1, 32003, 1871], but I instead get [1, 32003, 0].
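To make the difference easy to spot, here is a minimal sketch of a check for unknown-token ids in an encoded sequence. The helper name `find_unknown_positions` is hypothetical (it is not part of `transformers`), and the default unknown-token id of 0 is taken from the behavior described in this report.

```python
def find_unknown_positions(input_ids, unk_token_id=0):
    """Return the indices in input_ids whose id equals unk_token_id.

    Hypothetical helper for illustration; unk_token_id=0 is assumed
    from the ids quoted in this report.
    """
    return [i for i, tok in enumerate(input_ids) if tok == unk_token_id]


# Ids copied from the report above.
expected_ids = [1, 32003, 1871]  # output under Transformers 4.29.2
observed_ids = [1, 32003, 0]     # output under Transformers 4.31.0

print(find_unknown_positions(expected_ids))  # []
print(find_unknown_positions(observed_ids))  # [2]
```

With a loaded tokenizer, the same check could be run as `find_unknown_positions(tokenizer("<REPR_END>inform")["input_ids"], tokenizer.unk_token_id)`.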
