System Info
- transformers version: 4.45.1
- Platform: Linux-5.15.154+-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.23.2
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cpu (False)
- Tensorflow version (GPU?): 2.15.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
- Jax version: 0.4.28
- JaxLib version: 0.4.28
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I trained the Llama 3.2 tokenizer on an Amharic language corpus with a vocab size of 28k, but when I use the new tokenizer to tokenize text, the first token id is still 128000 when it should be the new tokenizer's BOS token id of 0.
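For context, the retraining was done roughly along these lines (a minimal sketch, not the exact script; the base model id and corpus path are placeholders, and `train_new_from_iterator` is assumed):

```python
from transformers import AutoTokenizer

# Base tokenizer to retrain from; the model id is a placeholder for whichever
# Llama 3.2 checkpoint was actually used.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def corpus_iterator(path="amharic_corpus.txt", batch_size=1000):
    # Yield batches of Amharic text lines; the file path is hypothetical.
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Train a new 28k-vocab tokenizer on the Amharic corpus and save it.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=28000)
new_tokenizer.save_pretrained("llama-3.2-amharic-tokenizer-28k")
```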
Here's a tokenization of an example text. As can be seen, the first token id is 128000 when it should be 0.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])
```
Output:
```
tensor([[128000, 1704, 802]])
```
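A quick diagnostic (a sketch reusing the `tokenizer` and `inputs` from the snippet above) comparing what the tokenizer reports against what it actually emits:

```python
# What the tokenizer claims its BOS token and id are, versus the id it actually prepends.
print(tokenizer.bos_token, tokenizer.bos_token_id)

# Vocab size of the retrained tokenizer; 128000 lies far outside a 28k vocab.
print(len(tokenizer))

# Map the emitted ids back to tokens to see which token 128000 corresponds to.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
```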
Expected behavior
The first token id of the tokenized text should be the new tokenizer's BOS token id of 0 instead of the original Llama 3.2 tokenizer's BOS token id of 128000. The vocab size is 28000, so the id 128000 should not appear anywhere in the input_ids list.
This is causing index out of range errors when indexing the embedding matrix of a newly initialized model.
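To illustrate that downstream failure, here is a minimal sketch (the embedding dimension is arbitrary): an embedding matrix sized to the new 28k vocab cannot be indexed with token id 128000.

```python
import torch

# Embedding matrix sized for the retrained 28k vocab; embedding_dim is arbitrary here.
embedding = torch.nn.Embedding(num_embeddings=28000, embedding_dim=512)

# The ids produced above include 128000, which is out of range for a 28k vocab.
ids = torch.tensor([[128000, 1704, 802]])
embedding(ids)  # raises IndexError: index out of range in self
```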