System Info
- transformers version: 4.45.1
- Platform: Linux-5.15.154+-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.23.2
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cpu (False)
- Tensorflow version (GPU?): 2.15.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
- Jax version: 0.4.28
- JaxLib version: 0.4.28
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I trained the Llama 3.2 tokenizer on an Amharic language corpus with a vocab size of 28k, but when I use the new tokenizer to tokenize text, the first token id is still 128000 when it should be the new tokenizer's BOS token id of 0.
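For context, the retraining was done roughly along these lines (a minimal sketch, not the exact script; the base model id and corpus path are placeholders, and `train_new_from_iterator` is assumed):

```python
from transformers import AutoTokenizer

# Base tokenizer to retrain from; the model id is a placeholder for whichever
# Llama 3.2 checkpoint was actually used.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def corpus_iterator(path="amharic_corpus.txt", batch_size=1000):
    # Yield batches of Amharic text lines; the file path is hypothetical.
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Train a new 28k-vocab tokenizer on the Amharic corpus and save it.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=28000)
new_tokenizer.save_pretrained("llama-3.2-amharic-tokenizer-28k")
```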
Here's a tokenization of an example text. As can be seen, the first token id is 128000 when it should be 0.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])
```
Output:
```
tensor([[128000, 1704, 802]])
```
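A quick diagnostic (a sketch reusing the `tokenizer` and `inputs` from the snippet above) comparing what the tokenizer reports against what it actually emits:

```python
# What the tokenizer claims its BOS token and id are, versus the id it actually prepends.
print(tokenizer.bos_token, tokenizer.bos_token_id)

# Vocab size of the retrained tokenizer; 128000 lies far outside a 28k vocab.
print(len(tokenizer))

# Map the emitted ids back to tokens to see which token 128000 corresponds to.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
```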
Expected behavior
The first token id of the tokenized text should be the new tokenizer's BOS token id of 0 instead of the original Llama 3.2 tokenizer's BOS token id of 128000. The vocab size is 28000, so the id 128000 should not appear anywhere in the input_ids list.
This is causing index out of range errors when indexing the embedding matrix of a newly initialized model.
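To illustrate that downstream failure, here is a minimal sketch (the embedding dimension is arbitrary): an embedding matrix sized to the new 28k vocab cannot be indexed with token id 128000.

```python
import torch

# Embedding matrix sized for the retrained 28k vocab; embedding_dim is arbitrary here.
embedding = torch.nn.Embedding(num_embeddings=28000, embedding_dim=512)

# The ids produced above include 128000, which is out of range for a 28k vocab.
ids = torch.tensor([[128000, 1704, 802]])
embedding(ids)  # raises IndexError: index out of range in self
```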