Is the BOS token id of 128000 **hardcoded** into the llama 3.2 tokenizer? #33998

Closed
@rasyosef

Description

System Info

  • transformers version: 4.45.1
  • Platform: Linux-5.15.154+-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cpu (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
  • Jax version: 0.4.28
  • JaxLib version: 0.4.28
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I trained a new tokenizer from the Llama 3.2 tokenizer on an Amharic language corpus with a vocab size of 28k, but when I use it to tokenize text, the first token id is still 128000 instead of the new tokenizer's BOS token id of 0.

Here is a tokenization of an example text; as can be seen, the first token id is 128000 when it should be 0.

from transformers import AutoTokenizer

# Newly trained 28k-vocab Amharic tokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])

Output:

tensor([[128000,   1704,    802]])

Expected behavior

The first token id of the tokenized text should be the new tokenizer's BOS token id of 0, not the original Llama 3.2 tokenizer's BOS token id of 128000. Since the vocab size is 28,000, the id 128000 should not appear anywhere in the input_ids list.

This is causing index out of range errors when indexing the embedding matrix of a newly initialized model.
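For what it's worth, the Llama 3 fast tokenizer appears to add BOS via a `TemplateProcessing` post-processor, which stores the special token's *id* at construction time; a tokenizer trained from it can carry over the old post-processor and therefore the old id. A minimal sketch of the mechanism and a possible workaround, using a toy `WordLevel` vocab (the stale id 128000 is chosen for illustration):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Toy vocab standing in for a newly trained tokenizer: BOS "<s>" has id 0.
vocab = {"<s>": 0, "[UNK]": 1, "hello": 2, "world": 3}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

# TemplateProcessing records the special token's id when it is built, so a
# post-processor carried over from the base tokenizer keeps the old id.
tok.post_processor = processors.TemplateProcessing(
    single="<s> $A",
    special_tokens=[("<s>", 128000)],  # stale id inherited from the base tokenizer
)
print(tok.encode("hello world").ids)  # [128000, 2, 3] -- out-of-range BOS

# Workaround sketch: rebuild the post-processor with the new BOS id.
tok.post_processor = processors.TemplateProcessing(
    single="<s> $A",
    special_tokens=[("<s>", vocab["<s>"])],
)
print(tok.encode("hello world").ids)  # [0, 2, 3]
```

For the real tokenizer, the same reassignment could presumably be applied to `tokenizer.backend_tokenizer.post_processor` of the loaded fast tokenizer, using `tokenizer.bos_token` and `tokenizer.bos_token_id`.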
