After tokenizers upgrade, the length of the token does not correspond to the length of the model #36532

@CurtainRight

Description

System Info

transformers: 4.48.1
tokenizers: 0.2.1
python: 3.9

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code snippet:

from transformers import (
    AutoModelForSeq2SeqLM,
    PegasusTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = PegasusTokenizer.from_pretrained('IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese')
model = AutoModelForSeq2SeqLM.from_pretrained(
    'IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese',
    config=config  # config is defined elsewhere in the training script
)

training_args = Seq2SeqTrainingArguments(
    output_dir=config['model_name'],
    evaluation_strategy="epoch",
    # report_to="none",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    predict_with_generate=True,
    logging_steps=0.1
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Error message:

[screenshot: error traceback]

Trial process:
My original versions were transformers 4.29.1 and tokenizers 0.13.3, and the model trained and ran inference normally.
After upgrading, the error above occurred and training was no longer possible, so I resized the model's embeddings with model.resize_token_embeddings(len(tokenizer)). Original model vocabulary size: 50000; tokenizer length after loading: 50103. The model I trained this way produced abnormal inference results.

[screenshot: abnormal inference output]

Trying again, I kept tokenizers at 0.13.3 and upgraded transformers to 4.33.3 (1. I need to upgrade because the NPU only supports version 4.3.20; 2. this is the highest transformers version compatible with these tokenizers). With this combination, training and inference are normal. As soon as tokenizers is greater than 0.13.3, the tokenizer length changes.
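For reference, the resize step described above can be illustrated without downloading the checkpoint. This is a toy sketch of what model.resize_token_embeddings(len(tokenizer)) does to the input embedding matrix; the 50000/50103 sizes come from this report, and the embedding dimension is hypothetical:

```python
import torch
import torch.nn as nn

# Toy illustration of the mismatch: the checkpoint's embedding matrix has
# 50000 rows, but the upgraded tokenizer reports 50103 token ids.
old_vocab, new_vocab, dim = 50000, 50103, 16
emb = nn.Embedding(old_vocab, dim)

# Looking up an id >= old_vocab (e.g. 50102) would raise an IndexError.
# Resizing keeps the trained rows and appends freshly initialised ones,
# which is the effect of model.resize_token_embeddings(len(tokenizer)).
resized = nn.Embedding(new_vocab, dim)
with torch.no_grad():
    resized.weight[:old_vocab] = emb.weight

assert resized.weight.shape == (new_vocab, dim)
```

The 103 appended rows are randomly initialised, which is one reason a model resized this way can produce degraded output until those embeddings are trained.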

Expected behavior

I expect the tokenizer to remain compatible with the original code.
