AddedToken problems in LlamaTokenizer #25232

Closed
@wlhgtc

Description

System Info

  • transformers version: 4.31.0
  • Platform: macOS-13.5-x86_64-i386-64bit
  • Python version: 3.9.5
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker This is a bug reported by a colleague of mine. I'm not sure whether it is already covered by the list in #23909.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code:

from transformers import LlamaTokenizer

txt = "hello\n" + "<bot>" + "How are you"
dd = {"additional_special_tokens": ["<bot>"]}

# Two identical slow tokenizers; only tokenizer2 gets "<bot>" registered as a special token.
tokenizer1 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
)
tokenizer2 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
)

tokenizer2.add_special_tokens(dd)
t1 = tokenizer1.tokenize(txt)
t2 = tokenizer2.tokenize(txt)
print(t1)
print(t2)

Output:

t1: ['▁hello', '<0x0A>', '<', 'bot', '>', 'How', '▁are', '▁you']
t2: ['▁hello', '<bot>', '▁How', '▁are', '▁you']

Expected behavior

Adding "<bot>" as a special token should not swallow the "\n" (tokenized as <0x0A>) that precedes it:

Output:

t1: ['▁hello', '<0x0A>', '<', 'bot', '>', 'How', '▁are', '▁you']
t2: ['▁hello', '<0x0A>', '<bot>', '▁How', '▁are', '▁you']
