Closed
Description
System Info
- transformers version: 4.31.0
- Platform: macOS-13.5-x86_64-i386-64bit
- Python version: 3.9.5
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker This is a bug reported by my colleague, and I'm not sure whether it is already covered by the list in #23909.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Code:
from transformers import LlamaTokenizer

txt = "hello\n" + "<bot>" + "How are you"
dd = {"additional_special_tokens": ["<bot>"]}

# Baseline tokenizer, no special tokens added
tokenizer1 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
)
# Identical tokenizer, but with "<bot>" registered as a special token
tokenizer2 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
)
tokenizer2.add_special_tokens(dd)

t1 = tokenizer1.tokenize(txt)
t2 = tokenizer2.tokenize(txt)
print("t1:", t1)
print("t2:", t2)
Output:
t1: ['▁hello', '<0x0A>', '<', 'bot', '>', 'How', '▁are', '▁you']
t2: ['▁hello', '<bot>', '▁How', '▁are', '▁you']
Expected behavior
Registering "<bot>" as a special token should not swallow the newline that precedes it. Expected output:
t1: ['▁hello', '<0x0A>', '<', 'bot', '>', 'How', '▁are', '▁you']
t2: ['▁hello', '<0x0A>', '<bot>', '▁How', '▁are', '▁you']
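Diffing the actual and expected t2 makes the bug concrete: the only token that goes missing is the newline byte token '<0x0A>' adjacent to the added special token. A minimal sketch, using the token lists printed above as literals (no model files required), that pinpoints the dropped token:

```python
# Token lists copied from the outputs above
t2_actual = ['▁hello', '<bot>', '▁How', '▁are', '▁you']
t2_expected = ['▁hello', '<0x0A>', '<bot>', '▁How', '▁are', '▁you']

# The only difference is '<0x0A>' (the "\n"): adding "<bot>" as a
# special token makes the tokenizer strip the whitespace next to it.
dropped = [tok for tok in t2_expected if tok not in t2_actual]
print(dropped)  # ['<0x0A>']
```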