System Info
- transformers version: 4.24.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
- Load a PreTrainedTokenizer that contains unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:
tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
- Use the tokenizer to tokenize a string that contains one of the unique_no_split_tokens, e.g. " <|extratoken_1|> " (a combined, runnable version of both steps follows):
print(tokenizer(" <|extratoken_1|> ").input_ids)
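For convenience, the two steps above as a single snippet; only the import line is added to the code already shown:

import transformers

# Step 1: load the slow tokenizer, whose added tokens (such as <|extratoken_1|>)
# end up in unique_no_split_tokens
tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')

# Step 2: tokenize a string where the no-split token is surrounded by spaces
print(tokenizer(" <|extratoken_1|> ").input_ids)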
Expected behavior
The tokenizer splits the string into 3 tokens (" ", "<|extratoken_1|>" and " ") and gives their ids ([220, 50257, 220]). This is the behavior of PreTrainedTokenizerFast.
But the actual behavior is that the PreTrainedTokenizer only gives the id of "<|extratoken_1|>", i.e. 50257, dropping the surrounding spaces.
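A minimal side-by-side sketch of the two behaviors; it assumes GPT2TokenizerFast is the PreTrainedTokenizerFast counterpart used for the comparison above:

import transformers

text = " <|extratoken_1|> "

fast = transformers.GPT2TokenizerFast.from_pretrained('EleutherAI/gpt-j-6B')
slow = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')

# Fast tokenizer keeps the surrounding spaces as separate tokens
print(fast(text).input_ids)  # [220, 50257, 220]

# Slow tokenizer returns only the id of the no-split token
print(slow(text).input_ids)  # [50257]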