
PreTrainedTokenizer (slow) strips tokens around unique_no_split_tokens #21120

Closed
@Gompyn

Description

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior (a runnable sketch follows the list):

  1. Load a PreTrainedTokenizer whose vocabulary contains unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:

     tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')

  2. Use the tokenizer to encode a string that contains one of these tokens, e.g. " <|extratoken_1|> ":

     print(tokenizer(" <|extratoken_1|> ").input_ids)
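
Put together, a minimal end-to-end reproduction looks like this (a sketch assuming transformers 4.24.0 and that the EleutherAI/gpt-j-6B tokenizer files can be fetched from the Hub; the output values are the ones reported below):

```python
import transformers

# Slow (pure-Python) tokenizer; the gpt-j-6B checkpoint registers the
# <|extratoken_*|> tokens as unique_no_split_tokens.
tokenizer = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# The spaces around the special token are silently stripped:
print(tokenizer(" <|extratoken_1|> ").input_ids)
# actual output:   [50257]
# expected output: [220, 50257, 220]
```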

Expected behavior

The tokenizer should split the string into 3 tokens (" ", "<|extratoken_1|>", and " ") and return their ids, [220, 50257, 220]. This is the behavior of PreTrainedTokenizerFast.

The actual behavior is that the slow PreTrainedTokenizer strips the surrounding spaces and returns only the id of "<|extratoken_1|>", i.e. [50257].
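
The discrepancy is easy to see side by side (a sketch; the expected ids are taken from the fast tokenizer's output described above):

```python
from transformers import GPT2Tokenizer, GPT2TokenizerFast

slow = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
fast = GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

text = " <|extratoken_1|> "
# Fast backend keeps each surrounding space as its own token (id 220):
print(fast(text).input_ids)   # [220, 50257, 220]
# Slow backend strips them and returns only the special token's id:
print(slow(text).input_ids)   # [50257]
```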

Labels

Core: Tokenization (Internals of the library; Tokenization)
