
AddedToken's arguments are ignored when passed to the add_tokens method of slow tokenizers #20734

Closed

Description

@SaulLu

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.10.133+-x86_64-with-glibc2.27
  • Python version: 3.8.16
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.0+cu116 (False)
  • Tensorflow version (GPU?): 2.9.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

An explanation of the bug and steps to reproduce it are available in the following Google Colab notebook: https://colab.research.google.com/drive/19SS6Tzlgo0vntFtM6ZsCYq8BNZ5Dy1cS?usp=sharing. A minimal sketch of the discrepancy is also included below.
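
For convenience, here is a minimal sketch of the kind of discrepancy I am seeing (the checkpoint and token name are illustrative; the Colab contains the full reproduction):

```python
from transformers import AutoTokenizer, AddedToken

# Load the same checkpoint as a slow (Python) and a fast (Rust) tokenizer.
slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# Add the same token with a non-default option to both tokenizers.
token = AddedToken("special_token", single_word=True)
slow.add_tokens([token])
fast.add_tokens([token])

# The fast tokenizer honours single_word=True and does not match the token
# inside a longer word; the slow tokenizer has dropped the flag, so it splits
# on any occurrence of the token's content.
text = "aaa special_tokenbbb"
print(slow.tokenize(text))
print(fast.tokenize(text))
```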

Expected behavior

I would expect the fast and slow tokenizers to treat the arguments of AddedToken in the same way.

I think the loss of information for the slow tokenizer occurs at this line:

new_tokens = [str(tok) for tok in new_tokens]
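
Since `str(tok)` keeps only the token's content, every option set on the AddedToken (e.g. `single_word`, `lstrip`, `rstrip`, `normalized`) is discarded at that point. A quick illustration:

```python
from transformers import AddedToken

tok = AddedToken("special_token", single_word=True, lstrip=True)
# str() returns only the content; the single_word and lstrip flags are lost.
print(str(tok))  # -> "special_token"
```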

Labels

Core: Tokenization (Internals of the library; Tokenization)
