
[MBart50] Inconsistent decoding with additional special tokens between slow and fast tokenizers  #28287

Closed

@fleonce

System Info

  • transformers version: 4.36.2
  • Platform: Linux-6.2.0-25-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Load a non-fast tokenizer for mBART
  2. Add an additional special token to it
  3. Encode and then decode an input containing the previously added special token

from transformers import MBart50Tokenizer

tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)
print(tk.decode(tk("This is my example sentence with a special <token> token")["input_ids"]))
# 'en_XXThis is my example sentence with a special <token> token</s>'
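
For reference, a minimal side-by-side check (a sketch, assuming the same checkpoint and transformers 4.36.2) to confirm that the slow and fast tokenizers agree on the encoded ids, so the divergence is isolated to decoding:

from transformers import MBart50Tokenizer, MBart50TokenizerFast

slow = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
fast = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
for t in (slow, fast):
    t.add_tokens('<token>', special_tokens=True)

text = "This is my example sentence with a special <token> token"
ids_slow = slow(text)["input_ids"]
ids_fast = fast(text)["input_ids"]

# If the ids match, the inconsistency lies entirely in decode()
print(ids_slow == ids_fast)
print(repr(slow.decode(ids_slow)))
print(repr(fast.decode(ids_fast)))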

This differs from the fast tokenizer's decoding, which correctly produces a space after en_XX. I believe this is due to the handling of legacy_added_tokens in PreTrainedTokenizer._decode:

legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
    token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
}
# To avoid mixing byte-level and unicode for byte-level BPT
# we need to build string separately for added tokens and byte-level tokens
# cf. https://github.com/huggingface/transformers/issues/1133
sub_texts = []
current_sub_text = []
# TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
for token in filtered_tokens:
    if skip_special_tokens and token in self.all_special_ids:
        continue
    if token in legacy_added_tokens:
        if current_sub_text:
            string = self.convert_tokens_to_string(current_sub_text)
            if len(string) > 0:
                sub_texts.append(string)
            current_sub_text = []
        sub_texts.append(token)
    else:
        current_sub_text.append(token)

More specifically, the second part of the set definition for legacy_added_tokens appears to account for special tokens that were added manually after loading (their ids land at or beyond self.vocab_size).

When the special handling for legacy_added_tokens is disabled, the decoded output is correct, so I was primarily wondering why this handling was added and whether removing it would break other tokenizers.
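
As a sanity check, the legacy_added_tokens expression from the excerpt above can be evaluated directly against the tokenizer from the reproduction (a diagnostic sketch, assuming 4.36.2 internals; _added_tokens_encoder is a private attribute):

# Same expression as in the _decode excerpt above, using the `tk` from the
# reproduction; this only probes where '<token>' ends up.
legacy_added_tokens = set(tk._added_tokens_encoder.keys()) - set(tk.all_special_tokens) | {
    token for token in tk.additional_special_tokens if tk.convert_tokens_to_ids(token) >= tk.vocab_size
}
print('<token>' in legacy_added_tokens)
print(tk.convert_tokens_to_ids('<token>'), tk.vocab_size)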

Expected behavior

from transformers import MBart50TokenizerFast

fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
fast_tk.add_tokens('<token>', special_tokens=True)
print(fast_tk.decode(fast_tk("This is my example sentence with a special <token> token")["input_ids"]))
# 'en_XX This is my example sentence with a special <token> token</s>'

The slow tokenizer's decoding should match the fast tokenizer's output, or at least I would assume so.
