AlbertTokenizer doesn't decode special tokens properly #15003

Closed
@NielsRogge

Description

Related to #5142, AlbertTokenizer (which uses SentencePiece) doesn't decode special tokens (like [CLS], [MASK]) properly. This issue was discovered when adding the Nystromformer model (#14659), which uses this tokenizer.

To reproduce (Transformers v4.15 or below):

!pip install -q transformers sentencepiece

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")

text = "hello world"
encoding = tokenizer(text)

# Decode each ID on its own to see what every token maps back to
for token_id in encoding.input_ids:
    print(token_id, tokenizer.decode([token_id]))

This prints:

2 
10975 hello
126 world
3 

As can be seen, the special tokens are added ([CLS] with ID 2 and [SEP] with ID 3), but they are decoded to an empty string. This is because the convert_tokens_to_string method of AlbertTokenizer calls the decode method of Google's SentencePiece library, which does not take special tokens into account.
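One possible direction for a fix is to split the token sequence at special tokens, decode only the ordinary pieces with SentencePiece, and emit the special tokens verbatim. The following is a minimal sketch of that idea, not the actual library code. The function takes the decoder and the set of special tokens as parameters so it can run without a model; `toy_decode_pieces` is a hypothetical stand-in for the SentencePiece decoder. In the real tokenizer these would presumably come from the SentencePiece processor and from `tokenizer.all_special_tokens`, which is an assumption here.

```python
from typing import Callable, Collection, List, Sequence


def convert_tokens_to_string(
    tokens: Sequence[str],
    special_tokens: Collection[str],
    decode_pieces: Callable[[List[str]], str],
) -> str:
    """Decode tokens, passing special tokens through unchanged."""
    parts: List[str] = []    # decoded chunks and special tokens, in order
    pending: List[str] = []  # ordinary pieces waiting to be decoded together

    def flush() -> None:
        # Decode the ordinary pieces collected so far as one chunk.
        if pending:
            text = decode_pieces(list(pending))
            if text:
                parts.append(text)
            pending.clear()

    for token in tokens:
        if token in special_tokens:
            flush()
            parts.append(token)  # keep the special token as-is
        else:
            pending.append(token)
    flush()
    return " ".join(parts)


def toy_decode_pieces(pieces: List[str]) -> str:
    # Hypothetical stand-in for SentencePiece decoding: join the pieces
    # and turn the word-boundary marker (U+2581) back into spaces.
    return "".join(pieces).replace("\u2581", " ").strip()


special = {"[CLS]", "[SEP]"}
print(convert_tokens_to_string(["[CLS]"], special, toy_decode_pieces))
print(
    convert_tokens_to_string(
        ["[CLS]", "\u2581hello", "\u2581world", "[SEP]"], special, toy_decode_pieces
    )
)
```

With the toy decoder, this prints [CLS] for the single special token and [CLS] hello world [SEP] for the full sequence. Joining the chunks with a single space is a design choice of this sketch; the spacing a real fix should produce around special tokens is open to discussion.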

The issue does not occur with the fast tokenizer:

from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v1")

text = "hello world"
encoding = tokenizer(text)

# Same per-ID decoding as above, now with the fast tokenizer
for token_id in encoding.input_ids:
    print(token_id, tokenizer.decode([token_id]))

This prints:

2 [CLS]
10975 hello
126 world
3 [SEP]

A similar issue happened for T5, and this was fixed in #8435.
