Description
Related to #5142: AlbertTokenizer (which uses SentencePiece) doesn't decode special tokens (like [CLS], [MASK]) properly. This issue was discovered when adding the Nystromformer model (#14659), which uses this tokenizer.
To reproduce (Transformers v4.15 or below):
!pip install -q transformers sentencepiece
from transformers import AlbertTokenizer
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
text = "hello world"
encoding = tokenizer(text)
for id in encoding.input_ids:
    print(id, tokenizer.decode([id]))
This prints:
2
10975 hello
126 world
3
As can be seen, the special tokens are added ([CLS] with ID 2 and [SEP] with ID 3), but they are decoded to empty strings. This is because the convert_tokens_to_string method of AlbertTokenizer uses the decode method of Google's SentencePiece library, which doesn't take special tokens into account.
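For illustration, the point where the special tokens get lost can also be seen by calling convert_tokens_to_string directly (the outputs in the comments are what I'd expect; the exact pieces depend on the SentencePiece vocabulary):

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.convert_ids_to_tokens(tokenizer("hello world").input_ids)
print(tokens)
# expected: ['[CLS]', '▁hello', '▁world', '[SEP]']

# The special tokens are silently dropped when the pieces go through SentencePiece:
print(tokenizer.convert_tokens_to_string(tokens))
# expected: 'hello world'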
The issue does not occur with the fast tokenizer:
from transformers import AlbertTokenizerFast
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v1")
text = "hello world"
encoding = tokenizer(text)
for id in encoding.input_ids:
    print(id, tokenizer.decode([id]))
Which prints:
2 [CLS]
10975 hello
126 world
3 [SEP]
A similar issue happened for T5, and this was fixed in #8435.
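A possible fix would mirror that T5 patch: decode only the non-special pieces with SentencePiece and splice the special tokens back in. A minimal sketch (assuming AlbertTokenizer keeps its sp_model and all_special_tokens attributes; this is just the idea, not necessarily the final implementation):

def convert_tokens_to_string(self, tokens):
    # Decode non-special pieces with SentencePiece and keep special tokens as-is.
    current_sub_tokens = []
    out_string = ""
    for token in tokens:
        if token in self.all_special_tokens:
            # Don't let SentencePiece see special tokens, since it would drop them.
            out_string += self.sp_model.decode_pieces(current_sub_tokens) + token + " "
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    out_string += self.sp_model.decode_pieces(current_sub_tokens)
    return out_string.strip()

With a change along these lines, tokenizer.decode([2]) should return "[CLS]" instead of an empty string, matching the fast tokenizer.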