Description
Related to #5142: AlbertTokenizer (which uses SentencePiece) doesn't decode special tokens (like [CLS], [MASK]) properly. This issue was discovered when adding the Nystromformer model (#14659), which uses this tokenizer.
To reproduce (Transformers v4.15 or below):
!pip install -q transformers sentencepiece
from transformers import AlbertTokenizer
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
text = "hello world"
encoding = tokenizer(text)
for id in encoding.input_ids:
    print(id, tokenizer.decode([id]))
This prints:
2
10975 hello
126 world
3
As can be seen, the special tokens are added ([CLS] with ID 2 and [SEP] with ID 3), but they are decoded to empty strings. This is because the convert_tokens_to_string method of AlbertTokenizer uses the decode method of Google's SentencePiece library, which doesn't take special tokens into account.
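For illustration, the point where the special tokens get lost can also be seen by calling convert_tokens_to_string directly (the outputs in the comments are what I'd expect; the exact pieces depend on the SentencePiece vocabulary):

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.convert_ids_to_tokens(tokenizer("hello world").input_ids)
print(tokens)
# expected: ['[CLS]', '▁hello', '▁world', '[SEP]']

# The special tokens are silently dropped when the pieces go through SentencePiece:
print(tokenizer.convert_tokens_to_string(tokens))
# expected: 'hello world'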
The issue does not occur with the fast tokenizer:
from transformers import AlbertTokenizerFast
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v1")
text = "hello world"
encoding = tokenizer(text)
for id in encoding.input_ids:
    print(id, tokenizer.decode([id]))
Which prints:
2 [CLS]
10975 hello
126 world
3 [SEP]
A similar issue happened for T5, and this was fixed in #8435.
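A possible fix would mirror that T5 patch: decode only the non-special pieces with SentencePiece and splice the special tokens back in. A minimal sketch (assuming AlbertTokenizer keeps its sp_model and all_special_tokens attributes; this is just the idea, not necessarily the final implementation):

def convert_tokens_to_string(self, tokens):
    # Decode non-special pieces with SentencePiece and keep special tokens as-is.
    current_sub_tokens = []
    out_string = ""
    for token in tokens:
        if token in self.all_special_tokens:
            # Don't let SentencePiece see special tokens, since it would drop them.
            out_string += self.sp_model.decode_pieces(current_sub_tokens) + token + " "
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    out_string += self.sp_model.decode_pieces(current_sub_tokens)
    return out_string.strip()

With a change along these lines, tokenizer.decode([2]) should return "[CLS]" instead of an empty string, matching the fast tokenizer.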