Adding custom tokens makes the T5Tokenizer always strip spaces #11531
## Environment info
- `transformers` version: 4.5.1
- Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
- Python version: 3.6.13
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
If it helps, here's also my `pip-chill` output:

```
black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5
```
Note that `corrupt-text` is a custom library; it has nothing to do with the problem, which persists even when it's uninstalled, as the To reproduce section shows.
## Who can help
Since it's a tokenizer issue, probably @LysandreJik.
## Information

I'm using the `T5Tokenizer`. After adding custom tokens, if they're found in the tokenized text, the spaces around them are stripped, even if I explicitly give `add_tokens` and `add_special_tokens` a list of `AddedToken` objects with `lstrip` and `rstrip` explicitly set to `False`.
The problem arises when using:
- my own modified scripts: see the To reproduce section for a code example that fails.
The task I am working on is:
- my own task or dataset: it's not really relevant to this problem, but the code is, once again, in the To reproduce section.
This is likely related to #7901.
## To reproduce
Try running this code:
```python
from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")
```
You will get this:

```
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>
```
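For what it's worth, the `AddedToken` flags don't seem to make any difference. A quick sanity-check sketch of my own (reusing `text` from above, with plain string tokens and no `AddedToken` options) appears to give the same stripped tokens:

```python
# Sanity check (my own sketch, reusing `text` from the snippet above):
# plain string tokens, without any AddedToken options, appear to give
# the same spaces-stripped result.
from transformers import T5Tokenizer

plain = T5Tokenizer.from_pretrained("t5-small")
plain.add_tokens(["doits"])
plain.add_special_tokens({"additional_special_tokens": ["<do_not_touch>"]})
print(plain.tokenize(text))  # same spaces-stripped tokens as above for me
```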
## Expected behavior
We should get:
```
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>
```
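Until this is fixed, here's a rough interim workaround sketch (my own, assuming the `tokenizer` and `text` from the reproduction above): split the text on the custom tokens manually before tokenizing, so the surrounding spaces stay visible to SentencePiece. The exact IDs may still differ slightly from the expected output above.

```python
import re

# Rough interim workaround sketch: split on the custom tokens ourselves so the
# surrounding spaces are still passed to SentencePiece, then map the custom
# tokens to their IDs directly. Assumes `tokenizer` and `text` from above.
custom_tokens = ["doits", "<do_not_touch>"]
pattern = "(" + "|".join(map(re.escape, custom_tokens)) + ")"

ids = []
for piece in re.split(pattern, text):
    if piece in custom_tokens:
        ids.append(tokenizer.convert_tokens_to_ids(piece))
    elif piece:
        ids.extend(tokenizer(piece, add_special_tokens=False)["input_ids"])

print(ids)
```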
**EDIT:** Updated the code to have `rstrip=False`, since I left it out by mistake originally, but it still acts the same.