## Environment info
- `transformers` version: 4.5.1
- Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
- Python version: 3.6.13
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
If it helps, here's also my `pip-chill`:

```text
black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5
```
Note that `corrupt-text` is a custom library, and the problem persists even when it is uninstalled. It has nothing to do with the issue, as you can see in the To reproduce section.
## Who can help
Since it's a tokenizer issue, probably @LysandreJik.
## Information
I'm using `T5Tokenizer`. After adding custom tokens, if the input is tokenized and those tokens are found in the text, the spaces around them get stripped, even though I explicitly pass `add_tokens` and `add_special_tokens` a list of `AddedToken` objects with `lstrip` and `rstrip` set to `False`.
The problem arises when using:
See the To reproduce section for a code example that demonstrates the issue.
The tasks I am working on is:
It's not really relevant to this problem, but the code is, once again, in the To reproduce section.
This is likely related to #7901.
## To reproduce
Try running this code:

```python
from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Register a regular added token and an additional special token, both with
# stripping explicitly disabled.
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")
```
You will get this:

```text
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>
```
## Expected behavior
We should get:

```text
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>
```
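To make the discrepancy concrete, here's a plain-Python sketch (no `transformers` needed) of roughly what `convert_tokens_to_string` does for a SentencePiece tokenizer: join the tokens and map the `▁` word-boundary marker back to a space. The `rough_detok` helper is hypothetical, just for illustration; the point is that the actual token sequence has lost the `▁` markers around the added tokens, so no detokenizer could recover the spaces.

```python
def rough_detok(tokens):
    # Rough sketch of SentencePiece detokenization: concatenate the pieces
    # and turn the '▁' word-boundary marker back into a space.
    return "".join(tokens).replace("▁", " ").strip()

actual = ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
expected = ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']

print(rough_detok(actual))    # Bruhdoits<do_not_touch>  (spaces lost)
print(rough_detok(expected))  # Bruh doits <do_not_touch>
```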
EDIT: Updated the code to have `rstrip=False` (I originally left it out by mistake), but the behavior is the same.