Description
Environment info
- transformers version: 4.13.0
- Platform: Windows-10-10.0.19043-SP0
- Python version: 3.9.7
- PyTorch version (GPU?): 1.10.0+cu113 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Information
Models I am using: DistilBERT, BERT, RoBERTa
The problem arises when using:
- my own modified scripts: (see below)
The task I am working on is:
- my own task or dataset: (see below)
To reproduce
When adding a new token to a tokenizer with the add_tokens function, the way adjacent parts of the string are tokenized changes in subtle ways (so far I've seen this with DistilBERT, BERT, and RoBERTa; for DistilBERT and BERT it might depend on do_basic_tokenize being set to False when creating the tokenizer, at least in the examples I've found). This might be related to the issue reported in #11531, but that one specifically mentions T5. See the code below for details.
This doesn't seem like intended behavior based on what I can tell from the documentation, but it's possible I'm misunderstanding the right way to add new tokens to get the behavior I'd like. (Currently, to get the expected behavior, I've had to manually modify the vocab file (plus the merges file for RoBERTa) with additional scripting and load the tokenizer from the modified files. The full code for that workaround is a bit long and may not be relevant, so I've only sketched the idea below; I can post the complete version if it would be useful.)
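A minimal sketch of that idea for the (Distil)BERT case (not the exact code I used; the directory name is just for illustration, and it assumes vocab.txt ends with a trailing newline):
import os
from transformers import BertTokenizer
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.save_pretrained('./bert-modified')  # writes vocab.txt (plus the tokenizer config) to the directory
# append the new word to the WordPiece vocab so it becomes an ordinary vocab entry
with open(os.path.join('./bert-modified', 'vocab.txt'), 'a', encoding='utf-8') as f:
    f.write('mynewword\n')
bt_mod = BertTokenizer.from_pretrained('./bert-modified', do_basic_tokenize=False)
bt_mod.tokenize('mynewword.')
# with this approach the adjacent period should keep its usual form: ['mynewword', '##.']
The RoBERTa version also has to touch the merges file, which is part of why the full script is longer.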
Steps to reproduce the behavior:
(Distil)BERT:
from transformers import DistilBertTokenizer, BertTokenizer
new_word = 'mynewword'
# BERT
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', '##ne', '##w', '##word']
bt.tokenize('testing.')
# ['testing', '##.'] (note that the period is tokenized as '##.')
bt.add_tokens(new_word)
bt.tokenize('mynewword') # verify the new token now exists
# ['mynewword']
bt.tokenize('mynewword.')
# ['mynewword', '.'] (note that the period is tokenized as '.' rather than the expected '##.')
# DistilBERT
dbt = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_basic_tokenize=False)
dbt.tokenize('mynewword')
# ['my', '##ne', '##w', '##word']
dbt.tokenize('testing.')
# ['testing', '##.']
dbt.add_tokens(new_word)
dbt.tokenize('mynewword')
# ['mynewword']
dbt.tokenize('mynewword.')
# ['mynewword', '.'] (expected: ['mynewword', '##.'])
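As a sanity check (continuing from the snippets above; I haven't included this output in the main repro, so treat it as an expectation rather than an observed result), strings that don't contain the added token should still tokenize as before, which suggests the change is localized to text adjacent to the new token:
bt.tokenize('testing.')  # same string as before adding the token
# expected to still be ['testing', '##.']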
RoBERTa:
from transformers import RobertaTokenizer
new_word = 'mynewword'
rt = RobertaTokenizer.from_pretrained('roberta-base')
rt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', 'new', 'word']
rt.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa'] (note that the final token includes a preceding 'Ġ')
rt.add_tokens(new_word)
rt.tokenize('mynewword') # verify the new token was added
# ['mynewword']
rt.tokenize('A mynewword a')
# ['A', 'mynewword', 'a'] (note that the final token lacks a 'Ġ')
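If I understand the byte-level BPE correctly, the 'Ġ' just encodes a leading space, so the missing 'Ġ' on the final token suggests the space before 'a' is being dropped when the text is split around the added token. A quick way to see the space dependence (continuing from the snippet above):
rt.tokenize('a')    # no leading space
# ['a']
rt.tokenize(' a')   # with a leading space
# ['Ġa']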
Expected behavior
Adding a token to a tokenizer should not affect tokenization of adjacent elements (when these are not part of the added token).
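To state this concretely for the BERT example above (a hypothetical check, not something the docs promise in this exact form):
# the adjacent '.' should be tokenized the same way next to the new word as next to an ordinary word
assert bt.tokenize('mynewword.')[-1] == bt.tokenize('testing.')[-1]
# currently fails: '.' vs '##.'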