Description
Environment info
- transformers version: 4.13.0
- Platform: Windows-10-10.0.19043-SP0
- Python version: 3.9.7
- PyTorch version (GPU?): 1.10.0+cu113 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Information
Models I am using: DistilBERT, BERT, RoBERTa
The problem arises when using:
- my own modified scripts: (see below)
The task I am working on is:
- my own task or dataset: (see below)
To reproduce
When adding a new token to a tokenizer with the add_tokens function, the way adjacent parts of the string are tokenized changes in subtle ways (so far I've seen this with DistilBERT, BERT, and RoBERTa; for DistilBERT and BERT it might depend on do_basic_tokenize being set to False when creating the tokenizer, at least in the examples I've found). This might be related to the issue reported in #11531, but that one specifically mentions T5. See the code below for details.
This doesn't seem like intended behavior based on what I can tell from the documentation, but it's possible I'm misunderstanding the right way to add new tokens to get the behavior I'd like. (Currently, to get the expected behavior, I've had to manually modify the vocab file (plus the merges file for RoBERTa) with additional scripting and load the tokenizer from the modified files. The full code for that workaround is a bit long and may not be relevant, so I've only sketched the idea below; I can post the complete version if it would be useful.)
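A minimal sketch of that idea for the (Distil)BERT case (not the exact code I used; the directory name is just for illustration, and it assumes vocab.txt ends with a trailing newline):
import os
from transformers import BertTokenizer
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.save_pretrained('./bert-modified')  # writes vocab.txt (plus the tokenizer config) to the directory
# append the new word to the WordPiece vocab so it becomes an ordinary vocab entry
with open(os.path.join('./bert-modified', 'vocab.txt'), 'a', encoding='utf-8') as f:
    f.write('mynewword\n')
bt_mod = BertTokenizer.from_pretrained('./bert-modified', do_basic_tokenize=False)
bt_mod.tokenize('mynewword.')
# with this approach the adjacent period should keep its usual form: ['mynewword', '##.']
The RoBERTa version also has to touch the merges file, which is part of why the full script is longer.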
Steps to reproduce the behavior:
(Distil)BERT:
from transformers import DistilBertTokenizer, BertTokenizer
new_word = 'mynewword'
# BERT
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', '##ne', '##w', '##word']
bt.tokenize('testing.')
# ['testing', '##.'] (note that the period is tokenized as '##.')
bt.add_tokens(new_word)
bt.tokenize('mynewword') # verify the new token now exists
# ['mynewword']
bt.tokenize('mynewword.')
# ['mynewword', '.'] (note that the period is tokenized as '.' rather than the expected '##.')
# DistilBERT
dbt = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_basic_tokenize=False)
dbt.tokenize('mynewword')
# ['my', '##ne', '##w', '##word']
dbt.tokenize('testing.')
# ['testing', '##.']
dbt.add_tokens(new_word)
dbt.tokenize('mynewword')
# ['mynewword']
dbt.tokenize('mynewword.')
# ['mynewword', '.'] (expected: ['mynewword', '##.'])
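As a sanity check (continuing from the snippets above; I haven't included this output in the main repro, so treat it as an expectation rather than an observed result), strings that don't contain the added token should still tokenize as before, which suggests the change is localized to text adjacent to the new token:
bt.tokenize('testing.')  # same string as before adding the token
# expected to still be ['testing', '##.']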
RoBERTa:
from transformers import RobertaTokenizer
new_word = 'mynewword'
rt = RobertaTokenizer.from_pretrained('roberta-base')
rt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', 'new', 'word']
rt.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa'] (note that the final token includes a preceding 'Ġ')
rt.add_tokens(new_word)
rt.tokenize('mynewword') # verify the new token was added
# ['mynewword']
rt.tokenize('A mynewword a')
# ['A', 'mynewword', 'a'] (note that the final token lacks a 'Ġ')
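If I understand the byte-level BPE correctly, the 'Ġ' just encodes a leading space, so the missing 'Ġ' on the final token suggests the space before 'a' is being dropped when the text is split around the added token. A quick way to see the space dependence (continuing from the snippet above):
rt.tokenize('a')    # no leading space
# ['a']
rt.tokenize(' a')   # with a leading space
# ['Ġa']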
Expected behavior
Adding a token to a tokenizer should not affect tokenization of adjacent elements (when these are not part of the added token).
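To state this concretely for the BERT example above (a hypothetical check, not something the docs promise in this exact form):
# the adjacent '.' should be tokenized the same way next to the new word as next to an ordinary word
assert bt.tokenize('mynewword.')[-1] == bt.tokenize('testing.')[-1]
# currently fails: '.' vs '##.'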