Adding new tokens to various models changes tokenization of adjacent elements in strings #14770

Closed
@mawilson1234

Description

Environment info

  • transformers version: 4.13.0
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.0+cu113 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik @SaulLu

Information

Models I am using: DistilBERT, BERT, RoBERTa

The problem arises when using:

  • my own modified scripts: (see below)

The task I am working on is:

  • my own task or dataset: (see below)

To reproduce

Adding a new token with the add_tokens method changes how adjacent parts of the string are tokenized, in subtle ways. So far I've found this with DistilBERT, BERT, and RoBERTa; for DistilBERT and BERT, it may depend on do_basic_tokenize being set to False when creating the tokenizer, at least in the examples I've found. (This might be related to the issue reported in #11531, but that one specifically mentions T5.) See the code below for details.
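
To illustrate the do_basic_tokenize dependence, here is a minimal sketch (based on my reading of the default BasicTokenizer behavior, which splits punctuation off before WordPiece runs): with do_basic_tokenize=True the period is always a standalone '.', so the discrepancy never becomes visible.

from transformers import BertTokenizer

# With basic tokenization (the default), punctuation is split off before
# WordPiece runs, so the trailing period is always a standalone '.'.
bt_basic = BertTokenizer.from_pretrained('bert-base-uncased')
bt_basic.tokenize('testing.')
# ['testing', '.']

# With do_basic_tokenize=False, WordPiece sees 'testing.' as one word and
# marks the period as a continuation piece.
bt_no_basic = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt_no_basic.tokenize('testing.')
# ['testing', '##.']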

This doesn't seem like intended behavior as far as I can tell from the documentation, but it's possible I'm misunderstanding the right way to add new tokens to get the behavior I'd like. (Currently, to get the expected behavior, I manually modify the vocab file (plus the merges file for RoBERTa) with additional scripting and load the tokenizer from the modified files. I could post that workaround here if it would be useful, but I've left it out for now since it's a bit long and may not be relevant.)
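
For reference, a rough sketch of what that kind of workaround could look like for BERT. This is illustrative only: it assumes the standard vocab.txt format (one token per line) and is not the actual script; the RoBERTa version additionally requires edits to merges.txt.

from transformers import BertTokenizer

# Save the pretrained tokenizer files locally so the vocab can be edited.
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.save_pretrained('bert-modified')

# Append the new token directly to vocab.txt, making it part of the base
# vocabulary rather than an "added token".
with open('bert-modified/vocab.txt', 'a', encoding='utf-8') as f:
    f.write('mynewword\n')

# Reload from the modified files; the new token now goes through the normal
# WordPiece longest-match path, so adjacent pieces tokenize as before.
bt = BertTokenizer.from_pretrained('bert-modified', do_basic_tokenize=False)
bt.tokenize('mynewword.')
# ['mynewword', '##.']

Note that a model using the edited vocab would still need its embedding matrix resized to the new vocab size (e.g. model.resize_token_embeddings(len(bt))).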

Steps to reproduce the behavior:

(Distil)BERT:

from transformers import DistilBertTokenizer, BertTokenizer

new_word = 'mynewword'

# BERT
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', '##ne', '##w', '##word']

bt.tokenize('testing.')
# ['testing', '##.'] (note that the period is tokenized as '##.')

bt.add_tokens(new_word)
bt.tokenize('mynewword') # verify the new token now exists
# ['mynewword']

bt.tokenize('mynewword.')
# ['mynewword', '.'] (note that the period is tokenized as '.' rather than the expected '##.')

# DistilBERT
dbt = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_basic_tokenize=False)
dbt.tokenize('mynewword')
# ['my', '##ne', '##w', '##word']

dbt.tokenize('testing.')
# ['testing', '##.']

dbt.add_tokens(new_word)
dbt.tokenize('mynewword')
# ['mynewword']

dbt.tokenize('mynewword.')
# ['mynewword', '.'] (expected: ['mynewword', '##.'])

RoBERTa:

from transformers import RobertaTokenizer

new_word = 'mynewword'
rt = RobertaTokenizer.from_pretrained('roberta-base')

rt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', 'new', 'word']

rt.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa'] (note that the final token includes a preceding 'Ġ')

rt.add_tokens(new_word)
rt.tokenize('mynewword') # verify the new token was added
# ['mynewword']

rt.tokenize('A mynewword a')
# ['A', 'mynewword', 'a'] (note that the final token lacks a 'Ġ')

Expected behavior

Adding a token to a tokenizer should not affect tokenization of adjacent elements (when these are not part of the added token).
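
As a concrete check of this invariant, here is a minimal failing assertion (a sketch of the expectation, using the BERT example from above):

from transformers import BertTokenizer

bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
reference = bt.tokenize('testing.')  # ['testing', '##.']

bt.add_tokens('mynewword')
result = bt.tokenize('mynewword.')   # currently ['mynewword', '.']

# The trailing period should be tokenized the same way in both contexts.
assert result[-1] == reference[-1], f"period retokenized: {result[-1]!r} vs {reference[-1]!r}"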
