
Adding custom tokens makes the T5Tokenizer always strip spaces #11531

Closed
@suflaj

Description

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.6.13
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

If it helps, here's also my pip-chill:

black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5

Note that corrupt-text is a custom library, and the problem persists even when it is uninstalled. It has nothing to do with the issue, as the To reproduce section shows.

Who can help

Since it's a tokenizer issue, probably @LysandreJik.

Information

I'm using the T5Tokenizer. After adding custom tokens, when the input is tokenized and those tokens occur in the text, the spaces around them are stripped, even though I explicitly pass add_tokens and add_special_tokens lists of AddedToken objects with lstrip and rstrip set to False.
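For reference, this is how I read the AddedToken flags (my interpretation of the tokenizers docs, so treat it as an assumption):

from tokenizers import AddedToken

# My understanding: lstrip/rstrip control whether the token "absorbs" the
# whitespace on its left/right side when it is matched in the text. With both
# set to False, the surrounding spaces should be left untouched.
regular_token = AddedToken("doits", lstrip=False, rstrip=False)
special_token = AddedToken("<do_not_touch>", lstrip=False, rstrip=False)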

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

See the To reproduce section for a code example that doesn't work.

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

The task isn't really relevant to this problem, but the code is, once again, in the To reproduce section.

This is likely related to #7901.

To reproduce

Try running this code:

from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# Register a regular token and an additional special token,
# both with whitespace stripping explicitly disabled
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")

You will get this:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>

Expected behavior

We should get:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>

EDIT: Updated the code to use rstrip=False (I originally made that mistake), but the behavior is unchanged.
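For what it's worth, I haven't checked whether the fast tokenizer behaves the same way. Something like this (an untested sketch, same setup as above) could be used to compare the two:

from transformers import T5Tokenizer, T5TokenizerFast
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

# Untested sketch: run the identical setup through both the slow and the fast
# tokenizer and check whether the spaces around the added tokens survive.
for cls in (T5Tokenizer, T5TokenizerFast):
    tokenizer = cls.from_pretrained("t5-small")
    tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
    tokenizer.add_special_tokens(
        {
            "additional_special_tokens": [
                AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
            ]
        }
    )
    tokens = tokenizer.tokenize(text)
    print(cls.__name__, tokens, "->", tokenizer.convert_tokens_to_string(tokens))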
