Adding custom tokens makes the T5Tokenizer always strip spaces #11531
## Environment info
- `transformers` version: 4.5.1
- Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
- Python version: 3.6.13
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
If it helps, here's also my `pip-chill` output:

```
black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5
```
Note that `corrupt-text` is a custom library; it has nothing to do with the problem, which persists even when it's uninstalled, as the To reproduce section shows.
## Who can help
Since it's a tokenizer issue, probably @LysandreJik.
## Information

I'm using the `T5Tokenizer`. After adding custom tokens, if they're found in the tokenized text, the spaces around them are stripped, even if I explicitly give `add_tokens` and `add_special_tokens` a list of `AddedToken` objects with `lstrip` and `rstrip` explicitly set to `False`.
The problem arises when using:
- my own modified scripts: see the To reproduce section for a code example that fails.
The task I am working on is:
- my own task or dataset: it's not really relevant to this problem, but the code is, once again, in the To reproduce section.
This is likely related to #7901.
## To reproduce
Try running this code:
```python
from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")
```
You will get this:

```
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>
```
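For what it's worth, the `AddedToken` flags don't seem to make any difference. A quick sanity-check sketch of my own (reusing `text` from above, with plain string tokens and no `AddedToken` options) appears to give the same stripped tokens:

```python
# Sanity check (my own sketch, reusing `text` from the snippet above):
# plain string tokens, without any AddedToken options, appear to give
# the same spaces-stripped result.
from transformers import T5Tokenizer

plain = T5Tokenizer.from_pretrained("t5-small")
plain.add_tokens(["doits"])
plain.add_special_tokens({"additional_special_tokens": ["<do_not_touch>"]})
print(plain.tokenize(text))  # same spaces-stripped tokens as above for me
```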
## Expected behavior
We should get:
```
Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>
```
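Until this is fixed, here's a rough interim workaround sketch (my own, assuming the `tokenizer` and `text` from the reproduction above): split the text on the custom tokens manually before tokenizing, so the surrounding spaces stay visible to SentencePiece. The exact IDs may still differ slightly from the expected output above.

```python
import re

# Rough interim workaround sketch: split on the custom tokens ourselves so the
# surrounding spaces are still passed to SentencePiece, then map the custom
# tokens to their IDs directly. Assumes `tokenizer` and `text` from above.
custom_tokens = ["doits", "<do_not_touch>"]
pattern = "(" + "|".join(map(re.escape, custom_tokens)) + ")"

ids = []
for piece in re.split(pattern, text):
    if piece in custom_tokens:
        ids.append(tokenizer.convert_tokens_to_ids(piece))
    elif piece:
        ids.extend(tokenizer(piece, add_special_tokens=False)["input_ids"])

print(ids)
```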
**EDIT:** Updated the code to have `rstrip=False`, since I left it out by mistake originally, but it still acts the same.