Add Perceiver IO #14487

Merged 147 commits on Dec 8, 2021
Changes from 1 commit

Commits (147)
beef8c1
First draft
NielsRogge Aug 2, 2021
28f9541
Style and remove mlm
NielsRogge Sep 6, 2021
7f70799
Make forward pass work
NielsRogge Sep 6, 2021
7574fc0
More improvements
NielsRogge Sep 6, 2021
77d55ec
More improvements
NielsRogge Sep 7, 2021
bdccd62
Fix bug
NielsRogge Sep 7, 2021
7b7dcd2
More improvements
NielsRogge Sep 7, 2021
25d7725
More improvements
NielsRogge Sep 7, 2021
4a804b6
Add PerceiverTokenizer first draft
NielsRogge Sep 8, 2021
9a84428
Improve conversion script
NielsRogge Sep 8, 2021
65e4edd
More improvements
NielsRogge Sep 8, 2021
649c66a
Make conversion script work for the encoder
NielsRogge Sep 8, 2021
df1c0c9
Make conversion script work with local pickle files
NielsRogge Sep 8, 2021
6a8a981
Style & quality, fix-copies
NielsRogge Sep 8, 2021
79b3f9d
Add dummy input to conversion script
NielsRogge Sep 8, 2021
6d1fb56
Add absolute position embeddings to TextPreProcessor
NielsRogge Sep 8, 2021
9ef09dc
Make forward pass of encoder work
NielsRogge Sep 9, 2021
8e15a42
More improvements
NielsRogge Sep 10, 2021
8852bd6
Move text preprocessor to separate script
NielsRogge Sep 10, 2021
e003753
More improvements
NielsRogge Sep 10, 2021
cfe4d01
More improvements
NielsRogge Sep 10, 2021
2eb4869
Add post processor
NielsRogge Sep 10, 2021
091903e
Make MLM model work
NielsRogge Sep 10, 2021
4f6c31d
Style
NielsRogge Sep 10, 2021
edaf54d
Add PerceiverForMaskedLM
NielsRogge Sep 10, 2021
5a1dea3
Add PerceiverImagePreprocessor
NielsRogge Sep 13, 2021
af33282
Make style
NielsRogge Sep 13, 2021
63b556a
Make PerceiverForImageClassification work
NielsRogge Sep 13, 2021
54d5335
More improvements
NielsRogge Sep 14, 2021
853268e
More improvements
NielsRogge Sep 14, 2021
d579251
Use tokenizer in conversion script
NielsRogge Sep 14, 2021
e8a8772
Use PerceiverForMaskedLM in conversion script
NielsRogge Sep 14, 2021
f8293b9
Define custom PerceiverModelOutput
NielsRogge Sep 14, 2021
3a62362
Improve PerceiverAttention to make it work for both MLM and image cla…
NielsRogge Sep 14, 2021
7795f6d
More improvements
NielsRogge Sep 14, 2021
2c3342f
More improvements
NielsRogge Sep 15, 2021
3151607
More improvements to the conversion script
NielsRogge Sep 15, 2021
a2e6b0e
Make conversion script work for both MLM and image classification
NielsRogge Sep 15, 2021
c1dbe7c
Add PerceiverFeatureExtractor
NielsRogge Sep 15, 2021
e6d9122
More improvements
NielsRogge Sep 15, 2021
cfd32c6
Style and quality
NielsRogge Sep 15, 2021
07b090f
Add center cropping
NielsRogge Sep 15, 2021
4cd722c
Fix bug
NielsRogge Sep 15, 2021
4ed297e
Small fix
NielsRogge Sep 15, 2021
8d4b748
Add print statement
NielsRogge Sep 15, 2021
2bb92b7
Fix bug in image preprocessor
NielsRogge Sep 15, 2021
4248229
Fix bug with conversion script
NielsRogge Sep 15, 2021
a7f75a2
Make output position embeddings an nn.Parameter layer instead of nn.E…
NielsRogge Sep 15, 2021
4592338
Comment out print statements
NielsRogge Sep 16, 2021
dd91215
Add position encoding classes
NielsRogge Sep 16, 2021
ac82fce
More improvements
NielsRogge Sep 16, 2021
b369c09
Use position_encoding_kwargs
NielsRogge Sep 17, 2021
7d1863f
Add PerceiverForImageClassificationFourier
NielsRogge Sep 17, 2021
e77c6b4
Make style & quality
NielsRogge Sep 17, 2021
0a7c3f0
Add PerceiverForImageClassificationConvProcessing
NielsRogge Sep 17, 2021
d3bcf09
Style & quality
NielsRogge Sep 17, 2021
0e4241c
Add flow model
NielsRogge Sep 18, 2021
92c7c62
Move processors to modeling file
NielsRogge Sep 20, 2021
9933942
Make position encodings modular
NielsRogge Sep 20, 2021
00d2ce3
Make basic decoder use modular position encodings
NielsRogge Sep 20, 2021
f1276f8
Add PerceiverForOpticalFlow to conversion script
NielsRogge Sep 20, 2021
15ded27
Add AudioPreprocessor
NielsRogge Sep 21, 2021
1347c20
Make it possible for the basic decoder to use Fourier position embedd…
NielsRogge Sep 21, 2021
8bb1289
Add PerceiverForMultimodalAutoencoding
NielsRogge Sep 21, 2021
8c5d100
Improve model for optical flow
NielsRogge Sep 22, 2021
5dbea95
Improve _build_network_inputs method
NielsRogge Sep 22, 2021
5472500
Add print statement
NielsRogge Sep 22, 2021
fea12e6
Fix device issue
NielsRogge Sep 22, 2021
3daed24
Fix device of Fourier embeddings
NielsRogge Sep 23, 2021
a45c064
Add print statements for debugging
NielsRogge Sep 23, 2021
1e7b1c9
Add another print statement
NielsRogge Sep 23, 2021
8c0f886
Add another print statement
NielsRogge Sep 23, 2021
32cca82
Add another print statement
NielsRogge Sep 23, 2021
f1c3720
Add another print statement
NielsRogge Sep 23, 2021
275a59f
Improve PerceiverAudioPreprocessor
NielsRogge Sep 24, 2021
aedb68e
Improve conversion script for multimodal modal
NielsRogge Sep 24, 2021
adc1205
More improvements
NielsRogge Sep 24, 2021
89da95d
More improvements
NielsRogge Sep 25, 2021
a7f4870
Improve multimodal model
NielsRogge Sep 27, 2021
54021d3
Make forward pass multimodal model work
NielsRogge Sep 28, 2021
327d16c
More improvements
NielsRogge Sep 29, 2021
f3a2d0c
Improve tests
NielsRogge Oct 6, 2021
1f34526
Fix some more tests
NielsRogge Oct 6, 2021
7c4cbbc
Add output dataclasses
NielsRogge Oct 6, 2021
2a4dab2
Make more tests pass
NielsRogge Oct 7, 2021
1205dd9
Add print statements for debuggin
NielsRogge Oct 7, 2021
4408a69
Add tests for image classification
NielsRogge Oct 7, 2021
1a60c6a
Add PerceiverClassifierOutput
NielsRogge Oct 7, 2021
0a1bfcd
More improvements
NielsRogge Oct 7, 2021
27f7190
Make more tests pass for the optical flow model
NielsRogge Oct 7, 2021
6815bf7
Make style & quality
NielsRogge Oct 7, 2021
d7fedc7
Small improvements
NielsRogge Oct 7, 2021
06839cb
Don't support training for optical flow model for now
NielsRogge Oct 11, 2021
5acb88c
Fix _prepare_for_class for tests
NielsRogge Oct 11, 2021
db7b6bb
Make more tests pass, add some docs
NielsRogge Oct 12, 2021
0264043
Add multimodal model to tests
NielsRogge Oct 12, 2021
107c971
Minor fixes
NielsRogge Nov 3, 2021
ed7d7ea
Fix tests
NielsRogge Nov 4, 2021
f62a6f5
Improve conversion script
NielsRogge Nov 4, 2021
d32808b
Make fixup
NielsRogge Nov 4, 2021
08b67de
Remove pos_dim argument
NielsRogge Nov 4, 2021
e7f8329
Fix device issue
NielsRogge Nov 4, 2021
0a93591
Potential fix for OOM
NielsRogge Nov 4, 2021
1091cfe
Revert previous commit
NielsRogge Nov 4, 2021
4c10a9d
Fix test_initialization
NielsRogge Nov 5, 2021
06c7b06
Add print statements for debugging
NielsRogge Nov 5, 2021
adfda8f
Fix print statement
NielsRogge Nov 5, 2021
927dd92
Add print statement
NielsRogge Nov 5, 2021
786f57f
Add print statement
NielsRogge Nov 5, 2021
bde8cf3
Add print statement
NielsRogge Nov 5, 2021
d832391
Add print statement
NielsRogge Nov 8, 2021
8aa3228
Add print statement
NielsRogge Nov 8, 2021
5a84a3e
Add print statement
NielsRogge Nov 8, 2021
8887f98
Remove need for output_shape
NielsRogge Nov 8, 2021
f9800c5
Comment out output_shape
NielsRogge Nov 8, 2021
134bfc4
Remove unnecessary code
NielsRogge Nov 8, 2021
d5187fb
Improve docs
NielsRogge Nov 10, 2021
e9003fb
Fix make fixup
NielsRogge Nov 19, 2021
d965bca
Remove PerceiverTextProcessor from init
NielsRogge Nov 19, 2021
42630e7
Improve docs
NielsRogge Nov 19, 2021
29037ba
Small improvement
NielsRogge Nov 22, 2021
4a2b81a
Apply first batch of suggestions from code review
NielsRogge Nov 30, 2021
3235318
Apply more suggestions from code review
NielsRogge Nov 30, 2021
22becd9
Update docstrings
NielsRogge Nov 30, 2021
dc95e00
Define dicts beforehand for readability
NielsRogge Nov 30, 2021
31ae669
Rename task to architecture in conversion script, include PerceiverMo…
NielsRogge Dec 1, 2021
fa41b1a
Add print statements for debugging
NielsRogge Dec 1, 2021
a3f16f2
Fix tests on GPU
NielsRogge Dec 1, 2021
afcb875
Remove preprocessors, postprocessors and decoders from main init
NielsRogge Dec 1, 2021
c5e3af7
Add integration test
NielsRogge Dec 1, 2021
dc68fed
Fix docs
NielsRogge Dec 1, 2021
ffc6fde
Replace einops by torch
NielsRogge Dec 2, 2021
83a6776
Update for new docs frontend
NielsRogge Dec 2, 2021
46c8e04
Rename PerceiverForImageClassification
NielsRogge Dec 2, 2021
a358e38
Improve docs
NielsRogge Dec 2, 2021
c5ae758
Improve docs
NielsRogge Dec 2, 2021
48503c0
Improve docs of PerceiverModel
NielsRogge Dec 2, 2021
ec0e016
Fix some more tests
NielsRogge Dec 3, 2021
da79d8a
Improve center_crop
NielsRogge Dec 3, 2021
2a3c57c
Add PerceiverForSequenceClassification
NielsRogge Dec 3, 2021
60eefd7
Small improvements
NielsRogge Dec 6, 2021
b36ba76
Fix tests
NielsRogge Dec 6, 2021
e8cf21a
Add integration test for optical flow model
NielsRogge Dec 7, 2021
e084c05
Clean up
NielsRogge Dec 7, 2021
d1c0245
Add tests for tokenizer
NielsRogge Dec 7, 2021
520f132
Fix tokenizer by adding special tokens properly
NielsRogge Dec 8, 2021
cf534be
Fix CI
NielsRogge Dec 8, 2021
18 changes: 9 additions & 9 deletions src/transformers/models/perceiver/tokenization_perceiver.py
@@ -56,12 +56,12 @@ class PerceiverTokenizer(PreTrainedTokenizer):
 
     def __init__(
         self,
-        pad_token="<pad>",
-        bos_token="<s>",
-        eos_token="</s>",
-        mask_token="<mask>",
-        cls_token="<cls>",
-        sep_token="<sep>",
+        pad_token="[PAD]",
+        bos_token="[BOS]",
+        eos_token="[EOS]",
+        mask_token="[MASK]",
+        cls_token="[CLS]",
+        sep_token="[SEP]",
         model_max_length=2048,
         **kwargs
     ) -> None:
@@ -127,7 +127,7 @@ def get_special_tokens_mask(
 
         # normal case: some special tokens
         if token_ids_1 is None:
-            return ([0] * len(token_ids_0)) + [1]
+            return [0] * len(token_ids_0)
         return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
 
     def build_inputs_with_special_tokens(
@@ -138,7 +138,7 @@ def build_inputs_with_special_tokens(
         following format:
 
         - single sequence: ``X``
-        - pair of sequences: ``A </s> B </s>``
+        - pair of sequences: ``A [SEP] B [SEP]``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
@@ -152,7 +152,7 @@ def build_inputs_with_special_tokens(
         if token_ids_1 is None:
             return token_ids_0
         else:
-            return token_ids_0 + token_ids_1
+            return token_ids_0 + [self.sep_token_id] + token_ids_1 + [self.sep_token_id]
 
     def _tokenize(self, text: str) -> List[str]:
         """Take as input a string and return a list of strings (tokens) for words/sub-words"""
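For orientation, a minimal sketch (not part of the diff itself) of what the two helpers changed above return after this commit; ids_a and ids_b are placeholder token ids rather than real byte values:

from transformers import PerceiverTokenizer

# assumes the bracket-style defaults introduced above ([PAD] ... [SEP])
tokenizer = PerceiverTokenizer()
ids_a, ids_b = [10, 11, 12], [20, 21]

# single sequence: returned unchanged, so the special-tokens mask is all zeros
tokenizer.build_inputs_with_special_tokens(ids_a)         # [10, 11, 12]
tokenizer.get_special_tokens_mask(ids_a)                  # [0, 0, 0]

# pair of sequences: "A [SEP] B [SEP]", with 1s marking the inserted [SEP] ids
tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)  # [10, 11, 12, sep_id, 20, 21, sep_id]
tokenizer.get_special_tokens_mask(ids_a, ids_b)           # [0, 0, 0, 1, 0, 0, 1]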
284 changes: 284 additions & 0 deletions tests/test_tokenization_perceiver.py
@@ -0,0 +1,284 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import re
import shutil
import tempfile
import unittest
from typing import Tuple

from transformers import AddedToken, BatchEncoding, PerceiverTokenizer
from transformers.file_utils import cached_property, is_tf_available, is_torch_available

from .test_tokenization_common import TokenizerTesterMixin


if is_torch_available():
FRAMEWORK = "pt"
elif is_tf_available():
FRAMEWORK = "tf"
else:
FRAMEWORK = "jax"


class PerceiverTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

tokenizer_class = PerceiverTokenizer
test_rust_tokenizer = False

def setUp(self):
super().setUp()
tokenizer = PerceiverTokenizer()
tokenizer.save_pretrained(self.tmpdirname)

@cached_property
def perceiver_tokenizer(self):
return PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")

def get_tokenizer(self, **kwargs) -> PerceiverTokenizer:
return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)

def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20, min_length=5) -> Tuple[str, list]:
# XXX The default common tokenizer tests assume that every ID is decodable on its own.
# This assumption is invalid for Perceiver because single bytes might not be
# valid utf-8 (byte 128 for instance).
# Here we're overriding the smallest possible method to provide
# a clean sequence without making the same assumption.

toks = []
for i in range(len(tokenizer)):
try:
tok = tokenizer.decode([i], clean_up_tokenization_spaces=False)
except UnicodeDecodeError:
# this id maps to a byte that is not valid utf-8 on its own; skip it instead of
# appending the stale `tok` from the previous iteration
continue
toks.append((i, tok))

toks = list(filter(lambda t: re.match(r"^[ a-zA-Z]+$", t[1]), toks))
toks = list(filter(lambda t: [t[0]] == tokenizer.encode(t[1], add_special_tokens=False), toks))
if max_length is not None and len(toks) > max_length:
toks = toks[:max_length]
if min_length is not None and len(toks) < min_length and len(toks) > 0:
while len(toks) < min_length:
toks = toks + toks
# toks_str = [t[1] for t in toks]
toks_ids = [t[0] for t in toks]

# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
if " " not in output_txt and len(toks_ids) > 1:
output_txt = (
tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
+ " "
+ tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
)
if with_prefix_space:
output_txt = " " + output_txt
output_ids = tokenizer.encode(output_txt, add_special_tokens=False)
return output_txt, output_ids
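
# For illustration: Perceiver ids correspond to raw utf-8 bytes (shifted to leave
# room for the special tokens), so a single id is not always decodable on its own
# while one multi-byte character spans several ids, e.g.
#   bytes([128]).decode("utf-8")   # raises UnicodeDecodeError (lone continuation byte)
#   "€".encode("utf-8")            # b'\xe2\x82\xac' -> three byte-level ids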

def test_multibytes_char(self):
tokenizer = self.perceiver_tokenizer
src_text = "Unicode €."
encoded = tokenizer(src_text)
encoded_ids = [91, 116, 111, 105, 117, 106, 107, 38, 232, 136, 178, 52]
self.assertEqual(encoded["input_ids"], encoded_ids)

# decoding
decoded = tokenizer.decode(encoded_ids)
self.assertEqual(decoded, "Unicode €.")

encoded = tokenizer("e è é ê ë")
encoded_ids = [107, 38, 201, 174, 38, 201, 175, 38, 201, 176, 38, 201, 177]
self.assertEqual(encoded["input_ids"], encoded_ids)
# decoding
decoded = tokenizer.decode(encoded_ids)
self.assertEqual(decoded, "e è é ê ë")

# encode/decode, but with `encode` instead of `__call__`
self.assertEqual(tokenizer.decode(tokenizer.encode("e è é ê ë")), "e è é ê ë")

def test_prepare_batch_integration(self):
tokenizer = self.perceiver_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
# fmt: off
expected_src_tokens = [71, 38, 114, 117, 116, 109, 38, 118, 103, 120, 103, 109, 120, 103, 118, 110, 38, 108, 117, 120, 38, 121, 123, 115, 115, 103, 120, 111, 128, 103, 122, 111, 117, 116, 52, 0]
# fmt: on
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
self.assertIsInstance(batch, BatchEncoding)

if FRAMEWORK != "jax":
result = list(batch.input_ids.numpy()[0])
else:
result = list(batch.input_ids.tolist()[0])

self.assertListEqual(expected_src_tokens, result)

self.assertEqual((2, 36), batch.input_ids.shape)
self.assertEqual((2, 36), batch.attention_mask.shape)

def test_empty_target_text(self):
tokenizer = self.perceiver_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
# check if input_ids are returned and no decoder_input_ids
self.assertIn("input_ids", batch)
self.assertIn("attention_mask", batch)
self.assertNotIn("decoder_input_ids", batch)
self.assertNotIn("decoder_attention_mask", batch)

def test_max_length_integration(self):
tokenizer = self.perceiver_tokenizer
tgt_text = [
"Summary of the text.",
"Another summary.",
]
with tokenizer.as_target_tokenizer():
targets = tokenizer(
tgt_text, max_length=32, padding="max_length", truncation=True, return_tensors=FRAMEWORK
)
self.assertEqual(32, targets["input_ids"].shape[1])

# cannot use the default test_save_and_load_tokenizer method because the tokenizer has no vocab
def test_save_and_load_tokenizer(self):
# safety check on max_len default value so we are sure the test works
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
self.assertNotEqual(tokenizer.model_max_length, 42)

# Now let's start the test
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()

sample_text = " He is very happy, UNwant\u00E9d,running"
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)

after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)

shutil.rmtree(tmpdirname)

tokenizers = self.get_tokenizers(model_max_length=42)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()

sample_text = " He is very happy, UNwant\u00E9d,running"
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)

after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
self.assertIn("new_additional_special_token", after_tokenizer.additional_special_tokens)
self.assertEqual(after_tokenizer.model_max_length, 42)

tokenizer = tokenizer.__class__.from_pretrained(tmpdirname, model_max_length=43)
self.assertEqual(tokenizer.model_max_length, 43)

shutil.rmtree(tmpdirname)

# There is a conflict between the default value of extra_ids and adding a new special token through additional_special_tokens
# We need to add the extra_ids in the list of the arg additional_special_tokens
def test_special_tokens_initialization_with_non_empty_additional_special_tokens(self):
tokenizer_list = []
if self.test_slow_tokenizer:
tokenizer_list.append((self.tokenizer_class, self.get_tokenizer()))

if self.test_rust_tokenizer:
tokenizer_list.append((self.rust_tokenizer_class, self.get_rust_tokenizer()))

for tokenizer_class, tokenizer_utils in tokenizer_list:
with tempfile.TemporaryDirectory() as tmp_dir:
tokenizer_utils.save_pretrained(tmp_dir)

with open(os.path.join(tmp_dir, "special_tokens_map.json"), encoding="utf-8") as json_file:
special_tokens_map = json.load(json_file)

with open(os.path.join(tmp_dir, "tokenizer_config.json"), encoding="utf-8") as json_file:
tokenizer_config = json.load(json_file)

added_tokens_extra_ids = [f"<extra_id_{i}>" for i in range(125)]

special_tokens_map["additional_special_tokens"] = added_tokens_extra_ids + [
"an_additional_special_token"
]
tokenizer_config["additional_special_tokens"] = added_tokens_extra_ids + [
"an_additional_special_token"
]

with open(os.path.join(tmp_dir, "special_tokens_map.json"), "w", encoding="utf-8") as outfile:
json.dump(special_tokens_map, outfile)
with open(os.path.join(tmp_dir, "tokenizer_config.json"), "w", encoding="utf-8") as outfile:
json.dump(tokenizer_config, outfile)

# the following checks allow us to verify that our test works as expected, i.e. that the tokenizer takes
# into account the new value of additional_special_tokens given in the "tokenizer_config.json" and
# "special_tokens_map.json" files
tokenizer_without_change_in_init = tokenizer_class.from_pretrained(
tmp_dir,
)
self.assertIn(
"an_additional_special_token", tokenizer_without_change_in_init.additional_special_tokens
)
self.assertEqual(
["an_additional_special_token"],
tokenizer_without_change_in_init.convert_ids_to_tokens(
tokenizer_without_change_in_init.convert_tokens_to_ids(["an_additional_special_token"])
),
)

# Now we test that we can change the value of additional_special_tokens in the from_pretrained
new_added_tokens = added_tokens_extra_ids + [AddedToken("a_new_additional_special_token", lstrip=True)]
tokenizer = tokenizer_class.from_pretrained(
tmp_dir,
additional_special_tokens=new_added_tokens,
)

self.assertIn("a_new_additional_special_token", tokenizer.additional_special_tokens)
self.assertEqual(
["a_new_additional_special_token"],
tokenizer.convert_ids_to_tokens(
tokenizer.convert_tokens_to_ids(["a_new_additional_special_token"])
),
)

# tokenizer can be instantiated without any pretrained files, so no need for pretrained tokenizer list
def test_pretrained_model_lists(self):
pass

# tokenizer does not have vocabulary
def test_get_vocab(self):
pass

# inputs cannot be pretokenized since ids depend on whole input string and not just on single characters
def test_pretokenized_inputs(self):
pass

# tests all ids in vocab => vocab doesn't exist so unnecessary to test
def test_conversion_reversible(self):
pass