AutoTokenizer - add from_model_name method #13623

Closed
50 changes: 49 additions & 1 deletion src/transformers/models/auto/tokenization_auto.py
@@ -348,9 +348,57 @@ class AutoTokenizer:
    def __init__(self):
        raise EnvironmentError(
            "AutoTokenizer is designed to be instantiated "
            "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method."
            "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` or `AutoTokenizer.from_model_name(model_name, ...)` method."
        )

    @classmethod
    @replace_list_option_in_docstrings(TOKENIZER_MAPPING_NAMES)
    def from_model_name(cls, model_name, *args, **kwargs):
        r"""
        Instantiate one of the tokenizer classes of the library by passing the required vocabulary file(s) directly.

        The tokenizer class to instantiate is selected based on the :obj:`model_name` passed as an argument.

        List options

        Params:
            model_name (:obj:`str`):
                The name of the model type whose tokenizer class should be instantiated. Should be one of the
                keys shown in bold above.
            use_fast (:obj:`bool`, `optional`, defaults to :obj:`True`):
                Whether or not to try to load the fast version of the tokenizer.
            args (additional positional arguments, `optional`):
                Will be passed to the Tokenizer ``__init__()`` method. Can be used to pass the required vocabulary
                files such as ``vocab_file`` or ``merges_file``.
            kwargs (additional keyword arguments, `optional`):
                Will be passed to the Tokenizer ``__init__()`` method. Can be used to pass the required vocabulary
                files such as ``vocab_file=/path/to/vocab_file.json`` and/or ``merges_file=/path/to/merges_file.txt``,
                as well as to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``,
                ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See the parameters of
                ``__init__()`` for more details.

        Examples::

            >>> from transformers import AutoTokenizer

            >>> # Instantiate a BERT-like tokenizer
            >>> tokenizer = AutoTokenizer.from_model_name("bert", vocab_file="./vocab.txt")

            >>> # Instantiate a GPT2-like tokenizer
            >>> tokenizer = AutoTokenizer.from_model_name("gpt2", vocab_file="./vocab.json", merges_file="./merges.txt")
        """
        use_fast = kwargs.pop("use_fast", True)

        tokenizer_class_name, tokenizer_class_name_fast = TOKENIZER_MAPPING_NAMES[model_name]

        if use_fast and tokenizer_class_name_fast is not None:
Contributor:

A small suggestion here: shouldn't we return an error message when the user explicitly requests a slow or fast version and it doesn't exist for the requested model type?

Contributor Author:

I think a warning might be a good idea! In AutoTokenizer.from_pretrained(..., use_fast=True) we always fall back to the slow version if the fast one doesn't exist, so I think we should keep the same design here. cc @sgugger @LysandreJik WDYT?

Collaborator:

Yes, since use_fast=True is the default, the user usually did not explicitly request the fast tokenizer, so there should not be a warning (warnings scare users).

Another design could be to have the parameter default to None, and when it's None do the behavior we have right now (try fast then slow). When it's an explicit boolean, we could then either issue a warning or an error message. An error message would probably be nicer and would allow us to get rid of some assert tokenizer.is_fast in the examples.

In all cases, we should have the same design in AutoTokenizer class methods obviously.

Wdyt?

Contributor Author:

I think it'd be a good idea to default to None and raise a warning/error if the boolean is explicitly set to True and there is no fast tokenizer. I'd opt for a warning though (maybe saying that this behavior will result in an error in the future), as an error would be a breaking change for AutoTokenizer.from_pretrained(...).
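
A minimal sketch of the use_fast=None design discussed above (illustrative only, not part of this PR's diff; it reuses the module-level TOKENIZER_MAPPING_NAMES and tokenizer_class_from_name helpers and assumes warnings is imported at module level):

    @classmethod
    def from_model_name(cls, model_name, *args, use_fast=None, **kwargs):
        # use_fast=None (the default) keeps the current behavior: prefer the fast
        # tokenizer when one exists, otherwise silently fall back to the slow one.
        tokenizer_class_name, tokenizer_class_name_fast = TOKENIZER_MAPPING_NAMES[model_name]

        if use_fast is not False and tokenizer_class_name_fast is not None:
            return tokenizer_class_from_name(tokenizer_class_name_fast)(*args, **kwargs)

        if use_fast is True:
            # The user explicitly asked for a fast tokenizer that does not exist for this
            # model type: warn for now (a stricter variant could raise instead).
            warnings.warn(
                f"No fast tokenizer is available for '{model_name}'; falling back to the slow implementation."
            )

        return tokenizer_class_from_name(tokenizer_class_name)(*args, **kwargs)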

            tokenizer_cls_fast = tokenizer_class_from_name(tokenizer_class_name_fast)
            return tokenizer_cls_fast(*args, **kwargs)

        tokenizer_cls = tokenizer_class_from_name(tokenizer_class_name)
        return tokenizer_cls(*args, **kwargs)

    @classmethod
    @replace_list_option_in_docstrings(TOKENIZER_MAPPING_NAMES)
    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
5 changes: 5 additions & 0 deletions tests/fixtures/merges.txt
@@ -0,0 +1,5 @@
#version: 0.2
Ġ l
Ġl o
Ġlo w
e r
1 change: 1 addition & 0 deletions tests/fixtures/vocab.json
@@ -0,0 +1 @@
{"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "s": 5, "t": 6, "i": 7, "d": 8, "n": 9, "Ġ": 10, "Ġl": 11, "Ġn": 12, "Ġlo": 13, "Ġlow": 14, "er": 15, "Ġlowest": 16, "Ġnewer": 17, "Ġwider": 18, "<unk>": 19, "<|endoftext|>": 20}
10 changes: 10 additions & 0 deletions tests/fixtures/vocab.txt
@@ -0,0 +1,10 @@
[PAD]
[SEP]
[MASK]
[CLS]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]
29 changes: 29 additions & 0 deletions tests/test_tokenization_auto.py
@@ -78,6 +78,35 @@ def test_tokenizer_from_tokenizer_class(self):
        self.assertIsInstance(tokenizer, (BertTokenizer, BertTokenizerFast))
        self.assertEqual(tokenizer.vocab_size, 12)

    def test_tokenizer_from_name(self):
        name = "bert"
        vocab_file = "./tests/fixtures/vocab.txt"
        tokenizer = AutoTokenizer.from_model_name(name, vocab_file=vocab_file, use_fast=False)
        self.assertIsInstance(tokenizer, BertTokenizer)
        self.assertEqual(tokenizer.vocab_size, 10)

        name = "gpt2"
        vocab_file = "./tests/fixtures/vocab.json"
        merges_file = "./tests/fixtures/merges.txt"
        tokenizer = AutoTokenizer.from_model_name(name, vocab_file=vocab_file, merges_file=merges_file, use_fast=False)
        self.assertIsInstance(tokenizer, GPT2Tokenizer)
        self.assertEqual(tokenizer.vocab_size, 21)

    @require_tokenizers
    def test_tokenizer_from_name_fast(self):
        name = "bert"
        vocab_file = "./tests/fixtures/vocab.txt"
        tokenizer = AutoTokenizer.from_model_name(name, vocab_file=vocab_file)
        self.assertIsInstance(tokenizer, BertTokenizerFast)
        self.assertEqual(tokenizer.vocab_size, 10)

        name = "gpt2"
        vocab_file = "./tests/fixtures/vocab.json"
        merges_file = "./tests/fixtures/merges.txt"
        tokenizer = AutoTokenizer.from_model_name(name, vocab_file=vocab_file, merges_file=merges_file)
        self.assertIsInstance(tokenizer, GPT2TokenizerFast)
        self.assertEqual(tokenizer.vocab_size, 21)

    @require_tokenizers
    def test_tokenizer_identifier_with_correct_config(self):
        for tokenizer_class in [BertTokenizer, BertTokenizerFast, AutoTokenizer]: