
T5 Tokenizer requires protobuf package #25753

Closed
@sanchit-gandhi

Description

System Info

  • transformers version: 4.32.0.dev0
  • Platform: macOS-13.5.1-arm64-arm-64bit
  • Python version: 3.9.13
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker @sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Ensure protobuf is uninstalled:
pip uninstall protobuf
  2. Import the T5Tokenizer:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

Traceback:

UnboundLocalError                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = T5Tokenizer.from_pretrained("t5-base")

File ~/transformers/src/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851     else:
   1852         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~/transformers/src/transformers/tokenization_utils_base.py:2017, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2015 # Instantiate tokenizer.
   2016 try:
-> 2017     tokenizer = cls(*init_inputs, **init_kwargs)
   2018 except OSError:
   2019     raise OSError(
   2020         "Unable to load vocabulary from file. "
   2021         "Please check that the provided vocabulary is accessible and not corrupted."
   2022     )

File ~/transformers/src/transformers/models/t5/tokenization_t5.py:194, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, **kwargs)
    191 self.vocab_file = vocab_file
    192 self._extra_ids = extra_ids
--> 194 self.sp_model = self.get_spm_processor()

File ~/transformers/src/transformers/models/t5/tokenization_t5.py:200, in T5Tokenizer.get_spm_processor(self)
    198 with open(self.vocab_file, "rb") as f:
    199     sp_model = f.read()
--> 200     model_pb2 = import_protobuf()
    201     model = model_pb2.ModelProto.FromString(sp_model)
    202     if not self.legacy:

File ~/transformers/src/transformers/convert_slow_tokenizer.py:40, in import_protobuf()
     38     else:
     39         from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
---> 40 return sentencepiece_model_pb2

UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment

This occurs because import_protobuf is called in the tokenizer's __init__:

model_pb2 = import_protobuf()

But import_protobuf does not handle the case where protobuf is unavailable:

def import_protobuf():
    if is_protobuf_available():
        import google.protobuf

        if version.parse(google.protobuf.__version__) < version.parse("4.0.0"):
            from transformers.utils import sentencepiece_model_pb2
        else:
            from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
    return sentencepiece_model_pb2

=> if protobuf is not installed, the if-branch is skipped, sentencepiece_model_pb2 is never bound, and the return statement raises the UnboundLocalError shown above
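The failure mode can be reproduced in isolation. The following standalone sketch (not transformers code; load_module is a hypothetical stand-in for import_protobuf) shows why a name bound only inside an if-branch triggers an UnboundLocalError when the branch is skipped:

```python
def load_module(available: bool):
    # The name is bound only when the branch runs, mirroring import_protobuf.
    if available:
        module = "sentencepiece_model_pb2"  # stands in for the imported module
    return module  # UnboundLocalError when available is False


print(load_module(True))  # prints "sentencepiece_model_pb2"
try:
    load_module(False)
except UnboundLocalError as exc:
    print(type(exc).__name__)  # prints "UnboundLocalError"
```

Because Python decides at compile time that `module` is local to the function, the failing path raises UnboundLocalError rather than NameError.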

Has protobuf inadvertently been made a hard dependency for T5Tokenizer in #24622? Or can sentencepiece_model_pb2 be defined without protobuf?
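One possible repair is to make every code path either return or raise, so a missing protobuf produces an actionable ImportError instead of an UnboundLocalError. This is a sketch only, not the actual fix: is_protobuf_available is stubbed here via importlib rather than imported from transformers.utils, and descriptor_pb2 is a placeholder for the vendored pb2 module:

```python
import importlib.util


def is_protobuf_available():
    # Stand-in for transformers.utils.is_protobuf_available
    return importlib.util.find_spec("google.protobuf") is not None


def import_protobuf(available=None):
    # Guarded variant: the return sits inside the branch, and the
    # unavailable path raises with an install hint instead of falling
    # through to an unbound name.
    if available is None:
        available = is_protobuf_available()
    if available:
        from google.protobuf import descriptor_pb2  # placeholder for the vendored pb2 module

        return descriptor_pb2
    raise ImportError(
        "protobuf is required to load this tokenizer; run `pip install protobuf`."
    )
```

With this shape, T5Tokenizer.from_pretrained would fail fast with a clear message when protobuf is absent rather than crashing deep inside get_spm_processor.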

Expected behavior

Be able to use T5Tokenizer without protobuf installed.
