
PoC for a ProcessorMixin class #15549


Merged · 6 commits · Feb 9, 2022
18 changes: 15 additions & 3 deletions docs/source/main_classes/processors.mdx
@@ -12,10 +12,22 @@ specific language governing permissions and limitations under the License.

# Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.
Processors can mean two different things in the Transformers library:
Contributor: Nice!

Member: Thanks for reworking this part of the docs!

- the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD.

## Processors
## Multi-modal processors

Any multi-modal model requires an object to encode or decode data that combines several modalities (among text,
vision, and audio). This is handled by objects called processors, which group tokenizers (for the text modality) and
feature extractors (for vision and audio).

Those processors inherit from the following base class that implements the saving and loading functionality:

[[autodoc]] ProcessorMixin
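The save/load centralization described above can be pictured with a minimal, self-contained sketch. This is an illustration of the pattern only, not the actual `ProcessorMixin` implementation; the `ToyProcessorMixin` and dummy component classes below are invented for the example:

```python
# Toy illustration of the ProcessorMixin idea: the mixin stores its wrapped
# components under declared attribute names and delegates saving to each one.
# This is NOT the real transformers implementation, just the pattern.
saved_calls = []


class DummyFeatureExtractor:
    def save_pretrained(self, save_directory):
        saved_calls.append(("feature_extractor", save_directory))


class DummyTokenizer:
    def save_pretrained(self, save_directory):
        saved_calls.append(("tokenizer", save_directory))


class ToyProcessorMixin:
    # Subclasses declare which components they wrap.
    attributes = ["feature_extractor", "tokenizer"]

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def save_pretrained(self, save_directory):
        # Each wrapped object already knows how to serialize itself, so the
        # mixin only loops over the declared attributes and delegates.
        for attribute in self.attributes:
            getattr(self, attribute).save_pretrained(save_directory)


processor = ToyProcessorMixin(DummyFeatureExtractor(), DummyTokenizer())
processor.save_pretrained("./my_processor")
```

Because the delegation lives in one place, each concrete processor no longer needs its own `save_pretrained`.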

## Deprecated processors

All processors follow the same architecture, which is that of the
[`~data.processors.utils.DataProcessor`]. The processor returns a list of
6 changes: 4 additions & 2 deletions src/transformers/__init__.py
Expand Up @@ -95,7 +95,7 @@
"dependency_versions_table": [],
"dynamic_module_utils": [],
"feature_extraction_sequence_utils": ["SequenceFeatureExtractor"],
"feature_extraction_utils": ["BatchFeature"],
"feature_extraction_utils": ["BatchFeature", "FeatureExtractionMixin"],
"file_utils": [
"CONFIG_NAME",
"MODEL_CARD_NAME",
@@ -365,6 +365,7 @@
"ZeroShotClassificationPipeline",
"pipeline",
],
"processing_utils": ["ProcessorMixin"],
"testing_utils": [],
"tokenization_utils": ["PreTrainedTokenizer"],
"tokenization_utils_base": [
@@ -2297,7 +2298,7 @@
from .feature_extraction_sequence_utils import SequenceFeatureExtractor

# Feature Extractor
from .feature_extraction_utils import BatchFeature
from .feature_extraction_utils import BatchFeature, FeatureExtractionMixin

# Files and general utilities
from .file_utils import (
@@ -2545,6 +2546,7 @@
ZeroShotClassificationPipeline,
pipeline,
)
from .processing_utils import ProcessorMixin

# Tokenization
from .tokenization_utils import PreTrainedTokenizer
75 changes: 5 additions & 70 deletions src/transformers/models/clip/processing_clip.py
@@ -15,12 +15,11 @@
"""
Image/Text processor class for CLIP
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
from .feature_extraction_clip import CLIPFeatureExtractor
from .tokenization_clip import CLIPTokenizer


class CLIPProcessor:
class CLIPProcessor(ProcessorMixin):
r"""
Constructs a CLIP processor which wraps a CLIP feature extractor and a CLIP tokenizer into a single processor.

@@ -33,77 +32,13 @@ class CLIPProcessor:
tokenizer ([`CLIPTokenizer`]):
The tokenizer is a required input.
"""
feature_extractor_class = "CLIPFeatureExtractor"
tokenizer_class = "CLIPTokenizer"

def __init__(self, feature_extractor, tokenizer):
if not isinstance(feature_extractor, CLIPFeatureExtractor):
raise ValueError(
f"`feature_extractor` has to be of type CLIPFeatureExtractor, but is {type(feature_extractor)}"
)
if not isinstance(tokenizer, CLIPTokenizer):
raise ValueError(f"`tokenizer` has to be of type CLIPTokenizer, but is {type(tokenizer)}")

self.feature_extractor = feature_extractor
self.tokenizer = tokenizer
super().__init__(feature_extractor, tokenizer)
self.current_processor = self.feature_extractor

def save_pretrained(self, save_directory):
"""
Save a CLIP feature extractor object and CLIP tokenizer object to the directory `save_directory`, so that it
can be re-loaded using the [`~CLIPProcessor.from_pretrained`] class method.

<Tip>

This class method is simply calling [`~PreTrainedFeatureExtractor.save_pretrained`] and
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
above for more information.

</Tip>

Args:
save_directory (`str` or `os.PathLike`):
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist).
"""
self.feature_extractor._set_processor_class(self.__class__.__name__)
self.feature_extractor.save_pretrained(save_directory)

self.tokenizer._set_processor_class(self.__class__.__name__)
self.tokenizer.save_pretrained(save_directory)

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r"""
Instantiate a [`CLIPProcessor`] from a pretrained CLIP processor.

<Tip>

This class method is simply calling CLIPFeatureExtractor's [`~PreTrainedFeatureExtractor.from_pretrained`] and
CLIPTokenizer's [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the
docstrings of the methods above for more information.

</Tip>

Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:

- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `clip-vit-base-patch32`, or
namespaced under a user or organization name, like `openai/clip-vit-base-patch32`.
- a path to a *directory* containing a feature extractor file saved using the
[`~PreTrainedFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.

**kwargs
Additional keyword arguments passed along to both [`PreTrainedFeatureExtractor`] and
[`PreTrainedTokenizer`]
"""
feature_extractor = CLIPFeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)

return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)

def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
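With the component classes declared as string class attributes, `from_pretrained` can also be written once in the base class instead of once per processor. A hedged sketch of that mechanic follows; the registry dict stands in for the library's real class lookup, and `SketchProcessorMixin` and the dummy classes are invented for the example:

```python
# Sketch of centralizing from_pretrained: the subclass only names its
# component classes, and the base class resolves and instantiates them.
# COMPONENT_REGISTRY is a stand-in for transformers' internal class lookup.
COMPONENT_REGISTRY = {}


class DummyFeatureExtractor:
    @classmethod
    def from_pretrained(cls, path, **kwargs):
        return cls()


class DummyTokenizer:
    @classmethod
    def from_pretrained(cls, path, **kwargs):
        return cls()


COMPONENT_REGISTRY["DummyFeatureExtractor"] = DummyFeatureExtractor
COMPONENT_REGISTRY["DummyTokenizer"] = DummyTokenizer


class SketchProcessorMixin:
    feature_extractor_class = None
    tokenizer_class = None

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    @classmethod
    def from_pretrained(cls, path, **kwargs):
        # Resolve the declared class names, then instantiate each component
        # from the same pretrained identifier.
        fe_cls = COMPONENT_REGISTRY[cls.feature_extractor_class]
        tok_cls = COMPONENT_REGISTRY[cls.tokenizer_class]
        return cls(fe_cls.from_pretrained(path, **kwargs), tok_cls.from_pretrained(path, **kwargs))


class SketchCLIPProcessor(SketchProcessorMixin):
    feature_extractor_class = "DummyFeatureExtractor"
    tokenizer_class = "DummyTokenizer"


processor = SketchCLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```

This is why the diff above can delete the entire hand-written `from_pretrained` and `save_pretrained` bodies and keep only the two class attributes.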
86 changes: 4 additions & 82 deletions src/transformers/models/layoutlmv2/processing_layoutlmv2.py
@@ -18,13 +18,11 @@
from typing import List, Optional, Union

from ...file_utils import TensorType
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from .feature_extraction_layoutlmv2 import LayoutLMv2FeatureExtractor
from .tokenization_layoutlmv2 import LayoutLMv2Tokenizer
from .tokenization_layoutlmv2_fast import LayoutLMv2TokenizerFast


class LayoutLMv2Processor:
class LayoutLMv2Processor(ProcessorMixin):
r"""
Constructs a LayoutLMv2 processor which combines a LayoutLMv2 feature extractor and a LayoutLMv2 tokenizer into a
single processor.
@@ -43,84 +41,8 @@ class LayoutLMv2Processor:
tokenizer (`LayoutLMv2Tokenizer` or `LayoutLMv2TokenizerFast`):
An instance of [`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]. The tokenizer is a required input.
"""

def __init__(self, feature_extractor, tokenizer):
if not isinstance(feature_extractor, LayoutLMv2FeatureExtractor):
raise ValueError(
f"`feature_extractor` has to be of type {LayoutLMv2FeatureExtractor.__class__}, but is {type(feature_extractor)}"
)
if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
raise ValueError(
f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
)

self.feature_extractor = feature_extractor
self.tokenizer = tokenizer

def save_pretrained(self, save_directory):
"""
Save a LayoutLMv2 feature_extractor object and LayoutLMv2 tokenizer object to the directory `save_directory`,
so that it can be re-loaded using the [`~LayoutLMv2Processor.from_pretrained`] class method.

<Tip>

This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
above for more information.

</Tip>

Args:
save_directory (`str` or `os.PathLike`):
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist).
"""
self.feature_extractor._set_processor_class(self.__class__.__name__)
self.feature_extractor.save_pretrained(save_directory)

self.tokenizer._set_processor_class(self.__class__.__name__)
self.tokenizer.save_pretrained(save_directory)

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, use_fast=True, **kwargs):
r"""
Instantiate a [`LayoutLMv2Processor`] from a pretrained LayoutLMv2 processor.

<Tip>

This class method is simply calling LayoutLMv2FeatureExtractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`] and LayoutLMv2TokenizerFast's
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the docstrings of the methods
above for more information.

</Tip>

Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:

- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a feature extractor file saved using the
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.

use_fast (`bool`, *optional*, defaults to `True`):
Whether or not to instantiate a fast tokenizer.

**kwargs
Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
[`PreTrainedTokenizer`]
"""
feature_extractor = LayoutLMv2FeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
if use_fast:
tokenizer = LayoutLMv2TokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
else:
tokenizer = LayoutLMv2Tokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)

return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
feature_extractor_class = "LayoutLMv2FeatureExtractor"
tokenizer_class = ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast")

def __call__(
self,
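Note the tuple `tokenizer_class = ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast")`: this processor accepts either the slow or the fast tokenizer. One plausible way the base class can resolve such a tuple, replacing the old `use_fast` branch in `from_pretrained`, is sketched below (`resolve_tokenizer_class` is a hypothetical helper, not the actual library code):

```python
def resolve_tokenizer_class(tokenizer_class, use_fast=True):
    """Pick a concrete class name from either a single name or a
    (slow, fast) tuple, preferring the fast variant when requested."""
    if isinstance(tokenizer_class, tuple):
        slow, fast = tokenizer_class
        # Fall back to the slow tokenizer when the fast one is unwanted
        # or unavailable.
        return fast if use_fast and fast is not None else slow
    return tokenizer_class
```

For example, `resolve_tokenizer_class(("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast"))` picks the fast variant, while passing `use_fast=False` picks the slow one.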
87 changes: 4 additions & 83 deletions src/transformers/models/layoutxlm/processing_layoutxlm.py
@@ -17,15 +17,12 @@
"""
from typing import List, Optional, Union

from transformers.models.layoutlmv2.feature_extraction_layoutlmv2 import LayoutLMv2FeatureExtractor

from ...file_utils import TensorType
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from .tokenization_layoutxlm import LayoutXLMTokenizer
from .tokenization_layoutxlm_fast import LayoutXLMTokenizerFast


class LayoutXLMProcessor:
class LayoutXLMProcessor(ProcessorMixin):
r"""
Constructs a LayoutXLM processor which combines a LayoutXLM feature extractor and a LayoutXLM tokenizer into a
single processor.
@@ -44,84 +41,8 @@ class LayoutXLMProcessor:
tokenizer (`LayoutXLMTokenizer` or `LayoutXLMTokenizerFast`):
An instance of [`LayoutXLMTokenizer`] or [`LayoutXLMTokenizerFast`]. The tokenizer is a required input.
"""

def __init__(self, feature_extractor, tokenizer):
if not isinstance(feature_extractor, LayoutLMv2FeatureExtractor):
raise ValueError(
f"`feature_extractor` has to be of type {LayoutLMv2FeatureExtractor.__class__}, but is {type(feature_extractor)}"
)
if not isinstance(tokenizer, (LayoutXLMTokenizer, LayoutXLMTokenizerFast)):
raise ValueError(
f"`tokenizer` has to be of type {LayoutXLMTokenizer.__class__} or {LayoutXLMTokenizerFast.__class__}, but is {type(tokenizer)}"
)

self.feature_extractor = feature_extractor
self.tokenizer = tokenizer

def save_pretrained(self, save_directory):
"""
Save a LayoutXLM feature_extractor object and LayoutXLM tokenizer object to the directory `save_directory`, so
that it can be re-loaded using the [`~LayoutXLMProcessor.from_pretrained`] class method.

<Tip>

This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
above for more information.

</Tip>

Args:
save_directory (`str` or `os.PathLike`):
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist).
"""
self.feature_extractor._set_processor_class(self.__class__.__name__)
self.feature_extractor.save_pretrained(save_directory)

self.tokenizer._set_processor_class(self.__class__.__name__)
self.tokenizer.save_pretrained(save_directory)

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, use_fast=True, **kwargs):
r"""
Instantiate a [`LayoutXLMProcessor`] from a pretrained LayoutXLM processor.

<Tip>

This class method is simply calling Layoutv2FeatureExtractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`] and LayoutXLMTokenizerFast's
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the docstrings of the methods
above for more information.

</Tip>

Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:

- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a feature extractor file saved using the
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.

use_fast (`bool`, *optional*, defaults to `True`):
Whether or not to instantiate a fast tokenizer.

**kwargs
Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
[`PreTrainedTokenizer`]
"""
feature_extractor = LayoutLMv2FeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
if use_fast:
tokenizer = LayoutXLMTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
else:
tokenizer = LayoutXLMTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)

return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
feature_extractor_class = "LayoutLMv2FeatureExtractor"
tokenizer_class = ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast")

def __call__(
self,
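The `isinstance` checks deleted from each processor's `__init__` do not simply disappear: the shared base class can perform equivalent validation driven by the declared class-name attributes. A hedged sketch of such a check (`check_component_type` is an invented helper name, not the library's actual code):

```python
def check_component_type(obj, expected_class_names):
    # Accept either a single class name or a tuple of acceptable names,
    # mirroring tokenizer_class = ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast").
    names = expected_class_names if isinstance(expected_class_names, tuple) else (expected_class_names,)
    if type(obj).__name__ not in names:
        raise ValueError(f"Received a {type(obj).__name__}, but expected one of {names}")


class LayoutXLMTokenizerFast:  # dummy stand-in for the real tokenizer class
    pass


# Passes silently: the instance's class name is in the accepted tuple.
check_component_type(LayoutXLMTokenizerFast(), ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast"))
```

Validating by class name rather than by imported class also avoids each processing module importing its concrete tokenizer and feature-extractor classes at module load time, which is consistent with the imports removed in the diffs above.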