
PoC for a ProcessorMixin class #15549

Merged · 6 commits merged into master on Feb 9, 2022

Conversation

@sgugger (Collaborator) commented Feb 7, 2022

What does this PR do?

This PR refactors some common saving/loading functionality of processors behind a ProcessorMixin class, which implements the save_pretrained and from_pretrained methods.

It's designed around a list of main attributes (by default feature_extractor and tokenizer, but that list can be changed or extended in subclasses) that are required by the init and on which save_pretrained/from_pretrained is called when saving or loading the processor.

It also handles tokenizers that come in several classes (fast/not fast), so that feature is supported automatically. I showed how this works by refactoring two processors (I did not add the fast tokenizer to CLIP because there are some problems with it, but adding it is as easy as changing the tokenizer_class).
I will refactor the other processors if the design suits everyone.
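
To make the design concrete, here is a minimal sketch (illustrative, not the PR's exact code; the class name and the attribute class values are assumptions) of what a processor subclass looks like once the shared logic lives in the mixin:

```python
# Minimal sketch of a processor built on ProcessorMixin. The class name and the
# feature extractor / tokenizer class names below are illustrative, not from the PR.
from transformers.processing_utils import ProcessorMixin


class MyProcessor(ProcessorMixin):
    # Main attributes managed by the mixin: they are required by __init__, and
    # save_pretrained/from_pretrained is called on each of them when the processor
    # is saved or loaded.
    attributes = ["feature_extractor", "tokenizer"]

    # Names of the classes used to instantiate each attribute in from_pretrained;
    # they are resolved on the transformers module at runtime.
    feature_extractor_class = "Wav2Vec2FeatureExtractor"
    tokenizer_class = "Wav2Vec2CTCTokenizer"
```

With that in place, MyProcessor.from_pretrained(checkpoint) would load both attributes and pass them to the init, and save_pretrained would save them side by side.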

@HuggingFaceDocBuilder commented Feb 7, 2022

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj (Contributor) left a comment

Very cool! This looks good to me.

@@ -12,10 +12,22 @@ specific language governing permissions and limitations under the License.

# Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.
Processors can mean two different things in the Transformers library:
Contributor: Nice!

Member: Thanks for reworking this part of the docs!

@LysandreJik (Member) left a comment

Looks good to me!


proper_class = getattr(transformers_module, class_name)

if not isinstance(arg, proper_class):
    raise ValueError(
Contributor: Great error message
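
For context, a rough sketch of the check this fragment performs (the error wording is hypothetical, since the actual message is truncated in the excerpt above):

```python
import transformers as transformers_module


def check_attribute_type(arg, class_name: str):
    # Resolve the declared class name (e.g. "Wav2Vec2FeatureExtractor") on the
    # transformers module, then verify the object passed to the processor is an
    # instance of it.
    proper_class = getattr(transformers_module, class_name)
    if not isinstance(arg, proper_class):
        # Hypothetical wording; the PR's real error message is not shown in the excerpt.
        raise ValueError(f"Expected an instance of {class_name}, got {type(arg).__name__} instead.")
```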


    setattr(self, attribute_name, arg)

def __repr__(self):
Contributor: cool!

# Include the processor class in the attribute config so this processor can then be reloaded with the
# `AutoProcessor` API.
if hasattr(attribute, "_set_processor_class"):
    attribute._set_processor_class(self.__class__.__name__)
Contributor: Should we add a test to make sure that every tokenizer & feature extractor has this function in a follow up PR?

@sgugger (Collaborator, Author): It is defined by the base classes (FeatureExtractorMixin and PreTrainedTokenizerBase) so I don't think it's necessary.

Contributor: ah yeah true - this makes sense!
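
For reference, the hook being discussed is a small setter defined once on those base classes; roughly (a simplified sketch, not the exact library code):

```python
class FeatureExtractorMixin:
    # Simplified sketch: the real mixin contains much more than this.
    _processor_class = None

    def _set_processor_class(self, processor_class: str):
        # Remember which processor owns this attribute; the value is written out with
        # the attribute's config so AutoProcessor can later reload the right processor.
        self._processor_class = processor_class
```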

## Multi-modal processors

Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
@NielsRogge (Contributor) commented Feb 8, 2022

Suggested change:
vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
vision and speech). This is handled by objects called processors, which group tokenizers (for the text modality) and

sgugger and others added 4 commits February 8, 2022 09:57
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Comment on lines +47 to +58
def __init__(self, *args, **kwargs):
    # Sanitize args and kwargs
    for key in kwargs:
        if key not in self.attributes:
            raise TypeError(f"Unexpected keyword argument {key}.")
    for arg, attribute_name in zip(args, self.attributes):
        if attribute_name in kwargs:
            raise TypeError(f"Got multiple values for argument {attribute_name}.")
        else:
            kwargs[attribute_name] = arg

    if len(kwargs) != len(self.attributes):
Member: That's smart!
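
To see what the sanitization buys, here is the same logic rerun in isolation (extracted for illustration, not library code), with the call patterns it accepts and rejects:

```python
attributes = ["feature_extractor", "tokenizer"]


def sanitize(*args, **kwargs):
    # Same idea as above: reject unknown names, map positional args onto `attributes`
    # in order, and reject an argument given both positionally and by keyword.
    for key in kwargs:
        if key not in attributes:
            raise TypeError(f"Unexpected keyword argument {key}.")
    for arg, attribute_name in zip(args, attributes):
        if attribute_name in kwargs:
            raise TypeError(f"Got multiple values for argument {attribute_name}.")
        kwargs[attribute_name] = arg
    return kwargs


sanitize("fe", "tok")                      # {'feature_extractor': 'fe', 'tokenizer': 'tok'}
sanitize("fe", tokenizer="tok")            # same mapping, mixing positional and keyword
# sanitize("fe", feature_extractor="fe")   # TypeError: Got multiple values for argument feature_extractor.
# sanitize("fe", "tok", processor="p")     # TypeError: Unexpected keyword argument processor.
```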

@sgugger merged commit b5c6fde into master on Feb 9, 2022
@sgugger deleted the processor_mixin branch on February 9, 2022 at 14:24
@NielsRogge mentioned this pull request on Feb 14, 2022
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Feb 18, 2022
* PoC for a ProcessorMixin class

* Documentation

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Roll out to other processors

* Add base feature extractor class in init

* Use args and kwargs

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
from pathlib import Path



Why did you find the need to duplicate the transformers module in memory? This code executes it again, and insists on how it is loaded, for no obvious reason.

I do not see what the difference is compared to `import transformers as transformers_module`; can you explain?
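
For readers following along, a rough sketch of the distinction being asked about (this illustrates the general pattern of re-loading a package from its file location, not necessarily the PR's exact code):

```python
import importlib.util
from pathlib import Path

import transformers  # a plain import: returns the cached module from sys.modules

# Build a second module object from the package's source and execute its __init__ again.
source = Path(transformers.__file__).parent
spec = importlib.util.spec_from_file_location(
    "transformers", source / "__init__.py", submodule_search_locations=[str(source)]
)
transformers_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(transformers_module)  # runs transformers/__init__.py a second time

assert transformers_module is not transformers  # two module objects now live in memory
```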
