
PoC for a ProcessorMixin class #15549

Merged · 6 commits merged into master on Feb 9, 2022

Conversation

@sgugger (Collaborator) commented Feb 7, 2022

What does this PR do?

This PR refactors some common saving/loading functionality of processors behind a ProcessorMixin class, which implements the save_pretrained and from_pretrained methods.

It's designed around a list of main attributes (by default feature_extractor and tokenizer, but that list can be changed or extended in subclasses) that are required by the init and on which save_pretrained/from_pretrained is called when saving or loading the processor.

It also handles tokenizers that come in several classes (fast/not fast), so that feature is supported automatically. I showed how this works by refactoring two processors (I did not add the fast tokenizer to CLIP because there are some problems with it, but adding it is as easy as changing the tokenizer_class).
I will refactor the other processors if the design suits everyone.
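
To make the design concrete, here is a minimal sketch (illustrative, not the PR's exact code; the class name and the attribute class values are assumptions) of what a processor subclass looks like once the shared logic lives in the mixin:

```python
# Minimal sketch of a processor built on ProcessorMixin. The class name and the
# feature extractor / tokenizer class names below are illustrative, not from the PR.
from transformers.processing_utils import ProcessorMixin


class MyProcessor(ProcessorMixin):
    # Main attributes managed by the mixin: they are required by __init__, and
    # save_pretrained/from_pretrained is called on each of them when the processor
    # is saved or loaded.
    attributes = ["feature_extractor", "tokenizer"]

    # Names of the classes used to instantiate each attribute in from_pretrained;
    # they are resolved on the transformers module at runtime.
    feature_extractor_class = "Wav2Vec2FeatureExtractor"
    tokenizer_class = "Wav2Vec2CTCTokenizer"
```

With that in place, MyProcessor.from_pretrained(checkpoint) would load both attributes and pass them to the init, and save_pretrained would save them side by side.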

@HuggingFaceDocBuilder commented Feb 7, 2022

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj (Contributor) left a comment

Very cool! This looks good to me.

@@ -12,10 +12,22 @@ specific language governing permissions and limitations under the License.

# Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.
Processors can mean two different things in the Transformers library:
Contributor: Nice!

Member: Thanks for reworking this part of the docs!

@LysandreJik (Member) left a comment

Looks good to me!


proper_class = getattr(transformers_module, class_name)

if not isinstance(arg, proper_class):
    raise ValueError(
Contributor: Great error message
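
For context, a rough sketch of the check this fragment performs (the error wording is hypothetical, since the actual message is truncated in the excerpt above):

```python
import transformers as transformers_module


def check_attribute_type(arg, class_name: str):
    # Resolve the declared class name (e.g. "Wav2Vec2FeatureExtractor") on the
    # transformers module, then verify the object passed to the processor is an
    # instance of it.
    proper_class = getattr(transformers_module, class_name)
    if not isinstance(arg, proper_class):
        # Hypothetical wording; the PR's real error message is not shown in the excerpt.
        raise ValueError(f"Expected an instance of {class_name}, got {type(arg).__name__} instead.")
```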


    setattr(self, attribute_name, arg)

def __repr__(self):
Contributor: cool!

# Include the processor class in the attribute config so this processor can then be reloaded with the
# `AutoProcessor` API.
if hasattr(attribute, "_set_processor_class"):
    attribute._set_processor_class(self.__class__.__name__)
Contributor: Should we add a test to make sure that every tokenizer & feature extractor has this function in a follow up PR?

@sgugger (Collaborator, Author): It is defined by the base classes (FeatureExtractorMixin and PreTrainedTokenizerBase) so I don't think it's necessary.

Contributor: ah yeah true - this makes sense!
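
For reference, the hook being discussed is a small setter defined once on those base classes; roughly (a simplified sketch, not the exact library code):

```python
class FeatureExtractorMixin:
    # Simplified sketch: the real mixin contains much more than this.
    _processor_class = None

    def _set_processor_class(self, processor_class: str):
        # Remember which processor owns this attribute; the value is written out with
        # the attribute's config so AutoProcessor can later reload the right processor.
        self._processor_class = processor_class
```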

## Multi-modal processors

Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
@NielsRogge (Contributor) commented Feb 8, 2022

Suggested change:
vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
vision and speech). This is handled by objects called processors, which group tokenizers (for the text modality) and

sgugger and others added 4 commits February 8, 2022 09:57
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Comment on lines +47 to +58
def __init__(self, *args, **kwargs):
    # Sanitize args and kwargs
    for key in kwargs:
        if key not in self.attributes:
            raise TypeError(f"Unexpected keyword argument {key}.")
    for arg, attribute_name in zip(args, self.attributes):
        if attribute_name in kwargs:
            raise TypeError(f"Got multiple values for argument {attribute_name}.")
        else:
            kwargs[attribute_name] = arg

    if len(kwargs) != len(self.attributes):
Member: That's smart!
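
To see what the sanitization buys, here is the same logic rerun in isolation (extracted for illustration, not library code), with the call patterns it accepts and rejects:

```python
attributes = ["feature_extractor", "tokenizer"]


def sanitize(*args, **kwargs):
    # Same idea as above: reject unknown names, map positional args onto `attributes`
    # in order, and reject an argument given both positionally and by keyword.
    for key in kwargs:
        if key not in attributes:
            raise TypeError(f"Unexpected keyword argument {key}.")
    for arg, attribute_name in zip(args, attributes):
        if attribute_name in kwargs:
            raise TypeError(f"Got multiple values for argument {attribute_name}.")
        kwargs[attribute_name] = arg
    return kwargs


sanitize("fe", "tok")                      # {'feature_extractor': 'fe', 'tokenizer': 'tok'}
sanitize("fe", tokenizer="tok")            # same mapping, mixing positional and keyword
# sanitize("fe", feature_extractor="fe")   # TypeError: Got multiple values for argument feature_extractor.
# sanitize("fe", "tok", processor="p")     # TypeError: Unexpected keyword argument processor.
```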

@sgugger merged commit b5c6fde into master on Feb 9, 2022
@sgugger deleted the processor_mixin branch on February 9, 2022 at 14:24
@NielsRogge mentioned this pull request on Feb 14, 2022
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Feb 18, 2022
* PoC for a ProcessorMixin class

* Documentation

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Roll out to other processors

* Add base feature extractor class in init

* Use args and kwargs

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
from pathlib import Path



Why did you find the need to duplicate the transformers module in memory? This code executes it again, and insists on how it is loaded, for no obvious reason.

I do not see what the difference is compared to `import transformers as transformers_module`; can you explain?
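
For readers following along, a rough sketch of the distinction being asked about (this illustrates the general pattern of re-loading a package from its file location, not necessarily the PR's exact code):

```python
import importlib.util
from pathlib import Path

import transformers  # a plain import: returns the cached module from sys.modules

# Build a second module object from the package's source and execute its __init__ again.
source = Path(transformers.__file__).parent
spec = importlib.util.spec_from_file_location(
    "transformers", source / "__init__.py", submodule_search_locations=[str(source)]
)
transformers_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(transformers_module)  # runs transformers/__init__.py a second time

assert transformers_module is not transformers  # two module objects now live in memory
```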
