PoC for a ProcessorMixin class #15549
Conversation
The documentation is not available anymore as the PR was closed or merged.
Very cool! This looks good to me.
@@ -12,10 +12,22 @@ specific language governing permissions and limitations under the License.

# Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.
Processors can mean two different things in the Transformers library:
Nice!
Thanks for reworking this part of the docs!
Looks good to me!
proper_class = getattr(transformers_module, class_name)

if not isinstance(arg, proper_class):
    raise ValueError(
Great error message
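As a rough illustration of the message being praised here (wording paraphrased, not the PR's verbatim string), the check boils down to:

if not isinstance(arg, proper_class):
    raise ValueError(
        f"Received a {type(arg).__name__} for argument {attribute_name}, "
        f"but a {proper_class.__name__} was expected."
    )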
setattr(self, attribute_name, arg)

def __repr__(self):
cool!
# Include the processor class in the attribute config so this processor can then be reloaded with the
# `AutoProcessor` API.
if hasattr(attribute, "_set_processor_class"):
    attribute._set_processor_class(self.__class__.__name__)
Should we add a test to make sure that every tokenizer & feature extractor has this function in a follow-up PR?
It is defined by the base classes (FeatureExtractorMixin and PreTrainedTokenizerBase) so I don't think it's necessary.
ah yeah true - this makes sense!
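For context, the hook on those base classes essentially just records the owning processor's class name so it ends up in the saved config; a paraphrased sketch of the idea (not the exact library code):

class FeatureExtractorMixin:
    _processor_class = None

    def _set_processor_class(self, processor_class: str):
        # Remember which processor wraps this feature extractor so that
        # AutoProcessor.from_pretrained can later instantiate the right class.
        self._processor_class = processor_class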
## Multi-modal processors

Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
Suggested change:
- vision and speech). This is handled by objects called processors, which group tokenizer (for the text modaility) and
+ vision and speech). This is handled by objects called processors, which group tokenizers (for the text modality) and
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
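To make the documented idea concrete, a multi-modal processor such as CLIP's wraps a tokenizer and a feature extractor behind a single call; a minimal usage sketch (the checkpoint name and image path are just placeholders for the example):

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.png")  # any local image
# Text is routed to the tokenizer, images to the feature extractor.
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt")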
def __init__(self, *args, **kwargs):
    # Sanitize args and kwargs
    for key in kwargs:
        if key not in self.attributes:
            raise TypeError(f"Unexpected keyword argument {key}.")
    for arg, attribute_name in zip(args, self.attributes):
        if attribute_name in kwargs:
            raise TypeError(f"Got multiple values for argument {attribute_name}.")
        else:
            kwargs[attribute_name] = arg

    if len(kwargs) != len(self.attributes):
That's smart!
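In practice this means a subclass can receive its attributes positionally or by keyword and both paths end up identical; a detached, runnable reproduction of the pattern (names and error wording are illustrative, not the PR's exact code):

class TinyMixin:
    # Minimal stand-in for the argument normalization shown above.
    attributes = ["feature_extractor", "tokenizer"]

    def __init__(self, *args, **kwargs):
        for key in kwargs:
            if key not in self.attributes:
                raise TypeError(f"Unexpected keyword argument {key}.")
        for arg, attribute_name in zip(args, self.attributes):
            if attribute_name in kwargs:
                raise TypeError(f"Got multiple values for argument {attribute_name}.")
            kwargs[attribute_name] = arg
        if len(kwargs) != len(self.attributes):
            raise TypeError("Missing required attributes.")
        for attribute_name, arg in kwargs.items():
            setattr(self, attribute_name, arg)


# Positional and keyword calls normalize to the same attributes.
p1 = TinyMixin("fe", "tok")
p2 = TinyMixin(tokenizer="tok", feature_extractor="fe")
assert p1.tokenizer == p2.tokenizer == "tok"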
* PoC for a ProcessorMixin class
* Documentation
* Apply suggestions from code review
* Roll out to other processors
* Add base feature extractor class in init
* Use args and kwargs

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
from pathlib import Path


# Comment to write
Why did you feel the need to duplicate the transformers module in memory? This code executes it again, and insists on how it is loaded, for no obvious reason. I don't see how this differs from import transformers as transformers_module. Can you explain?
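For reference, the plain alternative the reviewer has in mind simply reuses the already-imported package rather than re-executing its __init__.py from disk, e.g.:

# Reuse the package that is already imported instead of loading it a second time.
import transformers as transformers_module

# Class lookups by name, as done elsewhere in the PR, work the same way:
tokenizer_class = getattr(transformers_module, "BertTokenizer")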
What does this PR do?

This PR refactors some common saving/loading functionality of Processors behind a ProcessorMixin class, which implements the save_pretrained and from_pretrained methods.

It's designed to work with main attributes (by default feature_extractor and tokenizer, but that list can be changed or extended in subclasses) that are then required by the init and on which save_pretrained / from_pretrained is called when saving or loading the processor.

It also handles having several classes for a given tokenizer (fast/not fast), so that feature is handled automatically. I showed how this works by refactoring two processors (I did not put the fast tokenizer in CLIP because there are some problems with it, but adding it is as easy as changing the tokenizer_class).

I will refactor the other processors if the design suits everyone.
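As a rough sketch of how a subclass would plug into this design (the class and checkpoint names here are illustrative, and the import path is assumed, not taken from the PR):

from transformers import ProcessorMixin


class MyProcessor(ProcessorMixin):
    # Main attributes the mixin saves/loads and requires in __init__
    # (the defaults described above).
    attributes = ["feature_extractor", "tokenizer"]
    # Classes are named as strings so the mixin can resolve them, including
    # picking the fast tokenizer variant when one is available.
    feature_extractor_class = "CLIPFeatureExtractor"
    tokenizer_class = "CLIPTokenizer"


# save_pretrained / from_pretrained then round-trip both attributes:
# processor = MyProcessor(feature_extractor, tokenizer)
# processor.save_pretrained("my-processor")
# reloaded = MyProcessor.from_pretrained("my-processor")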