Add TrOCR + VisionEncoderDecoderModel #13874
Conversation
@@ -476,6 +476,7 @@ class PreTrainedModel
    def forward(
        self,
        pixel_values=None,
        attention_mask=None,
This isn't used anywhere, no? Is it just here because attention_mask is often passed in generate()?
src/transformers/models/vision_encoder_decoder/configuration_vision_encoder_decoder.py
src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
PR looks great to me! Really clean implementation that fits well with the current design of Transformers IMO - it should enable lots of image captioning tasks from pretrained ViT + BERT :-)
Also very nice that you didn't have to adapt generate() to make it work!
Just have the following points:
- Do we need this attention_mask here: Add TrOCR + VisionEncoderDecoderModel #13874 (comment)? I didn't dive too deep into the code, but if we need it just because generate(...) passes it then it's a bit hacky and we should try to avoid it. Happy to see how we can change generate for this, or at least add a kwargs(...) argument that logs that the input is not used.
- I feel quite strongly about not calling it encoder_hidden_size, but rather cross_attention_hidden_size. For a user that just looks at configuration_utils.py, the name encoder_hidden_size is not at all related to encoder-decoder architectures. Can we change that maybe?
- Can we add one slow integration test with a real model (maybe the one in your notebook)?
Overall amazing work! Think we can merge this in a couple of days :-)
Overall amazing work! Think we can merge this in a couple of days :-)
Thanks a lot for adding this!
emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
if embedding_dim % 2 == 1:
    # zero pad
This comment should be expanded or removed.
This looks great. Thanks for working on this, @NielsRogge!
Build sinusoidal embeddings. This matches the implementation in tensor2tensor, but differs slightly from the
description in Section 3.5 of "Attention Is All You Need".
How does it differ?
@patrickvonplaten knows, I just copied his implementation
The same docstring is used in Speech2Text2
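For context, here is a runnable sketch of the tensor2tensor/fairseq-style construction quoted above (the function name and scaffolding are mine, not the PR's code). The "slight difference" the docstring refers to is that this version concatenates all sines followed by all cosines along the feature dimension, whereas Section 3.5 of the paper interleaves sin and cos per frequency:

```python
import math
import torch

def sinusoidal_embeddings(num_embeddings: int, embedding_dim: int) -> torch.Tensor:
    # Geometric progression of inverse frequencies, as in tensor2tensor/fairseq
    half_dim = embedding_dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
    # Outer product: one row per position, one column per frequency
    emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
    # All sines first, then all cosines (the paper interleaves them instead)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
    if embedding_dim % 2 == 1:
        # zero-pad the last column when embedding_dim is odd
        emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
    return emb
```

At position 0 this yields zeros in the first half (sin 0) and ones in the second half (cos 0), rather than alternating 0, 1, 0, 1 as in the paper's interleaved layout.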
What does this PR do?
This PR adds the TrOCR models by Microsoft, together with a new VisionEncoderDecoderModel class (which should be used in order to use TrOCR, as it consists of an image encoder and an autoregressive text decoder). This PR is very similar to #13186; it's just the vision counterpart. Here's how to use this model:
There's also this Colab notebook for quick inference: https://colab.research.google.com/drive/1qCuqlqc4V9LZhPkxIi_XqCCTQrDhkmHi?usp=sharing
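As an illustration of the new class, a minimal sketch using tiny, randomly initialized configs of my own choosing (for real use, one would instead load a released TrOCR checkpoint via from_pretrained, as in the notebook above):

```python
import torch
from transformers import (
    TrOCRConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny illustrative sizes, not the released TrOCR checkpoints
encoder_config = ViTConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=4,
    intermediate_size=64, image_size=32, patch_size=8,
)
decoder_config = TrOCRConfig(
    vocab_size=100, d_model=32, decoder_layers=2,
    decoder_attention_heads=4, decoder_ffn_dim=64,
)
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config)

pixel_values = torch.randn(1, 3, 32, 32)       # dummy single-text-line image
decoder_input_ids = torch.tensor([[1, 2, 3]])  # dummy target token ids
outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (batch, target_length, vocab_size)
```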
A big disclaimer: the TrOCR models do not work directly on entire images of PDFs etc. They are trained on single-text-line images, so one needs a text detector first, before applying TrOCR. TrOCR is a text recognition model, not a text detection model; one typically needs both in sequence to extract all text from a given image.
Important note:
The current design of the existing EncoderDecoderModel/FlaxEncoderDecoderModel is that, if the hidden_size of the encoder and decoder don't match, a single projection layer is created to project the encoder_hidden_states to the same number of channels as the decoder. However, for TrOCR, this is not how it's done. Instead, the encoder_hidden_states are projected to the decoder's dimension when projecting to keys and values, in each decoder layer. Therefore, my proposal is to add an attribute to the config of the decoder called encoder_hidden_size which, if specified, will be used in the VisionEncoderDecoderModel class to not project the encoder hidden states; instead, it will be used when instantiating the key and value projection layers. For consistency, we could also add this to the existing EncoderDecoderModel/FlaxEncoderDecoderModel. Also relevant for the FlaxVisionEncoderDecoderModel PR, see #13359.
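A minimal sketch of the projection difference described above (class and parameter names are mine, for illustration only, not the PR's actual code): the key/value projections in each decoder layer map directly from the encoder's width, so no separate encoder-to-decoder projection layer is needed:

```python
import torch
from torch import nn

class CrossAttentionSketch(nn.Module):
    """Illustrative only: keys/values are projected from the encoder's
    width inside the decoder layer, instead of first projecting the
    encoder_hidden_states to the decoder's hidden size."""

    def __init__(self, decoder_hidden_size: int, cross_attention_hidden_size: int):
        super().__init__()
        # Queries come from the decoder, so they use the decoder's width
        self.q_proj = nn.Linear(decoder_hidden_size, decoder_hidden_size)
        # Keys/values are projected directly from the encoder's width
        self.k_proj = nn.Linear(cross_attention_hidden_size, decoder_hidden_size)
        self.v_proj = nn.Linear(cross_attention_hidden_size, decoder_hidden_size)

    def forward(self, hidden_states, encoder_hidden_states):
        q = self.q_proj(hidden_states)
        k = self.k_proj(encoder_hidden_states)
        v = self.v_proj(encoder_hidden_states)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```

The output has the decoder's width regardless of the encoder's, which is why the single enc-to-dec projection layer becomes unnecessary in this scheme.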
To do:
- One currently has to pass input_ids=pixel_values to the generate method, which is not ideal. I'm in favor of letting the generate method accept an argument called inputs, which can work with text, images and speech. Related to: Consistent speech model input names for the Seq2SeqTrainer generate function #13825.
- Decide whether the model should accept attention_mask as input. Related to: Add attention-mask support for ViTModel #13764.