Add TrOCR + VisionEncoderDecoderModel #13874
Conversation
@@ -476,6 +476,7 @@ class PreTrainedModel
    def forward(
        self,
        pixel_values=None,
        attention_mask=None,
This isn't used anywhere, no? Is it just here because attention_mask is often passed in generate()?
src/transformers/models/vision_encoder_decoder/configuration_vision_encoder_decoder.py
src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
PR looks great to me! Really clean implementation that fits well with the current design of Transformers IMO - it should enable lots of image captioning tasks from pretrained ViT + BERT :-)
Also very nice that you didn't have to adapt generate() to make it work!
Just have the following points:
- Do we need this attention_mask here: Add TrOCR + VisionEncoderDecoderModel #13874 (comment)? I didn't dive too deep into the code, but if we need it just because generate(...) passes it then it's a bit hacky and we should try to avoid it. Happy to see how we can change generate for this, or at least add a kwargs(...) argument that logs that the input is not used.
- I feel quite strongly about not calling it encoder_hidden_size, but rather cross_attention_hidden_size. For a user that just looks at configuration_utils.py, the name encoder_hidden_size is not at all related to encoder-decoder architectures. Can we change that maybe?
- Can we add one slow integration test with a real model (maybe the one in your notebook)?
Overall amazing work! Think we can merge this in a couple of days :-)
Overall amazing work! Think we can merge this in a couple of days :-)
Thanks a lot for adding this!
emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
if embedding_dim % 2 == 1:
    # zero pad
This comment should be expanded or removed.
This looks great. Thanks for working on this, @NielsRogge!
Build sinusoidal embeddings. This matches the implementation in tensor2tensor, but differs slightly from the
description in Section 3.5 of "Attention Is All You Need".
How does it differ?
@patrickvonplaten knows, I just copied his implementation
The same docstring is used in Speech2Text2
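For context, here is a runnable sketch of the tensor2tensor/fairseq-style construction quoted above (the function name and scaffolding are mine, not the PR's code). The "slight difference" the docstring refers to is that this version concatenates all sines followed by all cosines along the feature dimension, whereas Section 3.5 of the paper interleaves sin and cos per frequency:

```python
import math
import torch

def sinusoidal_embeddings(num_embeddings: int, embedding_dim: int) -> torch.Tensor:
    # Geometric progression of inverse frequencies, as in tensor2tensor/fairseq
    half_dim = embedding_dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
    # Outer product: one row per position, one column per frequency
    emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
    # All sines first, then all cosines (the paper interleaves them instead)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
    if embedding_dim % 2 == 1:
        # zero-pad the last column when embedding_dim is odd
        emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
    return emb
```

At position 0 this yields zeros in the first half (sin 0) and ones in the second half (cos 0), rather than alternating 0, 1, 0, 1 as in the paper's interleaved layout.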
What does this PR do?
This PR adds the TrOCR models by Microsoft, together with a new VisionEncoderDecoderModel class (which should be used in order to use TrOCR, as it consists of an image encoder and an autoregressive text decoder). This PR is very similar to #13186; it's just the vision counterpart. Here's how to use this model:
There's also this Colab notebook for quick inference: https://colab.research.google.com/drive/1qCuqlqc4V9LZhPkxIi_XqCCTQrDhkmHi?usp=sharing
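As an illustration of the new class, a minimal sketch using tiny, randomly initialized configs of my own choosing (for real use, one would instead load a released TrOCR checkpoint via from_pretrained, as in the notebook above):

```python
import torch
from transformers import (
    TrOCRConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny illustrative sizes, not the released TrOCR checkpoints
encoder_config = ViTConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=4,
    intermediate_size=64, image_size=32, patch_size=8,
)
decoder_config = TrOCRConfig(
    vocab_size=100, d_model=32, decoder_layers=2,
    decoder_attention_heads=4, decoder_ffn_dim=64,
)
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config)

pixel_values = torch.randn(1, 3, 32, 32)       # dummy single-text-line image
decoder_input_ids = torch.tensor([[1, 2, 3]])  # dummy target token ids
outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (batch, target_length, vocab_size)
```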
A big disclaimer: the TrOCR models do not work directly on entire images of PDFs etc. They are trained on single-text-line images, so one needs a text detector first, before applying TrOCR. TrOCR is a text recognition model, not a text detection model; one typically needs both in sequence to extract all text from a given image.
Important note:
The current design of the existing EncoderDecoderModel/FlaxEncoderDecoderModel is that, if the hidden_size of the encoder and decoder don't match, a single projection layer is created to project the encoder_hidden_states to the same number of channels as the decoder. However, for TrOCR, this is not how it's done. Instead, the encoder_hidden_states are projected to the decoder's dimension when projecting to keys and values, in each decoder layer. Therefore, my proposal is to add an attribute to the config of the decoder called encoder_hidden_size which, if specified, will be used in the VisionEncoderDecoderModel class to not project the encoder hidden states; instead, it will be used when instantiating the key and value projection layers. For consistency, we could also add this to the existing EncoderDecoderModel/FlaxEncoderDecoderModel. Also relevant for the FlaxVisionEncoderDecoderModel PR, see #13359.
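A minimal sketch of the projection difference described above (class and parameter names are mine, for illustration only, not the PR's actual code): the key/value projections in each decoder layer map directly from the encoder's width, so no separate encoder-to-decoder projection layer is needed:

```python
import torch
from torch import nn

class CrossAttentionSketch(nn.Module):
    """Illustrative only: keys/values are projected from the encoder's
    width inside the decoder layer, instead of first projecting the
    encoder_hidden_states to the decoder's hidden size."""

    def __init__(self, decoder_hidden_size: int, cross_attention_hidden_size: int):
        super().__init__()
        # Queries come from the decoder, so they use the decoder's width
        self.q_proj = nn.Linear(decoder_hidden_size, decoder_hidden_size)
        # Keys/values are projected directly from the encoder's width
        self.k_proj = nn.Linear(cross_attention_hidden_size, decoder_hidden_size)
        self.v_proj = nn.Linear(cross_attention_hidden_size, decoder_hidden_size)

    def forward(self, hidden_states, encoder_hidden_states):
        q = self.q_proj(hidden_states)
        k = self.k_proj(encoder_hidden_states)
        v = self.v_proj(encoder_hidden_states)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```

The output has the decoder's width regardless of the encoder's, which is why the single enc-to-dec projection layer becomes unnecessary in this scheme.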
To do:
- One currently has to pass input_ids=pixel_values to the generate method, which is not ideal. I'm in favor of letting the generate method accept an argument called inputs, which can work with text, images and speech. Related to: Consistent speech model input names for the Seq2SeqTrainer generate function #13825.
- Decide whether the model should accept attention_mask as input. Related to: Add attention-mask support for ViTModel #13764.