Add TrOCR + VisionEncoderDecoderModel #13874

Merged 38 commits on Oct 13, 2021
Commits
26b79a6 First draft (NielsRogge, Sep 25, 2021)
eebfafa Update self-attention of RoBERTa as proposition (NielsRogge, Sep 29, 2021)
adf1cb3 Improve conversion script (NielsRogge, Sep 30, 2021)
be7ec13 Add TrOCR decoder-only model (NielsRogge, Sep 30, 2021)
1ec88d5 More improvements (NielsRogge, Sep 30, 2021)
7ded83b Make forward pass with pretrained weights work (NielsRogge, Sep 30, 2021)
9b4189f More improvements (NielsRogge, Sep 30, 2021)
9b6f68b Some more improvements (NielsRogge, Sep 30, 2021)
1127064 More improvements (NielsRogge, Sep 30, 2021)
ac5440d Make conversion work (NielsRogge, Oct 3, 2021)
6c5d947 Clean up print statements (NielsRogge, Oct 4, 2021)
b54e32e Add documentation, processor (NielsRogge, Oct 4, 2021)
d47b5f1 Add test files (NielsRogge, Oct 4, 2021)
b1a85a6 Small improvements (NielsRogge, Oct 4, 2021)
76f3a66 Some more improvements (NielsRogge, Oct 4, 2021)
1d8ed6b Make fix-copies, improve docs (NielsRogge, Oct 4, 2021)
2c4337e Make all vision encoder decoder model tests pass (NielsRogge, Oct 4, 2021)
cc4eb2c Make conversion script support other models (NielsRogge, Oct 5, 2021)
170f905 Update URL for OCR image (NielsRogge, Oct 5, 2021)
28bdf18 Update conversion script (NielsRogge, Oct 5, 2021)
890dd70 Fix style & quality (NielsRogge, Oct 5, 2021)
15f797d Add support for the large-printed model (NielsRogge, Oct 5, 2021)
f490e3a Fix some issues (NielsRogge, Oct 6, 2021)
2230eb0 Add print statement for debugging (NielsRogge, Oct 6, 2021)
f8ad61d Add print statements for debugging (NielsRogge, Oct 6, 2021)
e5f6983 Make possible fix for sinusoidal embedding (NielsRogge, Oct 6, 2021)
643c21d Further debugging (NielsRogge, Oct 6, 2021)
b7c5bf8 Potential fix v2 (NielsRogge, Oct 6, 2021)
6c4435d Add more print statements for debugging (NielsRogge, Oct 6, 2021)
1a6825f Add more print statements for debugging (NielsRogge, Oct 6, 2021)
667b03c Deubg more (NielsRogge, Oct 6, 2021)
bf49483 Comment out print statements (NielsRogge, Oct 6, 2021)
f0c8b59 Make conversion of large printed model possible, address review comments (NielsRogge, Oct 8, 2021)
6f1d7fa Make it possible to convert the stage1 checkpoints (NielsRogge, Oct 8, 2021)
c38904b Clean up code, apply suggestions from code review (NielsRogge, Oct 8, 2021)
6e6b947 Apply suggestions from code review, use Microsoft models in tests (NielsRogge, Oct 11, 2021)
b1fedab Rename encoder_hidden_size to cross_attention_hidden_size (NielsRogge, Oct 11, 2021)
f3d9e94 Improve docs (NielsRogge, Oct 12, 2021)
Improve docs
NielsRogge committed Oct 12, 2021
commit f3d9e9483d8d6b915260880312cdd58518e68cf4
9 changes: 5 additions & 4 deletions docs/source/model_doc/trocr.rst
@@ -31,9 +31,10 @@ The original code can be found `here

  Tips:

- - TrOCR achieves state-of-the-art results on both printed and handwritten text recognition tasks, such as the `IAM
- Handwriting dataset <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>`__. For more information, see the
- `official models <https://huggingface.co/models?other=trocr>`__.
+ - TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results
+ on both printed (e.g. the `SROIE dataset <https://paperswithcode.com/dataset/sroie>`__) and handwritten (e.g. the
+ `IAM Handwriting dataset <https://fki.tic.heia-fr.ch/databases/iam-handwriting-database>`__) text recognition tasks.
+ For more information, see the `official models <https://huggingface.co/models?other=trocr>`__.
  - TrOCR is always used within the :doc:`VisionEncoderDecoder <visionencoderdecoder>` framework.

  Inference
@@ -67,7 +68,7 @@ predicted token ids.
  >>> pixel_values = processor(image, return_tensors="pt").pixel_values
  >>> generated_ids = model.generate(pixel_values)

- >>> generated_text = processor.batch_decode(generated_ids)
+ >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


  See the `model hub <https://huggingface.co/models?filter=trocr>`__ to look for TrOCR checkpoints.
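To illustrate what the `skip_special_tokens=True` and `[0]` changes do to the decoded output, here is a minimal pure-Python sketch. The vocabulary, the special-token ids, and the `toy_batch_decode` helper are all hypothetical stand-ins for illustration; this is not the `TrOCRProcessor` implementation, only a model of the behaviour the diff relies on.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real tokenizer has its own ids and tokens.
TOY_VOCAB: Dict[int, str] = {0: "<s>", 1: "<pad>", 2: "</s>", 10: "hello", 11: " world"}
SPECIAL_IDS = {0, 1, 2}


def toy_batch_decode(batch_ids: List[List[int]], skip_special_tokens: bool = False) -> List[str]:
    """Decode a batch of id sequences into strings, one string per sequence."""
    texts = []
    for ids in batch_ids:
        if skip_special_tokens:
            # Drop start, end and padding markers before joining the tokens.
            ids = [i for i in ids if i not in SPECIAL_IDS]
        texts.append("".join(TOY_VOCAB[i] for i in ids))
    return texts


generated_ids = [[0, 10, 11, 2]]  # a batch containing a single sequence

raw = toy_batch_decode(generated_ids)
clean = toy_batch_decode(generated_ids, skip_special_tokens=True)[0]
```

In this sketch `raw` is a one-element list that still contains the `<s>` and `</s>` markers, while `clean` is a plain string, which matches the motivation for the change: a batch decode returns a list, so indexing with `[0]` yields the text for the single input image.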
4 changes: 0 additions & 4 deletions src/transformers/models/trocr/modeling_trocr.py
@@ -152,10 +152,6 @@ def create_position_ids_from_input_ids(
  """
  Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding
  symbols are ignored. This is modified from fairseq's `utils.make_positions`.
-
- Args:
- x: torch.Tensor x:
- Returns: torch.Tensor
  """
  # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
  mask = input_ids.ne(padding_idx).int()
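The docstring in this hunk describes the position numbering only in words, so a small pure-Python sketch may help. It assumes the usual cumulative-sum approach that the `mask` line points to; the function name `toy_position_ids` and the example batch are hypothetical, and plain lists stand in for tensors.

```python
from itertools import accumulate
from typing import List


def toy_position_ids(input_ids: List[List[int]], padding_idx: int) -> List[List[int]]:
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    for row in input_ids:
        # 1 for real tokens, 0 for padding (the list analogue of input_ids.ne(padding_idx)).
        mask = [int(token != padding_idx) for token in row]
        # Running count of real tokens seen so far in this row.
        running = list(accumulate(mask))
        # Multiplying by the mask zeroes out padding slots before the offset is added.
        position_ids.append([count * m + padding_idx for count, m in zip(running, mask)])
    return position_ids


batch = [[5, 7, 9, 1, 1], [4, 1, 6, 1, 1]]  # token id 1 is padding here
positions = toy_position_ids(batch, padding_idx=1)
# positions == [[2, 3, 4, 1, 1], [2, 1, 3, 1, 1]]
```

Note that the second row has padding in the middle: the padded slot keeps `padding_idx`, and the next real token continues the count rather than restarting it.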
@@ -245,11 +245,10 @@ def from_encoder_decoder_pretrained(

  Params:
  encoder_pretrained_model_name_or_path (:obj: `str`, `optional`):
- Information necessary to initiate the encoder. Can be either:
+ Information necessary to initiate the image encoder. Can be either:

- - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co.
- Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under
- a user or organization name, like ``dbmdz/bert-base-german-cased``.
+ - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co. An
+ example is ``google/vit-base-patch16-224-in21k``.
  - A path to a `directory` containing model weights saved using
  :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
  - A path or url to a `tensorflow index checkpoint file` (e.g, ``./tf_model/model.ckpt.index``). In
@@ -258,7 +257,7 @@ def from_encoder_decoder_pretrained(
  a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

  decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):
- Information necessary to initiate the decoder. Can be either:
+ Information necessary to initiate the text decoder. Can be either:

  - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co.
  Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under
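As a rough sketch of how the two parameters documented in these hunks fit together, the snippet below pairs an image encoder with a text decoder. The model ids are the examples named in the docstring; the wrapper function `build_vision_encoder_decoder` is a hypothetical helper, not part of the library, and calling it assumes `transformers` is installed and the checkpoints can be downloaded.

```python
def build_vision_encoder_decoder(
    encoder_id: str = "google/vit-base-patch16-224-in21k",
    decoder_id: str = "bert-base-uncased",
):
    """Combine a pretrained image encoder with a pretrained text decoder."""
    # Imported lazily so that defining this helper needs neither the library nor a download.
    from transformers import VisionEncoderDecoderModel

    # First argument: the image encoder checkpoint; second: the text decoder checkpoint.
    return VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)
```

A model assembled this way starts with freshly initialised cross-attention weights in the decoder, so it would normally need fine-tuning before its generated text is meaningful.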
@@ -400,10 +399,10 @@ def forward(
  >>> from PIL import Image
  >>> import torch

- >>> processor = TrOCRProcessor.from_pretrained('microsoft/tr-ocr-base-iam')
- >>> model = VisionEncoderDecoderModel.from_pretrained('microsoft/tr-ocr-base-iam')
+ >>> processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
+ >>> model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

- >>> # load image
+ >>> # load image from the IAM dataset
  >>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
  >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

@@ -414,7 +413,7 @@ def forward(

  >>> # inference (generation)
  >>> generated_ids = model.generate(pixel_values)
- >>> generated_text = processor.batch_decode(generated_ids)
+ >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

  """
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict