
Commit edcb83d

Logeswaran7 authored and stevhliu committed
Updated Model-card for donut (huggingface#37290)
* Updated documentation for Donut model
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Updated code suggestions
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Updated code suggestion to align with the AutoModel example
* Update docs/source/en/model_doc/donut.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Updated notes section, included code examples
* close hfoption block and indent

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 7c49cd4 commit edcb83d

File tree

1 file changed: +166 -155 lines

docs/source/en/model_doc/donut.md

Lines changed: 166 additions & 155 deletions
@@ -13,180 +13,191 @@ rendered properly in your Markdown viewer.
specific language governing permissions and limitations under the License. -->

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# Donut

[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.

Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings, and BART processes them into meaningful text sequences.
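
The sketch below shows how this pairing appears on a loaded model. It is a minimal example (assuming the base `naver-clova-ix/donut-base` checkpoint; the exact class names printed may vary across library versions) that loads Donut as a [`VisionEncoderDecoderModel`] and inspects its encoder and decoder halves.

```py
from transformers import VisionEncoderDecoderModel

# Donut is packaged as a VisionEncoderDecoderModel: a Swin-based image encoder
# feeding an autoregressive BART-style text decoder
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

print(type(model.encoder).__name__)                   # Swin-based document image encoder
print(type(model.decoder).__name__)                   # BART-style text decoder
print(model.config.encoder.image_size)                # input resolution expected by the encoder
print(model.config.decoder.max_position_embeddings)   # maximum decoder sequence length
```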

You can find all the original Donut checkpoints under the [Naver Clova Information Extraction](https://huggingface.co/naver-clova-ix) organization.
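
To browse them programmatically, the snippet below is a small sketch that uses `huggingface_hub` (installed alongside Transformers) to list the Donut checkpoints published by that organization.

```py
from huggingface_hub import list_models

# list every Donut checkpoint published under the naver-clova-ix organization
for checkpoint in list_models(author="naver-clova-ix"):
    if "donut" in checkpoint.id:
        print(checkpoint.id)
```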

> [!TIP]
> Click on the Donut models in the right sidebar for more examples of how to apply Donut to different language and vision tasks.

The examples below demonstrate how to perform document understanding tasks using Donut with [`Pipeline`] and [`AutoModel`].

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
# pip install datasets
import torch
from datasets import load_dataset
from transformers import pipeline

# document question answering pipeline with a Donut checkpoint fine-tuned on DocVQA
pipeline = pipeline(
    task="document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
    device=0,
    torch_dtype=torch.float16
)

# load a sample document image
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]

pipeline(image=image, question="What time is the coffee break?")
```

</hfoption>
<hfoption id="AutoModel">

```py
# pip install datasets
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

# load a sample document image and build the DocVQA task prompt
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
question = "What time is the coffee break?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
inputs = processor(image, task_prompt, return_tensors="pt")

# generate the answer autoregressively and decode it back to text
outputs = model.generate(
    input_ids=inputs.input_ids,
    pixel_values=inputs.pixel_values,
    max_length=512
)
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.

```py
# pip install datasets torchao
import torch
from datasets import load_dataset
from transformers import TorchAoConfig, AutoProcessor, AutoModelForVision2Seq

# quantize the model weights to int4 with torchao
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)

dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
question = "What time is the coffee break?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
inputs = processor(image, task_prompt, return_tensors="pt")

outputs = model.generate(
    input_ids=inputs.input_ids,
    pixel_values=inputs.pixel_values,
    max_length=512
)
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
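
bitsandbytes is another available backend. The variant below is a sketch of the same idea with 4-bit bitsandbytes weights; it assumes a CUDA GPU and that `bitsandbytes` is installed, and generation then proceeds exactly as in the torchao example above.

```py
# pip install bitsandbytes
import torch
from transformers import BitsAndBytesConfig, AutoModelForVision2Seq

# load the DocVQA checkpoint with 4-bit quantized weights
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForVision2Seq.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    quantization_config=bnb_config,
    device_map="auto",
)
```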

## Notes

- Use Donut for document image classification as shown below.

  ```py
  >>> import re
  >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
  >>> from datasets import load_dataset
  >>> import torch

  >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
  >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")

  >>> device = "cuda" if torch.cuda.is_available() else "cpu"
  >>> model.to(device)  # doctest: +IGNORE_RESULT

  >>> # load document image
  >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
  >>> image = dataset[1]["image"]

  >>> # prepare decoder inputs
  >>> task_prompt = "<s_rvlcdip>"
  >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

  >>> pixel_values = processor(image, return_tensors="pt").pixel_values

  >>> outputs = model.generate(
  ...     pixel_values.to(device),
  ...     decoder_input_ids=decoder_input_ids.to(device),
  ...     max_length=model.decoder.config.max_position_embeddings,
  ...     pad_token_id=processor.tokenizer.pad_token_id,
  ...     eos_token_id=processor.tokenizer.eos_token_id,
  ...     use_cache=True,
  ...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
  ...     return_dict_in_generate=True,
  ... )

  >>> sequence = processor.batch_decode(outputs.sequences)[0]
  >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
  >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
  >>> print(processor.token2json(sequence))
  {'class': 'advertisement'}
  ```

- Use Donut for document parsing as shown below.

  ```py
  >>> import re
  >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
  >>> from datasets import load_dataset
  >>> import torch

  >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
  >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

  >>> device = "cuda" if torch.cuda.is_available() else "cpu"
  >>> model.to(device)  # doctest: +IGNORE_RESULT

  >>> # load document image
  >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
  >>> image = dataset[2]["image"]

  >>> # prepare decoder inputs
  >>> task_prompt = "<s_cord-v2>"
  >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

  >>> pixel_values = processor(image, return_tensors="pt").pixel_values

  >>> outputs = model.generate(
  ...     pixel_values.to(device),
  ...     decoder_input_ids=decoder_input_ids.to(device),
  ...     max_length=model.decoder.config.max_position_embeddings,
  ...     pad_token_id=processor.tokenizer.pad_token_id,
  ...     eos_token_id=processor.tokenizer.eos_token_id,
  ...     use_cache=True,
  ...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
  ...     return_dict_in_generate=True,
  ... )

  >>> sequence = processor.batch_decode(outputs.sequences)[0]
  >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
  >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
  >>> print(processor.token2json(sequence))
  {'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
  ```
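
- [`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`]/[`XLMRobertaTokenizerFast`] into a single instance that both extracts the input features and decodes the predicted token ids.
- The [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut) show how to run Donut at inference time and how to fine-tune it on custom data.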

## DonutSwinConfig
