This is a script made for finetuning a Donut model on a batch of receipts in Croatian, AFTER finetuning the base Donut model on the SROIE dataset using HuggingFace's Transformers library. The script is adapted (well, swiped) from Phil Schmid: https://www.philschmid.de/fine-tuning-donut .
The Donut model consists of a text transformer decoder (BART) plus a Vision Transformer encoder (Swin). Luckily the text part is multilingual, so it does OK with Croatian.
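For orientation, here is a minimal sketch (assuming the public naver-clova-ix/donut-base checkpoint, which is not necessarily the exact checkpoint used in this project) of loading Donut with Transformers and peeking at its two halves:

```python
# Minimal sketch: load the public Donut base checkpoint and inspect
# its two parts (Swin vision encoder + BART-style text decoder).
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

print(type(model.encoder).__name__)  # e.g. DonutSwinModel
print(type(model.decoder).__name__)  # e.g. MBartForCausalLM
```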
The task we're looking at in this example is Document Visual Question Answering (DocVQA for short).
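To give an idea of what DocVQA-style prompting looks like with Donut, here is a rough inference sketch using the publicly released DocVQA checkpoint; the checkpoint name, image path and question are illustrative, not this project's exact setup:

```python
# Rough sketch of DocVQA-style inference with a public Donut checkpoint.
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"  # public DocVQA model
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(image, return_tensors="pt").pixel_values

# The question is wrapped in Donut's DocVQA prompt format; the model then
# generates the answer tokens directly, with no separate OCR step.
question = "What is the total amount?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
)
print(processor.batch_decode(outputs)[0])
```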
The SROIE dataset consists of about 1000 receipt images (624 usable in the end), each paired with a key-value list of the fields we want the model to extract. No OCR is run beforehand and no bounding boxes need to be drawn; we simply tell the model which pieces of information appear on the receipts and train it to produce their values for a given receipt, as sketched below.
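To make the "no OCR, no bounding boxes" part concrete, here is a simplified sketch (in the spirit of Phil Schmid's tutorial, with made-up field names) of how one receipt's key-value labels get flattened into the target sequence Donut learns to generate:

```python
# Simplified sketch: turn a receipt's key-value labels into the flat
# token sequence Donut is trained to generate.
def json2token(obj):
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items()
        )
    return str(obj)

# Example annotation for one receipt (field names are illustrative only).
label = {"company": "STORE d.o.o.", "date": "2022-07-14", "total": "123.45"}
target_sequence = json2token(label) + "</s>"
print(target_sequence)
# <s_company>STORE d.o.o.</s_company><s_date>2022-07-14</s_date><s_total>123.45</s_total></s>
```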
The hard part was collecting store receipts (and some similar documents: receipts for non-store services, toll receipts...) in Croatian, and of course labeling them. The collecting was done over the summer with generous help from fellow students, and of the roughly 250 collected receipts I ended up using around 130.
"Labeling" I did solo, and it ended up taking considerable time.
So there are two scripts in the repository. The first one processes the SROIE dataset, finetunes the base Donut model on that data, and uploads the result to HuggingFace. The second one takes the model produced by the first, processes our dataset of Croatian receipts, finetunes it one step further, and uploads the final model checkpoint to HuggingFace; a rough sketch of that second stage follows.
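The sketch below assumes a hypothetical stage-one checkpoint name and an already preprocessed `train_dataset` (both placeholders, not the repo's actual names), and uses the Seq2SeqTrainer setup from Phil Schmid's tutorial:

```python
# Rough sketch of stage two. Assumes you are logged in to the HuggingFace Hub
# and that `train_dataset` yields dicts with "pixel_values" and "labels".
from transformers import (
    DonutProcessor,
    VisionEncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

stage_one = "your-username/donut-base-finetuned-sroie"  # hypothetical stage-one repo
processor = DonutProcessor.from_pretrained(stage_one)
model = VisionEncoderDecoderModel.from_pretrained(stage_one)

train_dataset = ...  # placeholder: preprocessed Croatian receipts dataset

args = Seq2SeqTrainingArguments(
    output_dir="donut-croatian-receipts",  # hypothetical local/Hub repo name
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.push_to_hub()  # uploads the final checkpoint to HuggingFace
```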
The end result can of course be found on HuggingFace, keeping everything open source in spirit: https://huggingface.co/oxioxi/donut-base-sroie-v1.5 .
The Python notebook files are helpers for a presentation I did at the end of the project, using the nbconvert package, which can "serve" notebooks as webpages.
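For reference, nbconvert can also be driven from Python; here is a minimal sketch of turning a notebook into a standalone HTML page (the filenames are placeholders, not files from this repo):

```python
# Minimal sketch: export a notebook to HTML with nbconvert's Python API.
import nbformat
from nbconvert import HTMLExporter

nb = nbformat.read("presentation.ipynb", as_version=4)  # placeholder filename
body, _resources = HTMLExporter().from_notebook_node(nb)

with open("presentation.html", "w", encoding="utf-8") as f:
    f.write(body)
```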