This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Awesome receipt data extraction

This repository collects resources that are helpful for building a system for key information extraction from photos of receipts.

Disclaimer

Quotes and images from the publications listed below, which are available in this GitHub repository, are shared here for educational purposes only. I do not own the copyright to any of these publications. If you want me to remove your publication from this list and repository, please open an issue in this repository.

List of publications

| Year | Type of document | Title | Authors | Works on | Dataset, quantity, country of origin | Stages covered (receipt detection, receipt localization, receipt normalization, text line segmentation, optical character recognition, semantic analysis) |
|---|---|---|---|---|---|---|
| 2019.12 | Preprint | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou | scanned document images with text segments and their position from OCR | IIT-CDIP, 6M | ✔️ |
| 2019.09 | Workshop Paper | Post-OCR parsing: building simple and robust parser via BIO tagging | Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, Hwalsuk Lee | receipts' text segments with position from OCR | CORD, 1000 | ✔️ |
| 2019.09 | Workshop Paper | Chargrid-OCR: End-to-end Trainable Optical Character Recognition for Printed Documents using Instance Segmentation | Christian Reisswig, Anoop R Katti, Marco Spinaci, Johannes Höhne | printed documents | Proprietary, unknown synth + 43k real with noisy labels | ✔️ |
| 2019.09 | Conference Paper | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding | train ticket photos and synthetic images of train tickets, passports and business cards | EATEN, 2000 real train ticket + synth: 300k train ticket + 100k passport + 200k business card | ✔️ |
| 2019.09 | Conference Paper | End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net | Tuan Anh Nguyen Dang, Dat Nguyen Thanh | scanned invoices' and receipts' text with char-level bounding boxes from OCR | Toyota invoices dataset, 261 + Daiichi medical receipts dataset, 200 | ✔️ |
| 2019.09 | Conference Paper (ICDAR) | Attend, Copy, Parse: End-to-end Information Extraction from Documents | Rasmus Berg Palm, Florian Laws, Ole Winther | scanned and digitalized invoices' text with char-level bounding boxes from OCR | Proprietary, 1.2M | ✔️ |
| 2019.09 | Bachelor's thesis | Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR | Elias Johansson | photos of receipts | Proprietary, 53 | ✔️ ✔️ |
| 2019.08 | Conference Paper | Towards Unconstrained End-to-End Text Spotting | Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, Ying Xiao | photos of scenes with naturalistic text | Proprietary, SynthText, ICDAR15, COCO-Text, ICDAR-MLT and Total-Text; 30k, 200k, 1k, 17k, 7k and 1255 | ✔️ ✔️ |
| 2019.07 | Conference Paper (CBMI) | Receipt automatic reader | Olga Maslova, Louis Klein, Damien Dabernat, A Benoit, Patrick Lambert | photos of receipts | Proprietary, 1200 (receipt detection and segmentation) + 15 (text recognition quality) | ✔️ ✔️ ✔️ ✔️ |
| 2019.06 | Preprint | CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor | Xiaohui Zhao, Endi Niu, Zhuo Wu, Xiaoguang Wang | receipts' text from OCR | Proprietary, 4484, Spain + SROIE 2019, 1000 | ✔️ |
| 2019.06 | Conference Paper | A Multitask Network for Localization and Recognition of Text in Images | Mohammad Reza Sarshogh, Keegan E. Hines | synthetically-generated documents | Proprietary, 10000 | ✔️ ✔️ |
| 2019.06 | Journal Article | Visual-Linguistic Methods for Receipt Field Recognition | Rinon Gal, Nimrod Morag, Roy Shilkrot | scanned invoices' and receipts' text with char-level bounding boxes from OCR | Proprietary, 5094 | ✔️ |
| 2019.05 | Conference Paper | Deep Learning Approach for Receipt Recognition | Anh Le Duc, Dung Pham, Tuan Nguyen | scanned receipts | SROIE 2019, 1000 | ✔️ ✔️ ✔️ |
| 2019.04 | Conference Paper (ESANN) | A document detection technique using convolutional neural networks for optical character recognition systems | Lorand Dobai, Mihai Teletin | photos of receipts | Proprietary, 6700 | ✔️ ✔️ |
| 2019.03 | Conference Paper | Graph Convolution for Multimodal Information Extraction from Visually Rich Documents | Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao | receipts' text segments from OCR | Value-Added Tax Invoices (VATI), 3000 + International Purchase Receipts (IPR), 1500 | ✔️ |
| 2018.11 | Conference Paper (ICPR) | A Novel Integrated Framework for Learning both Text Detection and Recognition | Wanchen Sui, Qing Zhang, Jun Yang, Wei Chu | business card photographs and scanned handwritten text | Chinese Business Card Database, 20k + IAM Handwriting Database, 747 | ✔️ ✔️ |
| 2018.08 | Conference Paper | Chargrid: Towards Understanding 2D Documents | Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, Jean Baptiste Faddoul | scanned invoices' text with char-level bounding boxes from OCR | Proprietary, 12000 | ✔️ |
| 2018.03 | Conference Paper | Optical Character Recognition Engine to extract Food-items and Prices from Grocery Receipt Images via Templating and Dictionary-Traversal Technique | Ali Sohani, Rafi Ullah, Faraz Ali, Athaul Rai, Richard Messier | photos of receipts | N/A | ✔️ ✔️ ✔️ |
| 2018.02 | Journal Article | OCR Engine to Extract Food-Items, Prices, Quantity, Units from Receipt Images, Heuristics Rules Based Approach | Rafi Ullah, Ali Sohani, Athaul Rai, Faraz Ali, Richard Messier | photos of receipts | N/A | ✔️ ✔️ ✔️ |
| 2018 | BSc thesis | Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms | Joel Odd, Emil Theologou | receipts' text from OCR | Proprietary, 556, Sweden | ✔️ |
| 2018 | Journal Article | Preprocessing Photos of Receipts for Recognition | Wojciech Korobacz, Marek Tabędzki | photos of receipts | Proprietary, 240 | ✔️ ✔️ |
| 2018 | Preprint | Automated Receipt Image Identification, Cropping, and Parsing | Alex Yue | photos of receipts | Proprietary, 50 | ✔️ ✔️ ✔️ |
| 2017.12 | Conference Paper | OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach | Rafi Ullah, Ali Sohani, Faraz Ali, Athaul Rai | photos of receipts | N/A | ✔️ ✔️ ✔️ |
| 2017.10 | Conference Paper | Deep Learning for automatic sale receipt understanding | Rizlene Raoui-Outach, Cecile Million-Rousseau, Alexandre Benoit, Patrick Lambert | photos of receipts | Proprietary, 3000 | ✔️ ✔️ ✔️ ✔️ |
| 2017.09 | Conference Paper (ICPR) | Fused Text Segmentation Networks for Multi-oriented Scene Text Detection | Yuchen Dai, Zheng Huang, Yuting Gao, Youxuan Xu, Kai Chen, Jie Guo, Weidong Qiu | photos of scenes with naturalistic text | SynthText, 160k | ✔️ |
| 2016.07 | Bachelor's thesis | Optical Character Recognition on supermarket receipts | Marco Ziegaus | scanned receipts | Proprietary, 39 | ✔️ ✔️ ✔️ ✔️ |
| 2015.08 | Journal Article | OCR accuracy improvement on document images through a novel pre-processing approach | Abdeslam El Harraj, Naoufal Raissouni | scanned documents | MTDB, 500 | ✔️ |
| 2015 | Preprint | Mobile Scanner and OCR (A first step towards receipt to spreadsheet) | Clement Ntwari Nshuti | photos of documents | Proprietary, 77 | ✔️ ✔️ |
| 2014 | Preprint | A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images | Roland Szabo | photos of receipts | Proprietary, 20 | ✔️ ✔️ ✔️ |
| 2012.09 | Conference Paper | Receipts2Go: The Big World of Small Documents | Bill Janssen, Eric Saund, Eric A. Bier, Patricia Wall, Mary Ann Sprague | photos of receipts | N/A | ✔️ ✔️ ✔️ |

Citations

Citations in BibTeX format are available here: references.bib.

To read

High priority
  • TBA
Low priority