Awesome receipt data extraction

This repository collects resources for building a system that extracts key information from photos of receipts.

Disclaimer

Quotes and images from the publications listed below, which are available in this GitHub repository, are shared here for educational purposes only. I don't own any copyrights to these publications. If you want me to remove your publication from this list and repository, please open an issue in this repository.

List of publications

Year	Type of document	Title, authors	Works on	Dataset, quantity, country of origin	Receipt detection	Receipt localization	Receipt normalization	Text line segmentation	Optical character recognition	Semantic analysis
2019.12	Preprint	LayoutLM: Pre-training of Text and Layout for Document Image Understanding Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou	scanned document images with text segments and their positions from OCR	IIT-CDIP 6M	❌	❌	❌	❌	❌	✔️
2019.09	Workshop Paper	Post-OCR parsing: building simple and robust parser via BIO tagging Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, Hwalsuk Lee	receipts' text segments with position from OCR	CORD 1000	❌	❌	❌	❌	❗	✔️
2019.09	Workshop Paper	Chargrid-OCR: End-to-end Trainable Optical Character Recognition for Printed Documents using Instance Segmentation Christian Reisswig, Anoop R Katti, Marco Spinaci, Johannes Höhne	printed documents	Proprietary unknown synth + 43k real with noisy labels	❌	❌	❌	❌	✔️	❌
2019.09	Conference Paper	EATEN: Entity-aware Attention for Single Shot Visual Text Extraction He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding	train ticket photos and synthetic images of train tickets, passports and business cards	EATEN 2000 real train ticket + synth: 300k train ticket + 100k passport + 200k business card	❌	❌	❌	❌	❌	✔️
2019.09	Conference Paper	End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net Tuan Anh Nguyen Dang, Dat Nguyen Thanh	scanned invoices' and receipts' text with char-level bounding boxes from OCR	Toyota invoices dataset 261 + Daiichi medical receipts dataset 200	❌	❌	❌	❌	❌	✔️
2019.09	Conference Paper (ICDAR)	Attend, Copy, Parse End-to-end Information Extraction from Documents Rasmus Berg Palm, Florian Laws, Ole Winther	scanned and digitized invoices' text with char-level bounding boxes from OCR	Proprietary 1.2M	❌	❌	❌	❌	❌	✔️
2019.09	Bachelor's thesis	Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR Elias Johansson	photos of receipts	Proprietary 53	❌	❌	✔️	❌	❗	✔️
2019.08	Conference Paper	Towards Unconstrained End-to-End Text Spotting Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, Ying Xiao	photos of scenes with naturalistic text	Proprietary, SynthText, ICDAR15, COCO-Text, ICDAR-MLT and Total-Text 30k, 200k, 1k, 17k, 7k and 1255	❌	❌	❌	✔️	✔️	❌
2019.07	Conference Paper (CBMI)	Receipt automatic reader Olga Maslova, Louis Klein, Damien Dabernat, A Benoit, Patrick Lambert	photos of receipts	Proprietary 1200 (receipt detection and segmentation) + 15 (text recognition quality)	✔️	✔️	✔️	✔️	❗	❌
2019.06	Preprint	CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor Xiaohui Zhao, Endi Niu, Zhuo Wu, and Xiaoguang Wang	receipts' text from OCR	Proprietary 4484, Spain + SROIE 2019 1000	❌	❌	❌	❌	❗	✔️
2019.06	Conference Paper	A Multitask Network for Localization and Recognition of Text in Images Mohammad Reza Sarshogh, Keegan E. Hines	synthetically-generated documents	Proprietary 10000	❌	❌	❌	✔️	✔️	❌
2019.06	Journal Article	Visual-Linguistic Methods for Receipt Field Recognition Rinon Gal, Nimrod Morag, Roy Shilkrot	scanned invoices' and receipts' text with char-level bounding boxes from OCR	Proprietary 5094	❌	❌	❌	❌	❌	✔️
2019.05	Conference Paper	Deep Learning Approach for Receipt Recognition Anh Le Duc, Dung Pham, Tuan Nguyen	scanned receipts	SROIE 2019 1000	❌	✔️	❌	✔️	✔️	❌
2019.04	Conference Paper (ESANN)	A document detection technique using convolutional neural networks for optical character recognition systems Lorand Dobai, Mihai Teletin	photos of receipts	Proprietary 6700	❌	✔️	✔️	❌	❌	❌
2019.03	Conference Paper	Graph Convolution for Multimodal Information Extraction from Visually Rich Documents Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao	receipts' text segments from OCR	Value-Added Tax Invoices (VATI) 3000 + International Purchase Receipts (IPR) 1500	❌	❌	❌	❌	❌	✔️
2018.11	Conference Paper (ICPR)	A Novel Integrated Framework for Learning both Text Detection and Recognition Wanchen Sui, Qing Zhang, Jun Yang, Wei Chu	business card photographs and scanned handwritten text	Chinese Business Card Database 20k + IAM Handwriting Database 747	❌	❌	❌	✔️	✔️	❌
2018.08	Conference Paper	Chargrid: Towards Understanding 2D Documents Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, Jean Baptiste Faddoul	scanned invoices' text with char-level bounding boxes from OCR	Proprietary 12000	❌	❌	❌	❌	❗	✔️
2018.03	Conference Paper	Optical Character Recognition Engine to extract Food-items and Prices from Grocery Receipt Images via Templating and Dictionary-Traversal Technique Ali Sohani, Rafi Ullah, Faraz Ali, Athaul Rai, Richard Messier	photos of receipts	N/A	❌	✔️	✔️	❌	❗	✔️
2018.02	Journal Article	OCR Engine to Extract Food-Items, Prices, Quantity, Units from Receipt Images, Heuristics Rules Based Approach Rafi Ullah, Ali Sohani, Athaul Rai, Faraz Ali, Richard Messier	photos of receipts	N/A	❌	✔️	✔️	❌	❗	✔️
2018	BSc thesis	Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms Joel Odd, Emil Theologou	receipts' text from OCR	Proprietary 556, Sweden	❌	❌	❌	❌	❗	✔️
2018	Journal Article	Preprocessing Photos of Receipts for Recognition Wojciech Korobacz, Marek Tabędzki	photos of receipts	Proprietary 240	❌	✔️	✔️	❌	❗	❌
2018	Preprint	Automated Receipt Image Identification, Cropping, and Parsing Alex Yue	photos of receipts	Proprietary 50	❌	✔️	✔️	❌	❗	✔️
2017.12	Conference Paper	OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach Rafi Ullah, Ali Sohani, Faraz Ali, Athaul Rai	photos of receipts	N/A	❌	✔️	✔️	❌	❗	✔️
2017.10	Conference Paper	Deep Learning for automatic sale receipt understanding Rizlene Raoui-Outach, Cecile Million-Rousseau, Alexandre Benoit and Patrick Lambert	photos of receipts	Proprietary 3000	✔️	✔️	✔️	✔️	❗	❗
2017.09	Conference Paper (ICPR)	Fused Text Segmentation Networks for Multi-oriented Scene Text Detection Yuchen Dai, Zheng Huang, Yuting Gao, Youxuan Xu, Kai Chen, Jie Guo, Weidong Qiu	photos of scenes with naturalistic text	SynthText 160k	❌	❌	❌	✔️	❌	❌
2016.07	Bachelor's thesis	Optical Character Recognition on supermarket receipts Marco Ziegaus	scanned receipts	Proprietary 39	❌	❌	✔️	✔️	✔️	✔️
2015.08	Journal Article	OCR accuracy improvement on document images through a novel pre-processing approach Abdeslam El Harraj, Naoufal Raissouni	scanned documents	MTDB 500	❌	❌	✔️	❌	❌	❌
2015	Preprint	Mobile Scanner and OCR (A first step towards receipt to spreadsheet) Clement Ntwari Nshuti	photos of documents	Proprietary 77	❌	✔️	✔️	❌	❗	❌
2014	Preprint	A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images Roland Szabo	photos of receipts	Proprietary 20	❌	✔️	❌	✔️	✔️	❌
2012.09	Conference Paper	Receipts2Go: The Big World of Small Documents Bill Janssen, Eric Saund, Eric A. Bier, Patricia Wall, Mary Ann Sprague	photos of receipts	N/A	❌	✔️	✔️	❌	❗	✔️

Citations

Citations in BibTeX format are available in references.bib.

To read

High priority

TBA

Low priority

Expense Control: A Gamified, Semi-Automated, Crowd-Based Approach For Receipt Capturing
BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
Segmentation, Labeling and Optical Character Recognition Applied on Receipt Images
[D] Long-term Text-Recognition?
Find receipts, warp perspective and OCR with Tesseract JS in browser
Survey Of Receipt Identification And Classification Using Machine Learning
TBA
