We developed an automated method to extract key features, such as the wine name and the bottle and case prices, from scanned Sherry Lehmann wine catalogs dating from the 1930s to the 1980s.
https://nirvolo.wixsite.com/wine-catalog/
Pytesseract v3.ipynb -- Final version of the code for our approach.
Accuracy Test.ipynb -- Code that computes the error rate of an image and stores it in a dataframe (one image per output).
df_ave_error_revised.csv -- CSV file of the average error rates for our test set.
Practice.ipynb -- Outputs from PDFpen Pro.
Pytesseract.ipynb -- Code for the different approaches we considered in the early stages of the project.
Pytesseract v2.ipynb -- Code to test one of the best approaches we had.
Wine Tesseract test .R -- Code using Tesseract in R.
adistance.R -- R code that uses Levenshtein distance.
docextractor.R -- R code to extract text from docx files.
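
Several of the files above deal with error rates and Levenshtein distance. As a rough illustration of how the two can fit together, here is a minimal Python sketch that scores OCR output against a hand-typed reference. It is not taken from the notebooks: the function names `levenshtein` and `error_rate`, and the choice to normalise by the length of the reference text, are assumptions made for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def error_rate(ocr_text: str, truth: str) -> float:
    """Edit distance divided by the reference length (assumed normalisation)."""
    if not truth:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, truth) / len(truth)
```

For example, `levenshtein("kitten", "sitting")` returns 3, and `error_rate(ocr_line, typed_line)` gives a per-line score that could be averaged over a test set. Dividing by the reference length keeps scores comparable across lines of different lengths, but other normalisations are equally reasonable.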