OCR Accuracy Evaluation for Image-Based Text Extraction

This project is a practical, beginner-friendly guide for users with datasets of scanned images, archival documents, or photos of text who want to extract accurate text using OCR — no deep technical setup required.

All workflows run in Google Colab, so you don’t need to install anything locally.

What You’ll Learn / Do

Test 4 OCR tools side by side: Tesseract, EasyOCR, PaddleOCR, and Gemini
Improve accuracy with image preprocessing (grayscale, thresholding, shadow removal)
See which tool performs best for your dataset
Evaluate OCR output using Word Error Rate (WER) and Character Error Rate (CER)
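
The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not the notebooks' own code: it assumes an image given as rows of (R, G, B) tuples, and the function names `to_grayscale`, `otsu_threshold`, and `binarize` are hypothetical. It covers grayscale conversion and thresholding only (Otsu's method is one common way to pick the threshold); shadow removal is not shown.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values."""
    return [
        [min(255, round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def otsu_threshold(gray_image):
    """Return the level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for row in gray_image:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]          # pixels at or below this level
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue                           # no dark pixels yet
        weight_fg = total - weight_bg          # pixels above this level
        if weight_fg == 0:
            break                              # everything is already "dark"
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray_image, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray_image]
```

In practice you would probably load the image with an imaging library and pass its pixel data through equivalent steps before handing the result to an OCR engine.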

Try It Out (Colab Links)

Notebook	Description	Link
🧪 `OCR_packages_comparison`	Compare multiple OCR engines on one image	Open in Colab
📅 `Preprocessing_demo`	Show impact of preprocessing visually	Open in Colab
🤖 `Gemini_error_solution`	Fix Gemini setup and API usage errors	Open in Colab
🔍 `LLM_OCR_comparison`	Compare LLMs vs. OCR tools	Open in Colab

Place your images in Google Drive under `/OCR evaluation/Data/` so the notebooks can find them.
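
As a rough sketch of how a notebook might locate those images, the snippet below mounts Drive and lists the image files in the data folder. It assumes Colab's usual mount point (`/content/drive`, with your Drive root exposed as `MyDrive`); the helper names `mount_drive` and `list_images` and the set of file extensions are hypothetical and may not match what the notebooks actually do.

```python
from pathlib import Path

# Assumed location of the data folder once Drive is mounted in Colab.
DATA_DIR = Path("/content/drive/MyDrive/OCR evaluation/Data")

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def mount_drive():
    """Mount Google Drive. This only works inside a Colab runtime."""
    from google.colab import drive
    drive.mount("/content/drive")


def list_images(data_dir=DATA_DIR):
    """Return image files in data_dir sorted by name (empty list if missing)."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []
    return sorted(
        path for path in data_dir.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

In Colab you would call `mount_drive()` once, then `list_images()` to check that your files are visible before running any OCR engine.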

Sample OCR + LLM Accuracy Table

Engine	WER	CER	LLM
Gemini 2.0 Flash	0.04	0.02	Yes
Qwen3-235B-A22B	0.06	0.03	Yes
DeepSeek-V3-R1	0.29	0.26	Yes
ChatGPT 4o	0.58	0.45	Yes
Tesseract	0.69	0.43	No
PaddleOCR	0.79	0.76	No
EasyOCR	0.89	0.67	No
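
To make the WER and CER columns concrete, here is a minimal pure-Python sketch of both metrics. It is an illustration, not the notebooks' implementation: the project notes say the comparisons use jiwer, and these hypothetical helpers (`edit_distance`, `wer`, `cer`) apply no text normalisation, so scores may differ from jiwer's. Both metrics divide the edit distance (substitutions, insertions, deletions) by the length of the reference, counted in words for WER and in characters for CER. Lower is better, and values can exceed 1.0 when the output contains many insertions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One misread word out of four.
print(wer("the quick brown fox", "the quick brovvn fox"))  # → 0.25
```

The same misread costs two character edits out of 19 reference characters, so CER for this pair is about 0.11. This is why a tool can show a much higher WER than CER: a single wrong character spoils a whole word.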

Who This Is For

This repo is ideal for:

Humanities & archive researchers with scanned documents
Social scientists or students digitising printed material
Anyone with a folder of photos who just wants to know:
"Which OCR tool gives me the best results?"

You don’t need to install Python or understand OCR theory: everything runs in Google Colab.

📌 Notes

Google Drive is used to store image data and outputs
Preprocessing is optional, but highly recommended for historical or noisy images
WER/CER comparisons use jiwer for reproducibility

👌 Credits

Built with:

📬 Feedback or questions? Feel free to open an issue or fork the repo.

UnbrokenCocoon/OCR-evaluation