This project implements an OCR (Optical Character Recognition) system specifically designed for historical text recognition, focusing on early modern printed sources. The system uses a combination of deep learning architectures to accurately transcribe historical documents while handling various challenges like layout variations and text embellishments.
```
.
├── data/                # Dataset storage
│   ├── raw/             # Original PDF scans
│   ├── processed/       # Processed images and annotations
│   └── transcriptions/  # Reference transcriptions
├── notebooks/           # Jupyter notebooks for analysis
├── src/                 # Source code
│   ├── data/            # Data processing utilities
│   ├── models/          # Model architectures
│   ├── training/        # Training scripts
│   └── utils/           # Utility functions
├── outputs/             # Model outputs and results
└── requirements.txt     # Project dependencies
```
- Create a virtual environment:

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

The OCR system uses a hybrid architecture combining:
- A CNN backbone for feature extraction
- A Transformer encoder for sequence modeling
- A CTC decoder for text recognition
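To illustrate the final stage of this pipeline, here is a minimal sketch of CTC best-path (greedy) decoding: the decoder takes the per-frame argmax label indices emitted by the network, collapses consecutive repeats, and removes blank symbols. The function name, the blank index of 0, and the toy alphabet are illustrative assumptions, not the project's actual API.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """CTC best-path decoding: collapse repeated labels, then drop blanks.

    frame_ids: per-frame argmax label indices from the network output.
    blank: index of the CTC blank symbol (assumed 0 here).
    """
    decoded = []
    prev = None
    for idx in frame_ids:
        # Only keep a label when it differs from the previous frame
        # (collapsing repeats) and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Hypothetical toy alphabet for demonstration only.
ALPHABET = {1: "a", 2: "b", 3: "c"}

frames = [0, 3, 3, 0, 1, 1, 1, 0, 2]
labels = ctc_greedy_decode(frames)          # [3, 1, 2]
text = "".join(ALPHABET[i] for i in labels)  # "cab"
```

Note that the blank symbol is what lets CTC represent genuinely doubled characters: `[1, 1]` decodes to a single `a`, while `[1, 0, 1]` decodes to `aa`.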
Key features:
- Layout-aware text detection
- Robust handling of historical fonts and styles
- Support for early modern English text variations
The model is evaluated using:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Layout detection accuracy
- Processing speed
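The two error-rate metrics above can be computed from the Levenshtein edit distance between a reference transcription and the model's hypothesis: CER divides character-level edits by the reference length, and WER does the same over whitespace-separated words. The sketch below is a self-contained illustration of those definitions, not the project's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (chars or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: char edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word Error Rate: word edits / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

For example, `wer("the old text", "the olde text")` is 1/3: one of the three reference words was substituted.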
- Data Preparation:

  ```shell
  python src/data/prepare_data.py
  ```

- Training:

  ```shell
  python src/training/train.py
  ```

- Inference:

  ```shell
  python src/inference.py
  ```

Model performance metrics and visualizations are stored in the outputs/ directory.
This project is licensed under the MIT License - see the LICENSE file for details.