Experiments on extracting numerical text as strings from low-resolution graphics.
PAGASA's Seasonal Rainfall Forecast graphic (at /data/regions.JPG) is used as a test target.
- Windows 10
- Python v3.10.5
- OpenCV for Python
  - version 4.6.0.66
  - Installed from requirements.txt
- Tesseract OCR (for Windows)
  - Tesseract at UB Mannheim - Installer for Windows
  - tesseract-ocr-w64-setup-v5.1.0.20220510.exe
- Clone this repository.
git clone https://github.com/ciatph/textract.git
- Install dependencies.
pip install -r requirements.txt
- Create a .env file from the .env.example file.
  - Replace the TESSERACT_EXECUTABLE_PATH variable with Tesseract's installation path on your machine.
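The .env step above could be consumed from Python roughly as follows. This is a minimal standard-library sketch, not necessarily how the repository's scripts load the file (they may rely on a package such as python-dotenv). The helper names read_env_file and resolve_tesseract_path are hypothetical; only the TESSERACT_EXECUTABLE_PATH variable name comes from this README.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_tesseract_path(env_file=".env"):
    """Return the configured Tesseract path, or None if it is not set."""
    # A real environment variable takes precedence over the file entry.
    from_environment = os.environ.get("TESSERACT_EXECUTABLE_PATH")
    if from_environment:
        return from_environment
    if Path(env_file).is_file():
        return read_env_file(env_file).get("TESSERACT_EXECUTABLE_PATH")
    return None
```

If the scripts use the pytesseract wrapper (an assumption, since requirements.txt is not reproduced here), the returned path would typically be assigned to pytesseract.pytesseract.tesseract_cmd before any OCR call.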
- Run any of the Python scripts below from the command line.
- Press ENTER to clear the image windows.
- Edit and adjust the image processing settings in the .py files to get the desired results.
- Compare the accuracy of the extracted text against the text visible in the image files.
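The accuracy comparison in the last step is done by eye in this README. One way to put a number on it is sketched below with Python's standard difflib module. The function names are hypothetical, and the expected text has to be typed in by hand from the image; nothing here is part of the repository's scripts.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse all whitespace so layout differences do not count as errors."""
    return " ".join(text.split())


def similarity(expected, extracted):
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, normalize(expected), normalize(extracted)).ratio()


def exact_token_matches(expected, extracted):
    """Count whitespace-separated tokens that match position by position."""
    pairs = zip(expected.split(), extracted.split())
    return sum(1 for want, got in pairs if want == got)


if __name__ == "__main__":
    # Made-up values for illustration only: the OCR output reads "4O" for "40".
    expected_text = "12.5 40 7"
    extracted_text = "12.5 4O 7"
    print(similarity(expected_text, extracted_text))
    print(exact_token_matches(expected_text, extracted_text))
```

The character-level ratio is forgiving of a single misread digit, while the token count treats any number containing a wrong character as a miss, which is usually the stricter and more useful measure for numerical text.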
Extracts numerical text using more complete image operations. Shows the binarized and grayscale versions of the cropped image target and surrounds significant objects with bounding boxes.
Extracts numerical text from grayscale, binarized image files. Draws bounding boxes around significant objects.
@ciatph
20220708