A Flask-based web application that extracts text from uploaded images using Optical Character Recognition (OCR), supporting both Hindi and English languages.
- Bilingual OCR: Extracts text from images containing both Hindi and English text
- Image Preprocessing: Applies denoising and thresholding to improve OCR accuracy
- Web Interface: User-friendly web interface for uploading images and viewing results
- Secure File Handling: Uses secure filename handling to prevent directory traversal attacks
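To illustrate the secure file handling point, here is a minimal, dependency-free sketch. It is an assumption about the general approach, not the application's actual code: Flask projects commonly use `werkzeug.utils.secure_filename` for this, and the names `safe_upload_path` and `UPLOAD_DIR` are hypothetical.

```python
import re
from pathlib import Path

UPLOAD_DIR = Path("static")  # hypothetical; static/ is listed as the upload directory


def safe_upload_path(filename: str, upload_dir: Path = UPLOAD_DIR) -> Path:
    """Return a path inside upload_dir for an untrusted filename."""
    # Keep only the final path component, whatever separator the client used.
    name = filename.replace("\\", "/").split("/")[-1]
    # Replace anything outside a conservative allowlist of characters.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).lstrip(".")
    if not name:
        raise ValueError("filename is empty after sanitizing")
    target = (upload_dir / name).resolve()
    # Final guard: the resolved path must still live under upload_dir.
    if upload_dir.resolve() not in target.parents:
        raise ValueError("path escapes the upload directory")
    return target


print(safe_upload_path("../../etc/passwd").name)  # passwd
print(safe_upload_path("scan 01.png").name)       # scan_01.png
```

Stripping directory components and then re-checking the resolved path gives two independent guards, so a traversal attempt such as `../../etc/passwd` ends up as a harmless file name inside the upload directory.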
Before running this application, ensure you have the following installed:
- Python 3.7+
- Tesseract OCR engine
- Hindi language data for Tesseract
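A quick way to confirm these prerequisites is to ask Tesseract which languages it has installed. The script below is a sketch under a few assumptions: `tesseract` is expected on your PATH, and the helper names `parse_langs` and `check_prerequisites` are hypothetical. It relies on the `tesseract --list-langs` command, whose first output line is a header.

```python
import shutil
import subprocess
import sys


def parse_langs(list_langs_output: str) -> set:
    """Parse the output of `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # The first line is a header such as 'List of available languages (3):'.
    return {line for line in lines[1:] if line}


def check_prerequisites(required=("eng", "hin")) -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python 3.7+ is required")
        return problems
    exe = shutil.which("tesseract")
    if exe is None:
        problems.append("tesseract was not found on PATH")
        return problems
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some Tesseract versions print the list to stderr, so fall back to it.
    installed = parse_langs(result.stdout or result.stderr)
    for lang in required:
        if lang not in installed:
            problems.append(f"missing Tesseract language data: {lang}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites() or ["All prerequisites look fine"]:
        print(problem)
```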
- Download the Tesseract installer for Windows from UB-Mannheim/tesseract
- Run the installer
- Add Tesseract to your system PATH
- Download the Hindi language data (hin.traineddata) and place it in the Tesseract tessdata directory
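To check that the language file landed in the right place, a small helper like the one below can help. It is only a sketch: the function name `missing_traineddata` is hypothetical, and the example path assumes the default install location, so adjust it to wherever Tesseract lives on your machine.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs=("hin",)):
    """Return the language codes whose .traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs if not (folder / f"{lang}.traineddata").is_file()]


# Assumed default install location; change this to match your setup.
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
# An empty list means every requested language file was found.
print(missing_traineddata(tessdata))
```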
On macOS (Homebrew):
brew install tesseract
brew install tesseract-lang
On Debian/Ubuntu:
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-hin
Clone the repository:
git clone <your-repo-url>
cd <repository-directory>
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install the Python dependencies:
pip install -r requirements.txt
If Tesseract is not on your PATH (for example, with a default Windows install), set the executable path for pytesseract:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # Update this path
Start the Flask application:
python app.py
Open your web browser and navigate to http://localhost:5000
Upload an image containing text (Hindi, English, or both)
View the extracted text on the results page
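For readers curious about what happens between upload and result, the sketch below shows one plausible denoise, threshold, and OCR pipeline. It is an assumption rather than the code in app.py: this README does not say which preprocessing library is used, so OpenCV, non-local means denoising, and Otsu thresholding are illustrative choices, and `build_lang` and `extract_text` are hypothetical names.

```python
def build_lang(codes=("hin", "eng")):
    """Join Tesseract language codes with '+', as its language option expects."""
    return "+".join(codes)


def extract_text(image_path, langs=("hin", "eng")):
    """Denoise, binarize, and OCR one image; returns the recognized text."""
    # Imported lazily so build_lang works without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Non-local means denoising; h=10 is a guess, tune it for your images.
    denoised = cv2.fastNlMeansDenoising(gray, None, 10)
    # Otsu picks the global threshold automatically from the histogram.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=build_lang(langs))


print(build_lang())  # hin+eng
```

Passing `hin+eng` lets Tesseract consider both scripts in a single pass, which is what makes mixed Hindi and English images workable.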
├── app.py # Main Flask application
├── templates/
│ ├── index.html # Home page with upload form
│ └── result.html # Results display page
├── static/ # Directory for uploaded images
├── requirements.txt # Python dependencies
└── README.md # This file
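To show how these pieces could fit together, here is a hedged sketch of what an app.py along these lines might look like. It is not the project's actual code: the form field name `image`, the template variable `text`, the extension allowlist, and the `create_app` factory are all assumptions made for illustration.

```python
import os

ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg"}  # hypothetical allowlist


def allowed_file(filename):
    """Accept only filenames whose extension is on the allowlist."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def create_app(extract_text, upload_dir="static"):
    """Build the app; extract_text is any callable mapping an image path to text."""
    # Imported here so allowed_file can be used without Flask installed.
    from flask import Flask, render_template, request
    from werkzeug.utils import secure_filename

    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        if request.method == "POST":
            upload = request.files.get("image")  # form field name is assumed
            if upload and allowed_file(upload.filename):
                path = os.path.join(upload_dir, secure_filename(upload.filename))
                upload.save(path)
                return render_template("result.html", text=extract_text(path))
        # GET requests and rejected uploads fall back to the upload form.
        return render_template("index.html")

    return app
```

Taking the OCR function as a parameter keeps the routing separate from the image processing, so either half can be swapped or tested on its own.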