A Streamlit application for extracting text from documents in multiple languages using advanced image preprocessing techniques and OCR (Optical Character Recognition).
- Support for multiple languages
- Advanced image preprocessing
- Adjustable DPI settings
- Confidence score reporting
- Downloadable text output
- Real-time image processing preview
- Comprehensive scanning tips
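As a rough illustration of how a confidence score could be reported, the sketch below averages per-word confidences. It assumes the OCR result is shaped like the dictionary returned by pytesseract's `image_to_data(..., output_type=Output.DICT)`, with parallel `conf` and `text` lists. The function name `mean_confidence` is hypothetical and not taken from `app.py`.

```python
from statistics import mean


def mean_confidence(ocr_data):
    """Return the average word confidence (0-100), or None if no words were found."""
    scores = []
    for conf, text in zip(ocr_data["conf"], ocr_data["text"]):
        value = float(conf)  # confidences may arrive as strings or numbers
        # Tesseract reports -1 for boxes that are not words (page, block, line)
        if value >= 0 and text.strip():
            scores.append(value)
    return mean(scores) if scores else None


# Hand-written sample data standing in for a real OCR result
sample = {"conf": ["-1", "96", "88", "-1"], "text": ["", "Hello", "world", ""]}
print(mean_confidence(sample))
```

Skipping the `-1` entries matters: including them would drag the average down for reasons unrelated to recognition quality.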
git clone <your-repository-url>
cd <repository-directory>
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/MacOS
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Download the Tesseract installer from UB-Mannheim
- Run the installer
- Make sure to check "Add to PATH" during installation
- Select additional language packs during installation as needed
sudo apt update
sudo apt install tesseract-ocr
# Install language packs (replace 'lang' with a language code, e.g. tesseract-ocr-fra for French)
sudo apt install tesseract-ocr-lang
brew install tesseract
# Install additional language data (the tesseract-lang formula bundles all supported languages)
brew install tesseract-lang
streamlit run app.py
- Create a Streamlit account at share.streamlit.io
- Connect your GitHub repository
- Deploy your app through the Streamlit dashboard
- Upload a document image (supported formats: JPG, JPEG, PNG, BMP, TIFF)
- Select the language(s) present in your document
- Adjust DPI settings if needed
- Click "Extract Text" to process the document
- View results and download extracted text
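When several languages are selected, Tesseract expects their codes joined with `+` (for example `eng+fra`). The sketch below shows one way the selection step could be turned into that argument. The `LANGUAGE_CODES` mapping and the `build_lang_argument` helper are illustrative assumptions, not the names used in `app.py`.

```python
# Hypothetical mapping from display names to Tesseract language codes
LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Hindi": "hin",
    "Chinese (Simplified)": "chi_sim",
}


def build_lang_argument(selected):
    """Join the selected languages into a Tesseract language string."""
    if not selected:
        raise ValueError("Select at least one language")
    codes = []
    for name in selected:
        code = LANGUAGE_CODES[name]  # raises KeyError for unknown names
        if code not in codes:  # keep order, drop duplicates
            codes.append(code)
    return "+".join(codes)


print(build_lang_argument(["English", "French"]))
```

The resulting string can be passed as the `lang` argument of pytesseract's `image_to_string`, or to the `tesseract` CLI via `-l`.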
The application supports various languages including (but not limited to):
- English
- French
- German
- Spanish
- Italian
- Portuguese
- Russian
- Chinese (Simplified and Traditional)
- Japanese
- Korean
- Arabic
- Hindi
- Bengali
- Thai
- Vietnamese
Note: Language availability depends on installed Tesseract language packs.
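To see which language packs are actually installed, you can query Tesseract with `tesseract --list-langs`. The following is a minimal sketch of that check; the helper names `parse_list_langs` and `installed_languages` are hypothetical, and the sample output is hand-written for illustration.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages(tesseract_cmd="tesseract"):
    """Run Tesseract and return the installed language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case this build prints the list there
    return parse_list_langs(result.stdout or result.stderr)


# Parsing a hand-written sample, so this runs without Tesseract installed
example = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
print(parse_list_langs(example))
```

Note that `osd` (orientation and script detection) appears in this list alongside real language codes, so a language picker may want to filter it out.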
multilingual-ocr/
│
├── app.py # Main application file
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .gitignore # Git ignore file
└── .streamlit/ # Streamlit configuration
    └── config.toml # Streamlit config file
Create a .streamlit/config.toml file:
[theme]
primaryColor = "#F63366"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"
[server]
maxUploadSize = 200  # Maximum upload size in MB
- Tesseract Not Found Error
  - Verify Tesseract is installed correctly
  - Check if Tesseract is added to PATH
  - Confirm installation path in the code matches your system
- Language Pack Issues
  - Verify language packs are installed
  - Check language code usage
  - Install additional language packs as needed
- Image Processing Errors
  - Ensure image is in supported format
  - Check image resolution and size
  - Verify image is not corrupted
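The PATH check for the "Tesseract Not Found" case can be done from Python with the standard library. This is a minimal sketch, assuming the executable is named `tesseract`; the `find_tesseract` helper is hypothetical and not part of `app.py`.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it or add it to PATH.")
else:
    # With pytesseract, an explicit path can be set like this:
    # pytesseract.pytesseract.tesseract_cmd = path
    print(f"Using Tesseract at {path}")
```

If the check fails on Windows right after installation, opening a new terminal usually picks up the updated PATH.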
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Tesseract OCR engine
- OpenCV for image processing
- Streamlit for the web interface
- PIL for image handling
For support, please open an issue in the repository or contact the maintainers.