Multilingual Document OCR Application

A Streamlit application for extracting text from documents in multiple languages using advanced image preprocessing techniques and OCR (Optical Character Recognition).

Features

Support for multiple languages
Advanced image preprocessing
Adjustable DPI settings
Confidence score reporting
Downloadable text output
Real-time image processing preview
Comprehensive scanning tips

Installation

1. Clone the Repository

git clone <your-repository-url>
cd <repository-directory>

2. Create Virtual Environment (Recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# Linux/MacOS
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Install Tesseract OCR

Windows:

Download the Tesseract installer from UB-Mannheim
Run the installer
Make sure to check "Add to PATH" during installation
Select additional language packs during installation as needed

Linux:

sudo apt update
sudo apt install tesseract-ocr
# Install language packs (replace 'lang' with a Tesseract language code, e.g. tesseract-ocr-fra for French)
sudo apt install tesseract-ocr-lang

MacOS:

brew install tesseract
# Install the additional language packs (tesseract-lang is a single Homebrew formula that bundles them; 'lang' is not a placeholder here)
brew install tesseract-lang
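
After installing, it can help to confirm which language packs Tesseract actually sees. The snippet below is a minimal sketch and is not taken from app.py; the function names are hypothetical. It assumes only the standard `tesseract --list-langs` command and the Python standard library.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if line and not line.startswith("List of available"):
            codes.append(line)
    return sorted(codes)


def installed_tesseract_languages(executable="tesseract"):
    """Return installed language codes, or None if the executable is not found."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older builds print the list to stderr, so both streams are parsed.
    # Any warning lines Tesseract prints would also end up in the result.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `installed_tesseract_languages()` returns `None` when Tesseract is not on PATH. In that case, passing the full path to the executable as the argument is one way to check an installation that was not added to PATH.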

Running the Application

Local Development

streamlit run app.py

Deploying to Streamlit Cloud

1. Create a Streamlit account at share.streamlit.io
2. Connect your GitHub repository
3. Deploy your app through the Streamlit dashboard

Usage

1. Upload a document image (supported formats: JPG, JPEG, PNG, BMP, TIFF)
2. Select the language(s) present in your document
3. Adjust DPI settings if needed
4. Click "Extract Text" to process the document
5. View results and download extracted text
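
The extraction and confidence-reporting steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in app.py: it assumes the pytesseract and Pillow packages are installed, and the function names `summarize_ocr_data` and `extract_text` are hypothetical. The DPI value is passed through Tesseract's `--dpi` option.

```python
def summarize_ocr_data(data):
    """Join recognized words and average their confidence scores.

    `data` is the dictionary returned by pytesseract.image_to_data with
    output_type=Output.DICT. Entries with confidence -1 are layout boxes
    rather than words, so they are skipped.
    """
    words = []
    confidences = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as int, float, or str
        word = str(text).strip()
        if word and conf >= 0:
            words.append(word)
            confidences.append(conf)
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return " ".join(words), mean_conf


def extract_text(image_path, lang="eng", dpi=300):
    """Run Tesseract on one image and return (text, mean confidence)."""
    # Third-party imports are local so the helper above works without them.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(
            image,
            lang=lang,
            config=f"--dpi {dpi}",
            output_type=pytesseract.Output.DICT,
        )
    return summarize_ocr_data(data)
```

For a document containing English and French, a call might look like `extract_text("scan.png", lang="eng+fra")`. Joining words with single spaces discards line breaks, which is a simplification made for this sketch.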

Supported Languages

The application supports various languages including (but not limited to):

English
French
German
Spanish
Italian
Portuguese
Russian
Chinese (Simplified and Traditional)
Japanese
Korean
Arabic
Hindi
Bengali
Thai
Vietnamese

Note: Language availability depends on installed Tesseract language packs.
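
Tesseract identifies languages by short codes and accepts several at once when they are joined with `+`. The mapping below is a sketch of how the names listed above could be translated into such an argument; the dictionary and the function name `build_lang_argument` are hypothetical and may differ from what app.py does.

```python
# Display names from the list above mapped to Tesseract language codes.
LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Italian": "ita",
    "Portuguese": "por",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
    "Hindi": "hin",
    "Bengali": "ben",
    "Thai": "tha",
    "Vietnamese": "vie",
}


def build_lang_argument(names, installed=None):
    """Build a Tesseract language string such as "eng+fra".

    `installed` is an optional collection of installed language codes;
    when given, a missing pack raises ValueError.
    """
    codes = []
    for name in names:
        code = LANGUAGE_CODES[name]  # raises KeyError for unknown names
        if installed is not None and code not in installed:
            raise ValueError(f"Language pack '{code}' ({name}) is not installed")
        if code not in codes:
            codes.append(code)
    if not codes:
        raise ValueError("Select at least one language")
    return "+".join(codes)
```

The returned string can be passed as the `lang` argument of pytesseract functions or as the `-l` option of the `tesseract` command.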

Project Structure

multilingual-ocr/
│
├── app.py                  # Main application file
├── requirements.txt        # Python dependencies
├── README.md              # Project documentation
├── .gitignore             # Git ignore file
└── .streamlit/            # Streamlit configuration
    └── config.toml        # Streamlit config file

Configuration

Streamlit Config

Create a .streamlit/config.toml file:

[theme]
primaryColor = "#F63366"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"

[server]
maxUploadSize = 200

Troubleshooting

Tesseract Not Found Error

- Verify Tesseract is installed correctly
- Check if Tesseract is added to PATH
- Confirm installation path in the code matches your system

Language Pack Issues

- Verify language packs are installed
- Check language code usage
- Install additional language packs as needed

Image Processing Errors

- Ensure image is in supported format
- Check image resolution and size
- Verify image is not corrupted
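
The image checks above can be automated before running OCR. The following is a rough sketch, not part of app.py: the function name `check_image_file` is hypothetical, the extension set mirrors the formats listed under Usage, and the size limit assumes the `maxUploadSize = 200` setting shown in the configuration. The decode check uses Pillow's `Image.verify()` and is skipped with a note if Pillow is not installed.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}
MAX_UPLOAD_MB = 200  # assumed to match maxUploadSize in .streamlit/config.toml


def check_image_file(path):
    """Return a list of problems found; an empty list means none were found."""
    path = Path(path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported extension: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("File does not exist")
        return problems
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > MAX_UPLOAD_MB:
        problems.append(f"File is {size_mb:.1f} MB, above the {MAX_UPLOAD_MB} MB limit")
    try:
        from PIL import Image  # Pillow

        with Image.open(path) as image:
            image.verify()  # raises if the file is truncated or not an image
    except ImportError:
        problems.append("Pillow is not installed, so the decode check was skipped")
    except Exception as exc:
        problems.append(f"Image could not be decoded: {exc}")
    return problems
```

A call such as `check_image_file("scan.png")` returns an empty list when no problem is detected. The extension check is only a quick filter, while `verify()` is what inspects the file contents.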

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Tesseract OCR engine
OpenCV for image processing
Streamlit for the web interface
Pillow (PIL) for image handling

Support

For support, please open an issue in the repository or contact the maintainers.
