A modular, Dockerized Telegram bot for OCR-based document text extraction and language identification.
Built with MMOCR (TextSnake), Tesseract, and FastText, the system combines computer vision (text detection and recognition) with natural language processing (language identification).
✅ Extracts text from user-submitted images
✅ Performs language detection using FastText (176 languages)
✅ Uses MMOCR for text detection and Tesseract for OCR
✅ Fully Dockerized with Miniconda
✅ Async Telegram bot built with Aiogram
✅ Easily extendable and open-source
ocr-documents-system/
├── bot/
│ ├── handler/
│ │ ├── commands.py # /start, /help, /list handlers
│ │ └── image.py # Image recognition handler
│ ├── config.py # Bot configuration (Telegram token)
│ ├── logger.py # Logging setup
│ └── main.py # Bot startup entry point
│
├── ocr_engine/
│ ├── classifier.py # FastText-based language classifier
│ ├── config.py # OCRConfig dataclass for paths
│ ├── engine.py # Main OCREngine pipeline (TextSnake + Tesseract)
│ ├── lang_map.py # Maps language codes to language names (in Russian)
│ └── utils.py # Helper functions & language list
│
├── data/
│ ├── models/ # FastText `.bin` model
│ └── tessdata/ # Tesseract language files
│
├── .env # Environment variables
├── Dockerfile # Conda-based Docker build
├── .dockerignore # Ignoring unnecessary local files
├── .gitignore # Standard exclusions + /data
└── README.md # You're reading it
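
A minimal sketch of how the `ocr_engine/engine.py` pipeline could be wired: TextSnake detection followed by Tesseract recognition on each detected region. The class interface, the `MMOCRInferencer`-based detection (MMOCR 1.x API), the crop-then-recognize loop, and the `eng+rus` language string are illustrative assumptions, not the repository's exact implementation.

```python
from PIL import Image
from mmocr.apis import MMOCRInferencer  # MMOCR 1.x inference API (assumption)
from tesserocr import PyTessBaseAPI


class OCREngine:
    """Illustrative pipeline: TextSnake finds text regions, Tesseract reads them."""

    def __init__(self, tessdata_path: str, lang: str = "eng+rus"):
        self.tessdata_path = tessdata_path
        self.lang = lang
        self.detector = MMOCRInferencer(det="TextSnake")

    def extract_text(self, image_path: str) -> str:
        image = Image.open(image_path).convert("RGB")
        # TextSnake returns flattened polygon coordinates for each detected region.
        prediction = self.detector(image_path)["predictions"][0]
        lines = []
        with PyTessBaseAPI(path=self.tessdata_path, lang=self.lang) as api:
            for polygon in prediction.get("det_polygons", []):
                xs, ys = polygon[0::2], polygon[1::2]
                # Crop the bounding box of the polygon and run Tesseract on it.
                crop = image.crop((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))))
                api.SetImage(crop)
                lines.append(api.GetUTF8Text().strip())
        return "\n".join(line for line in lines if line)
```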
Interact with the bot using the following commands:
/start
/help
/list
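
The command handlers in `bot/handler/commands.py` might look roughly like this, assuming aiogram 3.x routers (handler names and reply texts are placeholders):

```python
from aiogram import Router
from aiogram.filters import Command
from aiogram.types import Message

router = Router()


@router.message(Command("start"))
async def cmd_start(message: Message) -> None:
    await message.answer("Send me a document image and I will extract its text.")


@router.message(Command("help"))
async def cmd_help(message: Message) -> None:
    await message.answer("Commands: /start, /help, /list. Or just send a photo of a document.")


@router.message(Command("list"))
async def cmd_list(message: Message) -> None:
    # The real handler presumably reads the supported-language list from ocr_engine/utils.py
    await message.answer("Supported languages: ...")
```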
Create a `.env` file in the root directory with the following values:
BOT_TOKEN=your_telegram_bot_token
FASTTEXT_PATH=/ocr-documents-system/data/models/lid.176.bin
TESSDATA_PATH=/ocr-documents-system/data/tessdata
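
A sketch of how these variables might be consumed by `bot/config.py` and `ocr_engine/config.py`; the use of `python-dotenv` and the exact field names are assumptions.

```python
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # pull BOT_TOKEN, FASTTEXT_PATH, TESSDATA_PATH from .env


@dataclass(frozen=True)
class OCRConfig:
    fasttext_path: str = os.getenv("FASTTEXT_PATH", "/ocr-documents-system/data/models/lid.176.bin")
    tessdata_path: str = os.getenv("TESSDATA_PATH", "/ocr-documents-system/data/tessdata")


BOT_TOKEN = os.environ["BOT_TOKEN"]  # fail fast if the token is missing
```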
docker build -t ocr-bot .
docker run --env-file .env ocr-bot
- MMOCR (TextSnake)
- Tesseract OCR (tesserocr wrapper)
- FastText for language prediction
- Aiogram Telegram bot framework
- Docker + Miniconda for reproducible environments
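
For language prediction, `ocr_engine/classifier.py` presumably loads the FastText `lid.176.bin` model and maps its `__label__xx` output to a readable name. A hedged sketch; the class name and the `LANG_MAP` import are placeholders:

```python
import fasttext

from ocr_engine.lang_map import LANG_MAP  # hypothetical: code -> language name


class LanguageClassifier:
    def __init__(self, model_path: str):
        self.model = fasttext.load_model(model_path)

    def predict(self, text: str) -> tuple[str, float]:
        # FastText expects a single line of text.
        labels, probs = self.model.predict(text.replace("\n", " "), k=1)
        code = labels[0].replace("__label__", "")  # e.g. "__label__en" -> "en"
        return LANG_MAP.get(code, code), float(probs[0])
```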
Input: Scanned image of a document
Output:
- Extracted text
- Predicted language
- Inline message in Telegram
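
Putting the pieces together, an image handler such as `bot/handler/image.py` could download the photo, run OCR, classify the language, and reply. This sketch reuses the hypothetical names from the earlier snippets and assumes aiogram 3.x:

```python
import tempfile

from aiogram import Bot, F, Router
from aiogram.types import Message

from ocr_engine.classifier import LanguageClassifier  # names follow the sketches above
from ocr_engine.config import OCRConfig
from ocr_engine.engine import OCREngine

router = Router()

cfg = OCRConfig()
engine = OCREngine(cfg.tessdata_path)
classifier = LanguageClassifier(cfg.fasttext_path)


@router.message(F.photo)
async def handle_document_photo(message: Message, bot: Bot) -> None:
    # Download the highest-resolution version of the photo to a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        await bot.download(message.photo[-1], destination=tmp.name)
        text = engine.extract_text(tmp.name)
    language, confidence = classifier.predict(text)
    await message.answer(f"Language: {language} ({confidence:.0%})\n\n{text}")
```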
This project is licensed under the MIT License.
🐙 GitHub: AlekseyScorpi
📬 For questions or collaborations — feel free to reach out via GitHub Issues or Pull Requests.