A FastAPI-based service for extracting text from PDFs, images, and Microsoft Office documents. It uses native parsers where a document has a text layer and falls back to Tesseract OCR for images and scanned PDFs.
- PDF Text Extraction: Extract text from PDF files using native parsing and OCR fallback for scanned PDFs
- Image OCR: Extract text from images using Tesseract OCR engine
- Office Document Support: Extract text from Microsoft Office documents (Word, Excel, PowerPoint)
- Multiple Image Formats: Support for JPEG, PNG, GIF, WebP image formats
- Web UI: User-friendly drag-and-drop interface for easy document uploads
- File Size Limits: Configurable file size validation (default 50MB)
- Docker Ready: Containerized application with Docker and Docker Compose support
- Health Check: Built-in health monitoring endpoint
- REST API: Clean RESTful API with automatic documentation
- CORS Enabled: Cross-origin resource sharing for frontend integration
| Category | Formats | Method |
|---|---|---|
| PDF | .pdf | pdfplumber + OCR fallback |
| Images | .jpg, .jpeg, .png, .gif, .webp | Tesseract OCR |
| Office | .docx, .xlsx, .pptx | Native parsers |
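A client that handles mixed uploads needs to pick the right endpoint for each file. The following is a minimal sketch of that lookup, built only from the extensions in the table and the `/extract/pdf`, `/extract/image`, and `/extract/office` endpoints listed in this README. The names `ENDPOINTS` and `endpoint_for` are hypothetical and are not part of the service.

```python
from pathlib import Path

# Extension-to-endpoint mapping derived from the supported formats table.
ENDPOINTS = {
    ".pdf": "/extract/pdf",
    ".jpg": "/extract/image",
    ".jpeg": "/extract/image",
    ".png": "/extract/image",
    ".gif": "/extract/image",
    ".webp": "/extract/image",
    ".docx": "/extract/office",
    ".xlsx": "/extract/office",
    ".pptx": "/extract/office",
}


def endpoint_for(filename: str) -> str:
    """Return the extraction endpoint for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    try:
        return ENDPOINTS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Calling `endpoint_for("Report.PDF")` yields `/extract/pdf`; an unsupported extension raises `ValueError` before any request is sent.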
- Clone the repository:
    git clone <repository-url>
    cd document-text-extractor
- Start the service:
    docker-compose up -d
- Access the application:
  - Web UI: http://localhost:8000 - User-friendly interface for document uploads
  - API Documentation: http://localhost:8000/docs - Interactive API documentation
  - Health Check: http://localhost:8000/health - Service status
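The container may take a moment to start accepting requests after `docker-compose up -d`. The sketch below polls the health endpoint until it answers. It assumes only that `/health` returns HTTP 200 when the service is up; the response body is not documented here, so it is ignored. The function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_healthy(
    check: Callable[[], bool] = is_healthy,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A deployment script could call `wait_until_healthy()` and abort if it returns `False`. The `check` parameter is injectable so the retry loop can be exercised without a running service.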
- Install system dependencies:
    # Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install tesseract-ocr libtesseract-dev poppler-utils
    # macOS
    brew install tesseract poppler
    # Windows
    # Download and install Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
- Install Python dependencies:
    pip install -r requirements.txt
- Run the application:
    uvicorn app.main:app --host 0.0.0.0 --port 8000
Setup Run Configuration:
- Open Run → Edit Configurations...
- Click + → Python
- Configure as follows:
  - Name: FastAPI - Document Extractor
  - Module name: (select radio button) → uvicorn
  - Parameters: app.main:app --reload --host 0.0.0.0 --port 8000
  - Working directory: /path/to/document-text-extractor
  - Environment variables: MAX_FILE_SIZE_MB=50;TESSERACT_CMD=tesseract
- Click OK and run with the green play button
Alternative - Using Script:
Create run.py in the project root:
import uvicorn

if __name__ == "__main__":
    # Start the app with auto-reload, matching the terminal command
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
Then set Script path to run.py in the run configuration.
Once the service is running, visit:
- Interactive API Documentation: http://localhost:8000/docs
- Alternative Documentation: http://localhost:8000/redoc
The application includes a web interface for extracting text from documents. Open your browser and navigate to:
http://localhost:8000
- Drag-and-Drop Upload: Simply drag your file onto the upload zone or click to browse
- Supported Formats Display: Clear indication of all supported file formats
- Client-Side Validation: Instant feedback on file size and format before upload
- Progress Indicator: Visual feedback during text extraction
- Result Display: View extracted text in a textarea with character count
- Copy to Clipboard: One-click copy of extracted text
- Download as TXT: Save extracted text as a text file
- Specific Error Messages: Clear error feedback for troubleshooting
The Web UI supports the following file types:
- PDF: .pdf
- Office Documents: .docx, .xlsx, .pptx
- Images: .jpg, .jpeg, .png, .gif, .webp
- Default Maximum Size: 50 MB
- Configurable: Can be adjusted via environment variable (see Configuration section)
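A script that uploads files can apply the same limit before sending anything. The sketch below is a minimal illustration and not the service's own validation code. The function names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes, which this README does not specify.

```python
import os

DEFAULT_MAX_FILE_SIZE_MB = 50  # default limit stated in this README
BYTES_PER_MB = 1024 * 1024     # assumption: the README does not define "MB"


def within_limit(size_bytes: int, max_mb: float = DEFAULT_MAX_FILE_SIZE_MB) -> bool:
    """Return True if a payload of size_bytes fits under the limit."""
    return size_bytes <= max_mb * BYTES_PER_MB


def file_within_limit(path: str, max_mb: float = DEFAULT_MAX_FILE_SIZE_MB) -> bool:
    """Check a file on disk without reading its contents."""
    return within_limit(os.path.getsize(path), max_mb)
```

If the server was started with a different `MAX_FILE_SIZE_MB`, pass the same value as `max_mb` so the client and server agree.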
GET /health
Returns the service status.
POST /extract/pdf
Content-Type: multipart/form-data
file: <PDF_FILE>
POST /extract/image
Content-Type: multipart/form-data
file: <IMAGE_FILE>
POST /extract/office
Content-Type: multipart/form-data
file: <OFFICE_FILE>
Extract text from PDF:
curl -X POST "http://localhost:8000/extract/pdf" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
Extract text from image:
curl -X POST "http://localhost:8000/extract/image" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@image.png"
Extract text from Office document:
curl -X POST "http://localhost:8000/extract/office" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.docx"
import requests
# Extract text from PDF
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/extract/pdf',
        files={'file': f}
    )
response.raise_for_status()  # raise on 400/413/500 instead of parsing an error body
text = response.json()['text']
print(text)
document-text-extractor/
├── app/
│   └── main.py             # FastAPI application and endpoints
├── static/
│   ├── index.html          # Web UI interface
│   └── app.js              # Client-side JavaScript logic
├── docker-compose.yml      # Docker Compose configuration
├── Dockerfile              # Docker image definition
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
- FastAPI: Modern web framework for building APIs
- uvicorn: ASGI server for running FastAPI
- pytesseract: Python wrapper for Tesseract OCR
- Pillow: Image processing library
- pdfplumber: PDF text extraction
- python-multipart: File upload support
- python-docx: Word document processing
- openpyxl: Excel file processing
- python-pptx: PowerPoint presentation processing
- TESSERACT_CMD: Path to Tesseract executable (default: "tesseract")
- MAX_FILE_SIZE_MB: Maximum allowed file size in megabytes (default: "50")
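To make the defaults concrete, here is one way these two variables could be read and validated at startup. This is a sketch under stated assumptions and not the code in `app/main.py`: the `Settings` class and `load_settings` function are hypothetical, and rejecting non-positive sizes is a design choice of this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    tesseract_cmd: str
    max_file_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read both variables, applying the defaults documented in this README."""
    raw_size = env.get("MAX_FILE_SIZE_MB", "50")
    try:
        size = int(raw_size)
    except ValueError:
        raise ValueError(f"MAX_FILE_SIZE_MB must be an integer, got {raw_size!r}") from None
    if size <= 0:
        raise ValueError("MAX_FILE_SIZE_MB must be positive")
    return Settings(
        tesseract_cmd=env.get("TESSERACT_CMD", "tesseract"),
        max_file_size_mb=size,
    )
```

Failing fast on a malformed value surfaces a configuration mistake at startup rather than on the first upload.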
To change the maximum file size limit, modify the docker-compose.yml file:
environment:
  TESSERACT_CMD: "/usr/bin/tesseract"
  MAX_FILE_SIZE_MB: "100"  # Change to desired size in MB
Or set it when running manually:
export MAX_FILE_SIZE_MB=100
uvicorn app.main:app --host 0.0.0.0 --port 8000
The application is configured to run on port 8000. You can modify the port mapping in docker-compose.yml if needed.
Volume Mounts:
- ./app:/app/app - Application code
- ./static:/app/static - Web UI files
The API returns the following HTTP status codes on failure:
- 400 Bad Request: Invalid file format or corrupted files
- 413 Payload Too Large: File size exceeds the configured maximum limit
- 500 Internal Server Error: Processing errors with detailed error messages
The Web UI provides user-friendly error messages for all error scenarios.
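A script that calls the API directly can translate these status codes in a similar way. The sketch below is illustrative only. The function name and message wording are hypothetical, and it assumes error responses carry a JSON object with a `detail` field, which is FastAPI's default shape for `HTTPException` but is not confirmed for this service.

```python
from typing import Optional

# Summaries paraphrase the status codes documented in this README.
MESSAGES = {
    400: "Invalid file format or corrupted file",
    413: "File exceeds the configured maximum size",
    500: "The service failed while processing the file",
}


def describe_failure(status_code: int, payload: Optional[object] = None) -> str:
    """Build a readable message from a status code and an optional JSON body."""
    summary = MESSAGES.get(status_code, f"Unexpected HTTP status {status_code}")
    # Assumption: the body, if present, is a dict with a "detail" entry.
    detail = payload.get("detail") if isinstance(payload, dict) else None
    return f"{summary}: {detail}" if detail else summary
```

With `requests`, this could be called as `describe_failure(response.status_code, response.json())` once the response is known to contain JSON.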
- PDF Processing: Large PDFs with many images may take longer due to OCR processing
- Image Quality: Higher resolution images provide better OCR accuracy
- Memory Usage: Processing large files can require substantial memory, so size the host or container accordingly
Terminal:
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
PyCharm:
- Use the run configuration setup described in the "Running from PyCharm" section
- Ensure your Python interpreter has all dependencies from requirements.txt installed
- The --reload flag enables auto-reload on code changes
To add support for new file formats:
- Add the required libraries to requirements.txt
- Add extraction logic in main.py
- Update the API endpoint to handle the new format
- Update this README with the new supported format
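One way to keep the extraction step easy to extend is a small registry that maps file extensions to extractor functions. The sketch below is hypothetical throughout and does not describe how `main.py` is organized today. The plain-text handler stands in for a new format purely as an example.

```python
from pathlib import Path
from typing import Callable, Dict

Extractor = Callable[[bytes], str]

# Hypothetical registry: extension (lowercase, with dot) -> extractor function.
EXTRACTORS: Dict[str, Extractor] = {}


def register(*extensions: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers a function for one or more extensions."""
    def decorator(func: Extractor) -> Extractor:
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data: bytes) -> str:
    """Example handler for a new format: decode UTF-8, replacing bad bytes."""
    return data.decode("utf-8", errors="replace")


def extract(filename: str, data: bytes) -> str:
    """Dispatch to the registered extractor for the file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return EXTRACTORS[suffix](data)
```

With this layout, supporting another format means writing one decorated function; the dispatch code stays unchanged.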
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is part of the Berijalan Bootcamp Techno 14 program.
Tesseract not found:
- Ensure Tesseract is installed and accessible in your system PATH
- Set the TESSERACT_CMD environment variable to the correct path
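A quick way to check both points is to ask Python where the configured command resolves. This is a diagnostic sketch, not part of the service, and `resolve_tesseract` is a hypothetical name. It relies on `shutil.which`, which searches PATH for a bare command name and checks the file directly when given a path.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tesseract(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the resolved Tesseract executable path, or None if not found."""
    command = env.get("TESSERACT_CMD", "tesseract")  # default from this README
    return shutil.which(command)
```

If `resolve_tesseract()` returns `None`, the executable is neither on PATH nor at the location given in `TESSERACT_CMD`.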
Memory errors with large files:
- Increase Docker memory limits if using containers
- Consider lowering MAX_FILE_SIZE_MB so that files too large for your deployment are rejected up front
Poor OCR accuracy:
- Ensure images have sufficient resolution (300+ DPI recommended)
- Preprocess images to improve contrast and clarity