DocuSort is an intelligent automation system that classifies and summarizes scanned documents such as customer correspondence in PDF or TIF formats. Leveraging OCR, NLP, and summarization pipelines, it streamlines customer service workflows by identifying complaints, appeals, or ambiguous content requiring manual review.
- OCR Support: Extracts text from both text-based and scanned PDFs/TIFs using Tesseract and PyMuPDF.
- Auto Classification: Categorizes documents into `Complaint`, `Appeal`, or `Manual` based on customizable keyword sets.
- AI Summarization: Generates concise summaries of long documents using Hugging Face's `bart-large-cnn` model.
- Excel Report Generation: Compiles all results in a styled, timestamped Excel sheet.
- Multithreading Support: Processes multiple files concurrently to speed up batch runs.
- File Archival: Automatically organizes processed files into categorized archive folders.
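
The keyword-based classification can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `classify` and the tie-breaking rule (ties and zero matches go to `Manual`) are assumptions.

```python
import re


def classify(text, complaint_keywords, appeal_keywords):
    """Return 'Complaint', 'Appeal', or 'Manual' for the given text."""
    lowered = text.lower()

    def hits(keywords):
        # Count keywords that occur as whole words or phrases (case-insensitive).
        return sum(
            1
            for kw in keywords
            if kw.strip()
            and re.search(r"\b" + re.escape(kw.strip().lower()) + r"\b", lowered)
        )

    complaint_hits = hits(complaint_keywords)
    appeal_hits = hits(appeal_keywords)

    # Assumption: a clear winner decides the category; ties (including
    # zero matches on both sides) are routed to manual review.
    if complaint_hits > appeal_hits:
        return "Complaint"
    if appeal_hits > complaint_hits:
        return "Appeal"
    return "Manual"
```

Counting matches instead of stopping at the first hit lets a document that mentions both topics fall back to `Manual` when neither side dominates.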
- Python 3.9+
- OCR: PyMuPDF (`fitz`), Tesseract, PIL, `pdf2image`
- NLP: HuggingFace Transformers (BART model)
- Excel Styling: openpyxl
- Concurrency: ThreadPoolExecutor + Locks
- Data Handling: pandas
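
As an illustration of the `ThreadPoolExecutor` + lock combination listed above, here is a small sketch. The names `process_all` and `process_one` and the default worker count are hypothetical; the real script may structure its workers differently.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def process_all(paths, process_one, max_workers=4):
    """Run process_one(path) for every path on a thread pool.

    Returns a list of (path, outcome) tuples sorted by path.
    """
    results = []
    lock = Lock()

    def task(path):
        outcome = process_one(path)
        # The lock guards the shared results list while workers append to it.
        with lock:
            results.append((path, outcome))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any worker exception is raised here.
        list(pool.map(task, paths))

    return sorted(results, key=lambda item: item[0])
```

Sorting at the end gives a stable order for the report regardless of which worker finishes first.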
- Place input files: Drop `.pdf` or `.tif` files into the `/input` folder.
- Run the script: Execute `main()` from the script to start processing.
- Get results:
  - Extracted text saved in `/extracted_text/<Category>/`
  - Original files archived in `/archive/<Category>/`
  - Classification report saved as an Excel file in `/classification_reports/`
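
The routing of outputs into per-category folders might look something like the sketch below. The helper name `store_result`, the `.txt` naming of the extracted text, and the `base_dir` parameter are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def store_result(src, text, category, base_dir="."):
    """Save extracted text and move the original file into category folders.

    Returns (text_path, archived_path).
    """
    base = Path(base_dir)
    src = Path(src)

    text_dir = base / "extracted_text" / category
    archive_dir = base / "archive" / category
    text_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the text file reuses the original file's stem.
    text_path = text_dir / (src.stem + ".txt")
    text_path.write_text(text, encoding="utf-8")

    # Move (not copy) the original so the input folder empties as files finish.
    archived_path = archive_dir / src.name
    shutil.move(str(src), str(archived_path))
    return text_path, archived_path
```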
input/
archive/
├── Complaint/
├── Appeal/
└── Manual/
extracted_text/
├── Complaint/
├── Appeal/
└── Manual/
classification_reports/
keywords/
├── complaint_keywords.txt
└── appeal_keywords.txt
Each row in the Excel report includes:
- Serial Number
- File Name
- Processed Timestamp
- Classification Result
- AI-Generated Summary
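
A rough sketch of how rows with these fields could be assembled and written out is shown below. The column labels, the timestamp formats, the report file name, and the helper names `build_rows` and `write_report` are all assumptions; styling with openpyxl is left out for brevity.

```python
from datetime import datetime
from pathlib import Path

# Assumed column labels, taken from the field list above.
COLUMNS = [
    "Serial Number",
    "File Name",
    "Processed Timestamp",
    "Classification Result",
    "AI-Generated Summary",
]


def build_rows(results):
    """Turn (file_name, processed_at, category, summary) tuples into row dicts."""
    rows = []
    for serial, (name, processed_at, category, summary) in enumerate(results, start=1):
        stamp = processed_at.strftime("%Y-%m-%d %H:%M:%S")
        rows.append(dict(zip(COLUMNS, [serial, name, stamp, category, summary])))
    return rows


def write_report(rows, out_dir="classification_reports"):
    """Write rows to a timestamped .xlsx file and return its path."""
    # Imported lazily: requires pandas and openpyxl to be installed.
    import pandas as pd

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file name pattern.
    path = out / f"classification_report_{datetime.now():%Y%m%d_%H%M%S}.xlsx"
    pd.DataFrame(rows, columns=COLUMNS).to_excel(path, index=False)
    return path
```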
- Add new keywords to the `/keywords/complaint_keywords.txt` or `/keywords/appeal_keywords.txt` files to tailor classification.
- Plug in other transformer models if domain-specific summarization is required.
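
Both customization points can be sketched as below. This assumes one keyword per line in the keyword files, which the project may or may not follow; `load_keywords` and `make_summarizer` are hypothetical helper names. `facebook/bart-large-cnn` is the Hugging Face Hub id of the BART model mentioned above, and any other summarization checkpoint can be passed instead.

```python
from pathlib import Path


def load_keywords(path):
    """Read keywords from a text file, assuming one keyword or phrase per line.

    Blank lines are skipped; entries are lowercased and de-duplicated
    while keeping their first-seen order.
    """
    seen = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        keyword = line.strip().lower()
        if keyword:
            seen.setdefault(keyword, None)
    return list(seen)


def make_summarizer(model_name="facebook/bart-large-cnn"):
    """Build a summarization pipeline for the given Hub model id."""
    # Imported lazily: requires the transformers package and downloads
    # the model weights on first use.
    from transformers import pipeline

    return pipeline("summarization", model=model_name)
```

Swapping models then only means calling `make_summarizer` with a different model id, while the rest of the processing stays unchanged.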
- Prioritize sensitive customer communication
- Automate triaging of service requests
- Digitize and analyze scanned records with high efficiency
Built with ❤️ to make document handling intelligent, fast, and hassle-free.