A scalable document processing system designed to classify customer correspondence into "Complaint", "Appeal", or "Manual Review" using OCR, keyword detection, and text summarization via transformer models.

Swaroop-Acharya/DocuSort

🤖 DocuSort – Smart Document Classifier & Summarizer

DocuSort is an intelligent automation system that classifies and summarizes scanned documents such as customer correspondence in PDF or TIF formats. Leveraging OCR, NLP, and summarization pipelines, it streamlines customer service workflows by identifying complaints, appeals, or ambiguous content requiring manual review.


🧠 Features

  • 🧾 OCR Support: Extracts text from both text-based and scanned PDFs/TIFs using Tesseract and PyMuPDF.
  • 🏷️ Auto Classification: Categorizes documents into Complaint, Appeal, or Manual based on customizable keyword sets.
  • 🧠 AI Summarization: Generates concise summaries of long documents using Hugging Face’s bart-large-cnn model.
  • 📊 Excel Report Generation: Compiles all results in a styled, timestamped Excel sheet.
  • 🔁 Multithreading Support: Processes multiple files concurrently to improve throughput.
  • 📂 File Archival: Automatically organizes processed files into categorized archive folders.

🛠️ Tech Stack

  • Python 3.9+
  • OCR: PyMuPDF (fitz), Tesseract, PIL, pdf2image
  • NLP: HuggingFace Transformers (BART model)
  • Excel Styling: openpyxl
  • Concurrency: ThreadPoolExecutor + Locks
  • Data Handling: pandas
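
The ThreadPoolExecutor + Locks combination listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: `process_file`, `process_all`, `results`, and `max_workers=4` are hypothetical names and values, and the worker body is a placeholder for the real OCR, classification, and summarization steps.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

results = []           # shared list of (file_name, category) rows
results_lock = Lock()  # guards appends to the shared list


def process_file(file_name):
    """Hypothetical worker: stands in for OCR + classification + summarization."""
    category = "Manual"  # placeholder result, not the project's real logic
    with results_lock:
        results.append((file_name, category))
    return file_name


def process_all(file_names, max_workers=4):
    """Run process_file over all inputs on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every task and re-raises any worker exception
        return list(pool.map(process_file, file_names))
```

Threads suit this kind of workload when most of the time is spent waiting on file I/O or on native libraries; the lock only needs to cover the short section that touches shared state.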

🚀 How to Use

  1. 📥 Place input files: Drop .pdf or .tif files into the /input folder.
  2. ▶️ Run the script: Execute main() from the script to start processing.
  3. 📊 Get results:
    • Extracted text saved in /extracted_text/<Category>/
    • Original files archived in /archive/<Category>/
    • Classification report saved as an Excel file in /classification_reports/

📁 Directory Structure

📂 input/
📂 archive/
    ├── Complaint/
    ├── Appeal/
    └── Manual/
📂 extracted_text/
    ├── Complaint/
    ├── Appeal/
    └── Manual/
📂 classification_reports/
📂 keywords/
    ├── complaint_keywords.txt
    └── appeal_keywords.txt
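
If the folders do not exist yet, a small helper along these lines could create the layout shown in the tree. `create_layout` and `CATEGORIES` are hypothetical names; whether the project creates these folders itself is not stated in this README.

```python
from pathlib import Path

CATEGORIES = ("Complaint", "Appeal", "Manual")


def create_layout(root="."):
    """Create the folders from the tree; existing folders are left untouched."""
    root = Path(root)
    folders = [root / "input", root / "classification_reports", root / "keywords"]
    for parent in ("archive", "extracted_text"):
        folders.extend(root / parent / category for category in CATEGORIES)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

The two keyword files are not created here; they still need to be added to keywords/ by hand.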

🧪 Sample Output

Each row in the Excel report includes:

  • Serial Number
  • File Name
  • Processed Timestamp
  • Classification Result
  • AI-Generated Summary
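
As a rough sketch, rows with these five columns could be assembled and written as shown below. The function names, the timestamp format, and the `report_<timestamp>.xlsx` file name are assumptions for illustration; the cell styling the project applies with openpyxl is not reproduced here.

```python
from datetime import datetime

COLUMNS = [
    "Serial Number",
    "File Name",
    "Processed Timestamp",
    "Classification Result",
    "AI-Generated Summary",
]


def build_rows(results):
    """Turn (file_name, processed_at, category, summary) tuples into report rows."""
    rows = []
    for serial, (file_name, processed_at, category, summary) in enumerate(results, start=1):
        stamp = processed_at.strftime("%Y-%m-%d %H:%M:%S")  # assumed format
        rows.append(dict(zip(COLUMNS, (serial, file_name, stamp, category, summary))))
    return rows


def write_report(rows, out_dir="classification_reports"):
    """Write rows to a timestamped .xlsx file (requires pandas and openpyxl)."""
    import pandas as pd  # imported here so build_rows works without pandas

    name = f"report_{datetime.now():%Y%m%d_%H%M%S}.xlsx"  # hypothetical naming scheme
    path = f"{out_dir}/{name}"
    pd.DataFrame(rows, columns=COLUMNS).to_excel(path, index=False)
    return path
```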

🛡️ Customization

  • Add new keywords to the /keywords/complaint_keywords.txt or /keywords/appeal_keywords.txt files to tailor classification.
  • Plug in other transformer models if domain-specific summarization is required.
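
To show how edits to the keyword files could feed into classification, here is a minimal sketch. It assumes one keyword or phrase per line, case-insensitive substring matching, and "Manual" as the fallback when no keyword matches or when both sets match; none of these rules are confirmed by this README, and `load_keywords` and `classify` are hypothetical names.

```python
from pathlib import Path


def load_keywords(path):
    """Read one keyword or phrase per line, lowercased; blank lines are skipped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def classify(text, complaint_keywords, appeal_keywords):
    """Return 'Complaint', 'Appeal', or 'Manual' based on keyword hits in text."""
    lowered = text.lower()
    is_complaint = any(kw in lowered for kw in complaint_keywords)
    is_appeal = any(kw in lowered for kw in appeal_keywords)
    if is_complaint and not is_appeal:
        return "Complaint"
    if is_appeal and not is_complaint:
        return "Appeal"
    return "Manual"  # no hits, or hits in both sets (assumed tie-break)
```

Because this sketch matches substrings, a short keyword such as "appeal" would also match "unappealing"; longer phrases in the keyword files reduce that kind of false hit.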

🧬 Use Cases

  • Prioritize sensitive customer communication
  • Automate triaging of service requests
  • Digitize and analyze scanned records with high efficiency

Built with ❤️ to make document handling intelligent, fast, and hassle-free.
