pushkar2510/Doc_Sum

PDF and Image Text Analyzer

This application allows users to upload PDF files or images, extract text, generate summaries, break text into paragraphs, and create questions/answers for selected paragraphs.

Features

  • PDF and image text extraction
  • Text summarization using T5 model
  • Paragraph segmentation
  • Question and answer generation
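As an illustration of the paragraph segmentation feature, here is a minimal sketch that splits extracted text on blank lines. The helper name `split_paragraphs` and the `min_chars` parameter are hypothetical and not taken from this repository; the app's actual segmentation logic may differ.

```python
import re
from typing import List


def split_paragraphs(text: str, min_chars: int = 1) -> List[str]:
    """Split text into paragraphs on blank lines (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", normalized)
    paragraphs = []
    for block in blocks:
        # Collapse wrapped lines and repeated spaces into single spaces.
        cleaned = " ".join(block.split())
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n \nThird."
paragraphs = split_paragraphs(sample)
# Number paragraphs from 1, matching the "paragraph number" used in the UI.
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```

Numbering from 1 keeps the output aligned with how a user would pick a paragraph for Q&A generation.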

Requirements

  • Python 3.7+
  • Tesseract OCR must be installed for image text extraction
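Both requirements can be checked before launching the app. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the Tesseract executable is named `tesseract` and is expected on your PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the requirements


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def tesseract_path():
    """Return the path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


print("Python version OK:", python_ok())
location = tesseract_path()
if location is None:
    print("Tesseract not found on PATH; image text extraction will not work.")
else:
    print("Tesseract found at:", location)
```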

Installation

  1. Clone this repository or download the files
  2. Install the required dependencies:
     pip install -r requirements.txt
  3. Install Tesseract OCR for your operating system

Usage

  1. Run the Streamlit app:
     streamlit run app.py
  2. Upload a PDF or image file
  3. View the extracted text and summary
  4. Explore the paragraphs
  5. Select a paragraph number and click "Generate Q&A" to create questions and answers based on that paragraph
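The upload and paragraph selection steps can be sketched as two small helpers: one that decides whether an uploaded file should go through PDF extraction or OCR, and one that looks up a paragraph by its 1-based number. The function names and the list of image extensions are assumptions for illustration, not code from app.py.

```python
from pathlib import Path
from typing import List

PDF_EXTENSIONS = {".pdf"}
# Assumed set of image types; the app may accept a different list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def classify_upload(filename: str) -> str:
    """Return "pdf" or "image" based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


def select_paragraph(paragraphs: List[str], number: int) -> str:
    """Return the paragraph for a 1-based number, as a user would enter it."""
    if not 1 <= number <= len(paragraphs):
        raise IndexError(f"Paragraph number must be between 1 and {len(paragraphs)}")
    return paragraphs[number - 1]


print(classify_upload("Report.PDF"))
print(select_paragraph(["Intro text.", "Method text."], 2))
```

Validating the paragraph number up front avoids passing an empty or wrong paragraph to the Q&A step.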

Important Notes

  • For large files, processing may take some time
  • The quality of text extraction from images depends on the clarity of the image
  • For best results, use PDFs that contain selectable text rather than scanned page images
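One reason large files take longer is that long text usually has to be summarized piece by piece, since T5 checkpoints are typically run with a bounded input length. The sketch below shows one way to do that. The helper names, the 300-word chunk size, and the `t5-small` checkpoint are assumptions; the README does not say which T5 checkpoint or chunking strategy the app uses. It also assumes an installed Hugging Face transformers version that still provides the `summarization` pipeline.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(text: str, model_name: str = "t5-small") -> str:
    """Summarize each chunk separately and join the partial summaries."""
    # Imported lazily so chunk_words works without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)
    parts = []
    for chunk in chunk_words(text):
        result = summarizer(chunk, max_length=80, min_length=10, do_sample=False)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)


# Chunking alone needs no model download:
chunks = chunk_words("word " * 650)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Splitting on words is only a rough proxy for the model's token limit; a tokenizer-based split would be more precise at the cost of extra setup.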

About

Document Summarizer

Languages

  • Python 42.4%
  • HTML 33.8%
  • JavaScript 21.4%
  • CSS 2.4%