PDF Summarizer

pdf_summarizer.py is a Python application that extracts text from PDF files, summarizes the content using a pre-trained BART model, and provides a GUI for easy interaction. The GUI allows users to select a PDF file, view the extracted text, clear the text, and save the summary.
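
As a rough illustration of the extraction step, the sketch below reads each page's text layer with PyMuPDF and falls back to Tesseract OCR when a page has almost no text. It is not taken from pdf_summarizer.py: the function names, the 20-character threshold, and the 2x render zoom are assumptions made for the example, and the actual script may instead run OCR on individual embedded images.

```python
import io


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): a page with almost no text layer is probably a scan."""
    return len(page_text.strip()) < min_chars


def extract_text(pdf_path: str) -> str:
    """Return the text of every page, using OCR for pages without a usable text layer."""
    # Third-party imports are kept local so needs_ocr() works without them installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page at 2x zoom (about 144 dpi) and let Tesseract read it.
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            parts.append(text.strip())
    return "\n\n".join(parts)
```

Checking the text layer first keeps OCR, which is slow, for pages that actually need it.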

Features

- Extracts text from PDFs, using Tesseract OCR for text contained in images.
- Summarizes the extracted text with a pre-trained BART model.
- Provides a GUI for selecting PDFs, viewing extracted text, and saving summaries.
- Uses the GPU for faster processing when one is available.
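
The summarization and GPU features could be combined roughly as follows. This is a minimal sketch, not the script's actual code: the README does not say which BART checkpoint is used, so `facebook/bart-large-cnn` is an assumption, as are the helper names, the 400-word chunk size, and the length limits. Chunking is included because BART checkpoints of this kind accept a limited input length (1024 tokens for `facebook/bart-large-cnn`), so long PDFs have to be summarized piece by piece.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def pick_device() -> int:
    """Return 0 (first GPU) if CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def summarize(text: str, model_name: str = "facebook/bart-large-cnn") -> str:
    """Summarize text chunk by chunk and join the partial summaries."""
    # Imported lazily; the model weights are downloaded on first use.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name, device=pick_device())
    pieces = []
    for chunk in chunk_words(text):
        result = summarizer(chunk, max_length=130, min_length=30,
                            do_sample=False, truncation=True)
        pieces.append(result[0]["summary_text"])
    return " ".join(pieces)
```

Word-based chunking is only an approximation of the token limit, which is why `truncation=True` is passed as a safety net.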

Installation

Prerequisites

Python 3.6 or higher
Tesseract OCR

Step 1: Install Tesseract OCR

Download the Tesseract installer from the Tesseract at UB Mannheim GitHub page.
Run the installer and follow the prompts. By default, Tesseract will be installed in C:\Program Files\Tesseract-OCR.
Ensure that the Tesseract installation directory is added to your system PATH.
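
To confirm that the PATH change took effect, a small check like the one below can be run from a new terminal. It is a sketch using only the standard library; the function name `tesseract_version` is made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Older builds print the version to stderr, so both streams are merged.
    result = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    output = result.stdout.strip()
    return output.splitlines()[0] if output else None
```

Calling `print(tesseract_version())` prints `None` when the executable cannot be found, which usually means the installation directory is missing from PATH or the terminal was opened before PATH was changed.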

Step 2: Clone the Repository and Install Dependencies

Clone the repository:

```
git clone https://github.com/yourusername/pdf_summarizer.git
cd pdf_summarizer
```

Create a virtual environment (optional but recommended):

```
python -m venv myenv
myenv\Scripts\activate  # On Windows
```

Install the required Python packages:

```
pip install torch transformers pytesseract PyMuPDF Pillow
```

Step 3: Create Configuration File

Create a config.json file in the project directory with the following content:
```
{
    "tesseract_cmd": "C:\\Users\\AnonZanon\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"
}
```
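Set the path to wherever tesseract.exe is installed on your machine. With the default installer location from Step 1 this would be C:\Program Files\Tesseract-OCR\tesseract.exe, written in JSON with doubled backslashes as in the example above.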
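
For reference, a script could read this file along the following lines. This is a hedged sketch rather than the code in pdf_summarizer.py: the function names and the fallback to PATH are assumptions. Only the `tesseract_cmd` key comes from the configuration shown above, and `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract uses to locate the executable.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def load_tesseract_cmd(config_path: str = "config.json") -> Optional[str]:
    """Return the Tesseract path from config.json, else whatever is on PATH, else None."""
    path = Path(config_path)
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            cmd = json.load(fh).get("tesseract_cmd")
        if cmd:
            return cmd
    return shutil.which("tesseract")


def configure_pytesseract(config_path: str = "config.json") -> None:
    """Point pytesseract at the configured executable, if one was found."""
    # Imported lazily so load_tesseract_cmd() has no third-party dependency.
    import pytesseract

    cmd = load_tesseract_cmd(config_path)
    if cmd:
        pytesseract.pytesseract.tesseract_cmd = cmd
```

Falling back to PATH means the script can still run on machines where config.json has not been created.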

Step 4: Add Configuration File to `.gitignore`

Add config.json to your .gitignore file to ensure it is not tracked by Git:

```
config.json
```

Usage

Run the Python script:
```
python pdf_summarizer.py
```
The GUI will open with the title "LZAKE's TEXT SUMMARIZER".
Use the buttons in the GUI to:
- Select PDF: Open a file dialog to select a PDF file. The extracted text and summary will be displayed in the text widgets.
- CLEAR: Clear the text in the output and summary widgets.
- SAVE: Save the summary text to a file.

GUI Overview

- Select PDF: Opens a file dialog to select a PDF file for processing.
- CLEAR: Clears the text in the output and summary widgets.
- SAVE: Saves the summary text to a file.
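
The three buttons could be wired up roughly as shown below. The README does not name the GUI toolkit, so Tkinter is an assumption here, as are the function and widget names. The Select PDF handler is a placeholder that only records the chosen path instead of running extraction and summarization.

```python
def save_summary(summary: str, path: str) -> None:
    """Write the summary text to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(summary)


def build_gui():
    """Create a window with Select PDF, CLEAR, and SAVE buttons (Tkinter is an assumption)."""
    import tkinter as tk
    from tkinter import filedialog, scrolledtext

    root = tk.Tk()
    root.title("LZAKE's TEXT SUMMARIZER")
    output_box = scrolledtext.ScrolledText(root, height=15)
    summary_box = scrolledtext.ScrolledText(root, height=8)

    def select_pdf():
        path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])
        if path:
            # Placeholder: the real script would extract and summarize the PDF here.
            output_box.insert(tk.END, "Selected: " + path + "\n")

    def clear():
        output_box.delete("1.0", tk.END)
        summary_box.delete("1.0", tk.END)

    def save():
        path = filedialog.asksaveasfilename(defaultextension=".txt")
        if path:
            save_summary(summary_box.get("1.0", tk.END).strip(), path)

    for label, command in (("Select PDF", select_pdf), ("CLEAR", clear), ("SAVE", save)):
        tk.Button(root, text=label, command=command).pack(fill=tk.X)
    output_box.pack(fill=tk.BOTH, expand=True)
    summary_box.pack(fill=tk.BOTH, expand=True)
    return root
```

Calling `build_gui().mainloop()` opens the window. Keeping `save_summary` separate from the widgets makes the file-writing logic easy to test without a display.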

Example

The screenshot example.png in the repository shows what the GUI looks like when the script is running.

Troubleshooting

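- Ensure that Tesseract OCR is installed and that its path is set correctly in the config.json file.
- Verify that the Python environment you are running the script from (for example, the activated virtual environment) has all the required packages installed: torch, transformers, pytesseract, PyMuPDF, and Pillow.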
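
One quick way to check the packages is to ask Python whether each module can be found. The sketch below uses only the standard library; the `REQUIRED` mapping and the function name are assumptions for illustration. Note that PyMuPDF is imported as `fitz` and Pillow as `PIL`, so the pip name and the module name differ.

```python
import importlib.util
from typing import Dict, List

# pip package name -> module name to import (these differ for PyMuPDF and Pillow)
REQUIRED: Dict[str, str] = {
    "torch": "torch",
    "transformers": "transformers",
    "pytesseract": "pytesseract",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names of packages whose modules cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]
```

Running `print(missing_packages())` inside the activated environment lists anything that still needs to be installed with pip.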

License

This project is licensed under the MIT License. See the LICENSE file for details.
