pdf_summarizer.py is a Python application that extracts text from PDF files, summarizes the content using a pre-trained BART model, and provides a GUI for easy interaction. The GUI allows users to select a PDF file, view the extracted text, clear the text, and save the summary.
- Extracts text from PDFs, using OCR to read text contained in images.
- Summarizes extracted text using a pre-trained BART model.
- GUI for selecting PDFs, viewing extracted text, and saving summaries.
- Utilizes GPU for faster processing if available.
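The features above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `pick_device`, `extract_text`, and `summarize` are hypothetical, the checkpoint `facebook/bart-large-cnn` is an assumed choice of pre-trained BART model, and falling back to OCR only for pages without a text layer is one possible reading of "using OCR". The third-party imports sit inside the functions so the snippet can be loaded even before the packages are installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: fall back to CPU so this sketch still runs without torch.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def extract_text(pdf_path):
    """Extract text page by page, using OCR for pages with no text layer."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if not text.strip():
                # Render the page to an RGB image and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(image)
            parts.append(text)
    return "\n".join(parts)


def summarize(text, model_name="facebook/bart-large-cnn"):
    """Summarize text with a BART checkpoint (the model name is an assumption)."""
    from transformers import pipeline

    # The pipeline accepts a GPU index, or -1 for CPU.
    device = 0 if pick_device() == "cuda" else -1
    summarizer = pipeline("summarization", model=model_name, device=device)
    result = summarizer(
        text,
        max_length=130,
        min_length=30,
        do_sample=False,
        truncation=True,  # cut input that exceeds the model's maximum length
    )
    return result[0]["summary_text"]
```

With the packages installed, `summarize(extract_text("report.pdf"))` would return a summary string, where `report.pdf` is a placeholder file name. Very long documents would likely need to be split into chunks first, since truncation simply discards the text beyond the model's input limit.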
- Python 3.6 or higher
- Tesseract OCR
  - Download the Tesseract installer from the Tesseract at UB Mannheim GitHub page.
  - Run the installer and follow the prompts. By default, Tesseract will be installed in C:\Program Files\Tesseract-OCR.
  - Ensure that the Tesseract installation directory is added to your system PATH.
- Clone the repository:
  git clone https://github.com/yourusername/pdf_summarizer.git
  cd pdf_summarizer
- Create a virtual environment (optional but recommended):
  python -m venv myenv
  myenv\Scripts\activate # On Windows
- Install the required Python packages:
  pip install torch transformers pytesseract PyMuPDF Pillow
- Create a config.json file in the project directory with the following content:
  { "tesseract_cmd": "C:\\Users\\AnonZanon\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe" }
  Adjust the path to Tesseract OCR if necessary.
- Add config.json to your .gitignore file to ensure it is not tracked by Git:
  config.json
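As a rough illustration of how a script might consume this configuration file, the sketch below reads `tesseract_cmd` from config.json and hands it to pytesseract. The helper names `load_tesseract_cmd` and `configure_tesseract` are hypothetical and not taken from pdf_summarizer.py. Setting `pytesseract.pytesseract.tesseract_cmd` is pytesseract's documented way to point at a specific executable.

```python
import json
from pathlib import Path


def load_tesseract_cmd(config_path="config.json"):
    """Return the configured Tesseract path, or None if it is not set."""
    path = Path(config_path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    return config.get("tesseract_cmd")


def configure_tesseract(config_path="config.json"):
    """Point pytesseract at the configured executable, if one is given."""
    cmd = load_tesseract_cmd(config_path)
    if cmd:
        import pytesseract  # imported lazily so the loader works on its own

        pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Returning `None` when the file or key is missing leaves pytesseract free to fall back to whatever `tesseract` it finds on the system PATH.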
- Run the Python script:
  python pdf_summarizer.py
- The GUI will open with the title "LZAKE's TEXT SUMMARIZER".
- Use the buttons in the GUI to:
  - Select PDF: Open a file dialog to select a PDF file. The extracted text and summary will be displayed in the text widgets.
  - CLEAR: Clear the text in the output and summary widgets.
  - SAVE: Save the summary text to a file.
- Select PDF: Opens a file dialog to select a PDF file for processing.
- CLEAR: Clears the text in the output and summary widgets.
- SAVE: Saves the summary text to a file.
Below is an example of what the GUI looks like when running the script:
- Ensure that Tesseract OCR is installed and its path is correctly set in the config.json file.
- Verify that your Python environment has all the required packages installed.
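Both checks can be automated with a few lines of standard-library Python. The sketch below is only a convenience and is not part of the project: `check_environment` is a hypothetical helper, it looks for `tesseract` on the system PATH rather than reading config.json, and it maps each pip package to the module name it is usually imported under (`fitz` for PyMuPDF, `PIL` for Pillow).

```python
import importlib.util
import shutil

# pip package name -> module name used at import time
REQUIRED_MODULES = {
    "torch": "torch",
    "transformers": "transformers",
    "pytesseract": "pytesseract",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def check_environment():
    """Return a dict mapping each requirement to True (found) or False."""
    report = {"tesseract": shutil.which("tesseract") is not None}
    for package, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        report[package] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry for tesseract only means the executable is not on PATH. It can still work if config.json points at the full path to tesseract.exe.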
This project is licensed under the MIT License. See the LICENSE file for details.
