Extract Text from PDF using Python Module

Modules Required

PyPDF2 - A Python library for reading and manipulating PDF files

AIM

To build a Python script that uses the PyPDF2 module to extract text from a PDF file.

COMPILATION STEPS

Import the PyPDF2 module to:
- Read the PDF into the program for further processing
- Count the number of pages in the PDF
- Extract the text from a single PDF page
Initialize an empty string to store the text extracted from the PDF file
Use a for loop to go through each page
- The extractText() function extracts the text from the current page (newer PyPDF2 releases rename this method to extract_text())
- The extracted text is appended to the string initialized earlier using simple string concatenation
After all pages are processed, the string holding the extracted text is written to a new file named extracted_text.txt using basic file handling in Python
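
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names collect_text and extract_pdf_text are hypothetical, and it assumes a PyPDF2 release that provides PdfReader, reader.pages and extract_text() (older releases use PdfFileReader, getPage() and extractText() instead).

```python
def collect_text(pages):
    """Concatenate the text of every page into one string.

    `pages` is any sequence of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    text = ""  # empty string that accumulates the extracted text
    for page_number in range(len(pages)):  # len(pages) is the page count
        page = pages[page_number]
        # extract_text() is the newer name for extractText()
        text += page.extract_text()
    return text


def extract_pdf_text(pdf_path, out_path="extracted_text.txt"):
    """Read the PDF at pdf_path and write its text to out_path."""
    # Imported inside the function (an illustrative choice) so that
    # collect_text() stays usable even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    text = collect_text(reader.pages)

    # Basic file handling: "w" creates the file or overwrites an old one.
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    return text
```

Calling extract_pdf_text("sample.pdf"), where sample.pdf is a placeholder for your own file, should create extracted_text.txt in the current directory. Scanned PDFs that contain only images usually yield little or no text with this approach.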

PDF FILE WITH TEXT

OUTPUT OF EXTRACTED TEXT SHOWN IN NEW TEXT FILE extracted_text.txt

PyPDF2:

A pure-Python library built as a PDF toolkit. To learn more: PyPDF2 Docs

File handling

Python has built-in functions for handling files and performing operations such as reading and writing. Read about them: File Handling Docs
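
As a small illustration of the reading and writing mentioned above, the sketch below uses only the built-in open() function. The helper names save_text, append_text and load_text are hypothetical and chosen for this example; the default file name follows the extracted_text.txt used in this project.

```python
def save_text(text, path="extracted_text.txt"):
    """Write text to path, replacing any existing content (mode "w")."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def append_text(text, path="extracted_text.txt"):
    """Add text to the end of the file without overwriting it (mode "a")."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


def load_text(path="extracted_text.txt"):
    """Return the whole content of the file as one string (mode "r")."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

The with statement closes the file automatically, even if an error occurs while reading or writing.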

Author

@Sakalya Mitra