Closed
Labels
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.
Description
I am trying to extract text from various PDF documents for use in an NLP project. When I call page.extract_text(), spurious whitespace appears inside words and around punctuation in the output, even though the PDF document has no spaces at those positions.
Environment
Using VS Code and running from the Windows command prompt.
$ python -m platform
Windows-10-10.0.22621-SP0
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1
Code + PDF
This is a minimal, complete example that shows the issue:
test_doc.pdf
(The PDF was generated using default settings in Microsoft Word.) It looks like this:
The code is:
import os
from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))
print(f"PyPDF2=={__version__}")

# Concatenate the extracted text of every page
text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)
Output
PyPDF2==2.12.1
This is a test document by Ethan Nelson.
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for
testing purposes : 341 Maple st Paytonville Maine 45681.
Anyway, there are random whitespaces here .
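Until the extraction itself is fixed, some of these artifacts could probably be cleaned up afterwards. The following is only a minimal sketch of such a post-processing step: `tidy_extracted_text` is a hypothetical helper (not part of PyPDF2), and it assumes that a space before closing punctuation or after an opening parenthesis is never intended. It cannot repair spaces inserted inside words, such as "ph one" above.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Heuristically remove stray spaces around punctuation (hypothetical helper)."""
    # Drop spaces/tabs directly before closing punctuation: "here ." -> "here."
    text = re.sub(r"[ \t]+([.,;:!?)])", r"\1", text)
    # Drop spaces/tabs directly after an opening parenthesis: "( 000)" -> "(000)"
    text = re.sub(r"\([ \t]+", "(", text)
    # Collapse runs of spaces/tabs into a single space; newlines are left alone
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


if __name__ == "__main__":
    sample = "Tuesday was a good time to call ( 000) 000-0000 . Anyway, there are random whitespaces here ."
    print(tidy_extracted_text(sample))
```

Because the rules are purely textual, they would also alter text where such spacing is deliberate, so this is a stopgap for NLP preprocessing and not a fix for the underlying problem.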