-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Closed
Labels
is-bugFrom a users perspective, this is a bug - a violation of the expected behavior with a compliant PDFFrom a users perspective, this is a bug - a violation of the expected behavior with a compliant PDFwhitespaceWhile doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.workflow-text-extractionFrom a users perspective, text extraction is the affected feature/workflowFrom a users perspective, text extraction is the affected feature/workflow
Description
I've just noticed that PR #1172 introduced a space regression issue for text extraction. A lot of spaces got removed. Those spaces should have stayed.
Code + PDF
Just standard text extraction:
from PyPDF2 import PdfReader
reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
text += page.extract_text() + "\n"PDFs:
- https://arxiv.org/pdf/2201.00029.pdf - here it's very obvious
- https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf (German doc) - here it happens mostly with mathematical formula missing space to the surrounding text. That is a pattern I've seen in many of the other documents as well.
Metadata
Metadata
Assignees
Labels
is-bugFrom a users perspective, this is a bug - a violation of the expected behavior with a compliant PDFFrom a users perspective, this is a bug - a violation of the expected behavior with a compliant PDFwhitespaceWhile doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.workflow-text-extractionFrom a users perspective, text extraction is the affected feature/workflowFrom a users perspective, text extraction is the affected feature/workflow
