Closed
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
Missing spaces in extract_text() method.
See the attached PDF.
The text itself is extracted correctly, but the spaces between words are missing in almost all fields.
Environment
$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0
Code + PDF
PDF: 0004.pdf
from pypdf import PdfReader, __version__
print(f"pypdf=={__version__}")
reader = PdfReader("0004.pdf")
page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)
gives:
Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.
expected (copy-pasted with Google Chrome):
Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
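To make the mismatch concrete, here is a minimal sketch that compares the two snippets above and checks that they differ only in whitespace. The helper names `strip_whitespace` and `differs_only_in_whitespace` are hypothetical and not part of pypdf; the strings are copied from this report.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def differs_only_in_whitespace(extracted: str, expected: str) -> bool:
    """True if the strings are unequal but match once whitespace is removed."""
    return extracted != expected and strip_whitespace(extracted) == strip_whitespace(expected)


# Output of page.extract_text() as reported above (hypothetical variable names).
extracted = """Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime."""

# Text copy-pasted from the PDF viewer, as reported above.
expected = """Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time."""

print(differs_only_in_whitespace(extracted, expected))
```

A check like this could be useful in a regression test: the characters and line breaks already match, so only the missing spaces need to be addressed.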
Yes, you may add the PDF to the tests. It is public data, available here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx
P.S. Thank you for the great package!