Missing spaces in extract_text() method #1328

@Sunguru

Description

Missing spaces in extract_text() method.
See attached PDFs.
The text itself is extracted correctly, but the spaces between words are missing in almost all fields.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted from Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.
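
The two outputs above appear to differ only in whitespace. A minimal sketch of a check that confirms this, using only the strings quoted in this report (the helper names `strip_whitespace` and `differs_only_in_whitespace` are hypothetical and not part of pypdf):

```python
import re

# Output of page.extract_text() as quoted in this report.
ACTUAL = (
    " Reportingcrudeoilleak.\n"
    "Leakwasisolatedtowell\n"
    "pad.Segmentoflinewas\n"
    "immediatelyisolated,now\n"
    "estimatedat5barrelsofoil\n"
    "spilt.Rootcausestill\n"
    "unknownatthistime."
)

# Text copy-pasted from Google Chrome, as quoted in this report.
EXPECTED = (
    "Reporting crude oil leak.\n"
    "Leak was isolated to well\n"
    "pad. Segment of line was\n"
    "immediately isolated, now\n"
    "estimated at 5 barrels of oil\n"
    "spilt. Root cause still\n"
    "unknown at this time."
)


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def differs_only_in_whitespace(actual: str, expected: str) -> bool:
    """True if the strings are unequal but match once whitespace is removed."""
    return actual != expected and strip_whitespace(actual) == strip_whitespace(expected)


if __name__ == "__main__":
    print(differs_only_in_whitespace(ACTUAL, EXPECTED))
    # Number of whitespace-separated tokens in each version.
    print(len(ACTUAL.split()), len(EXPECTED.split()))
```

If the check holds, the characters themselves are extracted in the right order and only the word separation is lost, which would narrow the problem down to how spaces are inferred during extraction.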

0000.pdf

Yes, you may add to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!

Metadata


    Labels

    is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
    whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
