Space regression by PR 1172 #1362

@MartinThoma

I've just noticed that PR #1172 introduced a whitespace regression in text extraction: many spaces that should have been preserved are now removed.

Code + PDF

Just standard text extraction:

from PyPDF2 import PdfReader

reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

PDFs:

See https://arxiv.org/pdf/2201.00029.pdf
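
One way to make the regression measurable is to compare simple whitespace statistics of the extracted text before and after PR #1172. The sketch below is only an illustration: `whitespace_stats` is a hypothetical helper, not part of PyPDF2, and it works on any string, so it can be fed the `text` variable from the snippet above. A sharp drop in the space ratio or an unusually long "word" suggests that words were merged.

```python
def whitespace_stats(text: str) -> dict:
    """Summarise how much whitespace a piece of extracted text contains.

    Hypothetical helper for illustration; it is not a PyPDF2 API.
    """
    tokens = text.split()  # split on any run of whitespace
    spaces = text.count(" ")
    return {
        "chars": len(text),
        "spaces": spaces,
        # Fraction of characters that are plain spaces (0.0 for empty text).
        "space_ratio": spaces / len(text) if text else 0.0,
        # Merged words show up as one very long token.
        "longest_token": max((len(t) for t in tokens), default=0),
    }


if __name__ == "__main__":
    expected = whitespace_stats("A lot of spaces")
    regressed = whitespace_stats("Alotofspaces")
    print(expected)
    print(regressed)
```

Running the same function on the output of two PyPDF2 versions for the same PDF gives a quick, if rough, indication of whether spaces were lost.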

Labels

is-bug: From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF.
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.
