Skip to content

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

@renanbirck

Description

@renanbirck

I am trying to parse this PDF. However, I am getting on the output of extract_text() a bunch of spaces that are not in the original PDF.

See the screenshot - the original PDF on the left, the output of for what I mean (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):

image

If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.

Environment

I am using Python 3.12 in Fedora 39.

$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
text = reader.pages[0].extract_text()

Metadata

Metadata

Assignees

No one assigned

    Labels

    Has MCVEA minimal, complete and verifiable example helps a lot to debug / understand feature requestshelp wantedWe appreciate help everywhere - this one might be an easy start!is-bugFrom a users perspective, this is a bug - a violation of the expected behavior with a compliant PDFwhitespaceWhile doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.workflow-text-extractionFrom a users perspective, text extraction is the affected feature/workflow

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions