Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
help wanted: We appreciate help everywhere - this one might be an easy start!
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.
See the screenshot for what I mean, with the original PDF on the left and the output of extract_text() beside it (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):
If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.
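To make the symptom easy to check in a regression test, here is a minimal sketch that flags strings which differ from the expected text only in whitespace. The helper names `strip_spaces` and `differs_only_by_whitespace` are hypothetical and not part of pypdf; the sample strings are the two examples quoted above.

```python
def strip_spaces(s: str) -> str:
    """Remove every whitespace character from s."""
    return "".join(s.split())


def differs_only_by_whitespace(extracted: str, expected: str) -> bool:
    """Return True when the strings differ, but only in their whitespace."""
    return extracted != expected and strip_spaces(extracted) == strip_spaces(expected)


# Sample pairs (extracted, expected) quoted in this report.
samples = [
    ("Av. Beir a Rio", "Av. Beira Rio"),
    ("Cen tro", "Centro"),
]

for extracted, expected in samples:
    # Both pairs differ only by an inserted space, so both are flagged.
    print(repr(extracted), differs_only_by_whitespace(extracted, expected))
```

In practice, the `extracted` side would come from lines of `extract_text()` output and the `expected` side from text copied out of a PDF viewer.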
Environment
I am using Python 3.12 on Fedora 39.
$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
# Extract the text of the first page; the result contains the spurious spaces
text = reader.pages[0].extract_text()
print(text)
