Closed
Labels
is-regression: Regression introduced as a side-effect of another change
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
Hi! The ATP rankings are published as a PDF that I'm trying to parse, but since pypdf 4.3 the text returned by extract_text() no longer contains newline characters.
This worked fine on pypdf 4.2, so I ran a git bisect, which suggests that the issue was introduced in commit 23a81ba.
Environment
This is with Python 3.12.4 in a venv on Debian testing.
$ venv/bin/python3 -m platform
Linux-6.9.9-amd64-x86_64-with-glibc2.39
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
Code + PDF
The following PDF is the first page of the published results for Jul 22, 2024:
singles_entry_numerical_2024_07_22_firstpage.pdf
import pypdf
parser = pypdf.PdfReader("singles_entry_numerical_2024_07_22_firstpage.pdf")
page = parser.pages[0]
text = page.extract_text()
print(text)
When running this with pypdf 4.2, the extracted text contains new line characters just fine:
$ venv/bin/pip3 install pypdf==4.2.0
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
Rankings Date:
Rank # Player Jul 22, 2024 Grand Slam
Natl. Points
Dropping Next
Best Tourns.
Played Points Masters
1000 Points Other
Points Points Total
1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000
[...]
Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024
But on 4.3, new lines are no longer included:
$ venv/bin/pip3 install pypdf==4.3.1
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py
Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points Dropping Next Best Tourns. Played Points Masters 1000 Points Other Points Points Total 1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000 [...] Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024
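To make the difference easier to compare across versions, a small check along these lines might help. It is only a sketch: the helper names count_lines and extract_first_page are made up for illustration, and it assumes the sample PDF above sits in the current directory. It prints the installed pypdf version together with the number of non-empty lines in the extracted text, which should be well above 1 on 4.2.0 and 1 on 4.3.1 if the behaviour matches the output shown above.

```python
from pathlib import Path


def count_lines(text: str) -> int:
    """Return the number of non-empty lines in the extracted text."""
    return sum(1 for line in text.splitlines() if line.strip())


def extract_first_page(pdf_path: str) -> str:
    """Extract the text of the first page, same calls as in the report."""
    import pypdf  # imported lazily so count_lines() works without pypdf

    reader = pypdf.PdfReader(pdf_path)
    return reader.pages[0].extract_text()


if __name__ == "__main__":
    # Assumed location of the sample file attached to this report.
    pdf = Path("singles_entry_numerical_2024_07_22_firstpage.pdf")
    if pdf.exists():
        import pypdf

        text = extract_first_page(str(pdf))
        print(f"pypdf {pypdf.__version__}: {count_lines(text)} non-empty lines")
    else:
        print(f"{pdf} not found, nothing to check")
```

Running the same script in the venv after each pip install gives one number per version, which is easier to compare than the full text dumps.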
Traceback
N/A