New lines no longer included in extract_text() on 4.3 for a specific PDF file #2777

@supertassu

Description

Hi! The ATP rankings are published as a PDF that I'm trying to parse, but since pypdf 4.3, the text returned by extract_text() no longer contains any newline characters.

This worked fine on pypdf 4.2, so I did a git bisect. That suggests that this issue was introduced in commit 23a81ba.
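For anyone who wants to repeat the bisect, a check like the one below could be driven by `git bisect run`. This is only a minimal sketch, not necessarily the script used for the bisect above: the function names and the `MIN_LINES` threshold are assumptions made for illustration, and it presumes the pypdf checkout being bisected is the one importable from the active venv (for example via an editable install). The exit codes follow the `git bisect run` convention: 0 marks a revision as good, 1 as bad, 125 as "skip".

```python
"""Hypothetical check for `git bisect run`; names and threshold are assumptions."""

PDF_PATH = "singles_entry_numerical_2024_07_22_firstpage.pdf"
MIN_LINES = 5  # assumed: a correctly extracted page has at least this many lines


def bisect_exit_code(text: str, min_lines: int = MIN_LINES) -> int:
    """Return 0 (good) if the text has at least min_lines non-empty lines, else 1 (bad)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return 0 if len(lines) >= min_lines else 1


def extract_first_page(path: str) -> str:
    """Extract the text of the first page with whichever pypdf is importable."""
    import pypdf  # imported lazily so an import failure can be turned into a skip

    reader = pypdf.PdfReader(path)
    return reader.pages[0].extract_text()


def main() -> int:
    """Return the exit code that `git bisect run` should see for this revision."""
    try:
        text = extract_first_page(PDF_PATH)
    except Exception:
        return 125  # revision cannot be tested at all: ask git bisect to skip it
    return bisect_exit_code(text)
```

To use it, one would add `import sys` and a final `sys.exit(main())` call, save the file under some name (say, a hypothetical `check_newlines.py`), mark a known good and a known bad revision, and then run `git bisect run venv/bin/python3 check_newlines.py`.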

Environment

This is with Python 3.12.4 in a venv on Debian testing.

$ venv/bin/python3 -m platform
Linux-6.9.9-amd64-x86_64-with-glibc2.39

$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0

Code + PDF

The following PDF is the first page of the published results for Jul 22, 2024:
singles_entry_numerical_2024_07_22_firstpage.pdf

import pypdf

parser = pypdf.PdfReader("singles_entry_numerical_2024_07_22_firstpage.pdf")
page = parser.pages[0]
text = page.extract_text()

print(text)

When running this with pypdf 4.2, the extracted text contains new line characters just fine:

$ venv/bin/pip3 install pypdf==4.2.0
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py
Rankings Date: 
Rank # Player Jul 22, 2024 Grand Slam 
Natl. Points 
Dropping Next 
Best Tourns. 
Played Points Masters 
1000 Points Other 
Points Points Total 
1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000 
[...]
Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024 

But on 4.3, new lines are no longer included:

$ venv/bin/pip3 install pypdf==4.3.1
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py 
Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points Dropping Next Best Tourns. Played Points Masters 1000 Points Other Points Points Total 1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000 [...] Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024 
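Comparing the two pasted outputs token by token suggests that line breaks may not be the only difference: in the first player row, `0 0 18` from 4.2 shows up as `0 018` on 4.3. The sketch below makes that comparison explicit using only the row text quoted above. The helper name `token_changes` is made up for illustration, and the comparison is based on the excerpts in this report, not on a fresh extraction.

```python
import difflib

# First player row exactly as pasted above for each version.
old_row = "1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000"  # pypdf 4.2.0
new_row = "1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000"   # pypdf 4.3.1


def token_changes(old: str, new: str) -> list[tuple[str, list[str], list[str]]]:
    """Return the non-equal diff opcodes between the whitespace-split tokens."""
    old_tokens, new_tokens = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = token_changes(old_row, new_row)
print(changes)  # [('replace', ['0', '18'], ['018'])]
```

Because both strings are split on arbitrary whitespace first, a difference that is purely about newlines would produce an empty list here, so the single `replace` entry points at a space that went missing between two tokens.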

Traceback

N/A

Labels

is-regression: Regression introduced as a side-effect of another change
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
