Description
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.8.0-51-generic-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.3'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

temp_file = "/home/mark/test.pdf"
reader = PdfReader(temp_file)
for page in reader.pages:
    print(page.extract_text(extraction_mode="layout") + "\n")
I'm sorry, but I can't share this PDF file, as it is a) sensitive and b) customer data. I can provide debug traces if those help. Also, if there's a good way of anonymizing the data in the PDF without rewriting it, please let me know and I'll attempt that.
According to its metadata, the PDF appears to have been generated by this library: https://bfo.com/products/report/?version=work-20200610T1518-r36819M
My reading of this problem is that extract_text in layout mode ends up calling down into text_show_operations, and in this particular PDF the Tf operator is then ignored. The code seems to expect Tf to appear inside a q/Q or BT/ET pair, which in this PDF it doesn't. (It may be that this is a naughty PDF, but Okular and the Mozilla built-in PDF reader both handle it OK.)
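To make the suspected failure mode concrete, here is a toy model of a content-stream walker. It is not pypdf's code: the function `collect_text`, the `handle_tf_outside_bt` flag and the `Tj` operand "Hello" are all made up for illustration, and only the first four operators mirror the logged sequence. It shows how dropping a Tf that appears before BT leaves the text object without a font.

```python
# Toy model only, not pypdf's implementation.
# Operators are (name, operands) tuples, loosely mirroring the logged sequence.
def collect_text(ops, handle_tf_outside_bt):
    font = None        # current (font name, size), if any
    in_text = False    # True between BT and ET
    shown = []
    for op, operands in ops:
        if op == "BT":
            in_text = True
        elif op == "ET":
            in_text = False
        elif op == "Tf":
            # The hypothetical flag decides whether a Tf seen outside
            # BT/ET is remembered or silently dropped.
            if in_text or handle_tf_outside_bt:
                font = (operands[0], operands[1])
        elif op == "Tj":
            if font is None:
                raise ValueError("font not set: is PDF missing a Tf operator?")
            shown.append((font, operands[0]))
        # every other operator (K, rg, ...) is ignored in this toy model
    return shown


ops = [
    ("K", [0, 0, 0, 1]),
    ("rg", [0.341176, 0.392157, 0.462745]),
    ("Tf", ["/R1", 22]),   # font selected before the text object opens
    ("BT", []),
    ("Tj", ["Hello"]),     # invented operand, for illustration only
    ("ET", []),
]

print(collect_text(ops, handle_tf_outside_bt=True))
```

With `handle_tf_outside_bt=False` the same operator list raises, which matches the shape of the error in the traceback, though the real code path in pypdf may differ in detail.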
From logging I added:
text_show_operations b'K', [0, 0, 0, 1]
text_show_operations b'rg', [0.341176, 0.392157, 0.462745]
text_show_operations b'Tf', ['/R1', 22] <--- ignored
text_show_operations b'BT', [] <-- some text ops with no font = boom
Adding a couple of lines to handle Tf to text_show_operations fixes this:

    elif op == b"Tf":
        state_mgr.set_font(fonts[operands[0]], operands[1])
    else:  # set Tc, Tw, Tz, TL, and Ts if required. ignores all other ops
        state_mgr.set_state_param(op, operands)

and the output text looks to be about right.
However I'm not entirely sure if this is an appropriate fix (if it seems right, I'm happy to open a PR for it if that helps).
Traceback
Traceback (most recent call last):
  File "/home/mark/dev/vendeq-pdf-api/test.py", line 5, in <module>
    print(page.extract_text(extraction_mode="layout") + "\n")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2368, in extract_text
    return self._layout_mode_text(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2261, in _layout_mode_text
    bt_groups = _layout_mode.text_show_operations(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 279, in text_show_operations
    bts, tjs = recurs_to_target_op(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 171, in recurs_to_target_op
    _tj = text_state_mgr.text_state_params()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_text_state_manager.py", line 92, in text_state_params
    raise PdfReadError(
pypdf.errors.PdfReadError: font not set: is PDF missing a Tf operator?
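
As a possible stopgap until layout mode handles this, a caller could catch the error and retry with the default extraction mode. This is only a sketch: the helper name `extract_text_with_fallback` is hypothetical, and it assumes the default mode copes with this file, which has not been verified here. The `ImportError` branch exists only so the snippet runs where pypdf is not installed.

```python
try:
    from pypdf.errors import PdfReadError
except ImportError:  # lets the sketch run where pypdf is not installed
    PdfReadError = Exception


def extract_text_with_fallback(page):
    """Try layout mode first; fall back to the default mode on PdfReadError."""
    try:
        return page.extract_text(extraction_mode="layout")
    except PdfReadError:
        # The default mode loses the layout-preserving spacing, so this
        # trades fidelity for getting some text out at all.
        return page.extract_text()
```

In the example script above, this would replace the direct call inside the loop, for example `print(extract_text_with_fallback(page) + "\n")`.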