
pypdf.errors.PdfReadError: font not set: is PDF missing a Tf operator #3060

@blushingpenguin


Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-51-generic-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.3'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
temp_file = "/home/mark/test.pdf"
reader = PdfReader(temp_file)
for page in reader.pages:
    print(page.extract_text(extraction_mode="layout") + "\n")

I'm sorry, but I can't share this PDF file: it is sensitive and contains customer data. I can provide debug traces if those help. If there is a good way to anonymize the data in the PDF without rewriting it, please let me know and I'll attempt that.

According to its metadata, the PDF appears to have been generated by this library: https://bfo.com/products/report/?version=work-20200610T1518-r36819M

My reading of the problem is that extract_text in layout mode ends up calling down into text_show_operations, and in this particular PDF the Tf operator is ignored. The code seems to expect Tf to appear inside a q/Q or BT/ET pair, which in this PDF it doesn't. (It may be that this is a naughty PDF, but Okular and Mozilla's built-in PDF reader both handle it OK.)

Output from logging I added:
text_show_operations b'K', [0, 0, 0, 1]
text_show_operations b'rg', [0.341176, 0.392157, 0.462745]
text_show_operations b'Tf', ['/R1', 22] <--- ignored
text_show_operations b'BT', [] <-- some text ops with no font = boom
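
A quick way to check whether a page follows this pattern is to scan its operator list for Tf operators that occur outside any BT/ET pair. The following is a minimal sketch, not pypdf code: `tf_outside_text_objects` is a hypothetical helper, and it assumes the operations are given as `(operands, operator)` tuples, which I believe is the shape of pypdf's `ContentStream.operations`. The first four entries of the sample come from the log above; the trailing Tj/ET entries are made up for illustration.

```python
from typing import Any, List, Tuple

# Assumed shape of one content-stream operation: (operands, operator).
Operation = Tuple[List[Any], bytes]


def tf_outside_text_objects(operations: List[Operation]) -> List[int]:
    """Return the indices of Tf operators that occur outside any BT/ET pair."""
    inside_text = False
    hits: List[int] = []
    for index, (_operands, operator) in enumerate(operations):
        if operator == b"BT":
            inside_text = True
        elif operator == b"ET":
            inside_text = False
        elif operator == b"Tf" and not inside_text:
            hits.append(index)
    return hits


# First four operations are from the log; the Tj/ET tail is illustrative only.
sample: List[Operation] = [
    ([0, 0, 0, 1], b"K"),
    ([0.341176, 0.392157, 0.462745], b"rg"),
    (["/R1", 22], b"Tf"),
    ([], b"BT"),
    (["hello"], b"Tj"),
    ([], b"ET"),
]

print(tf_outside_text_objects(sample))
```

A non-empty result means the page sets its font before opening the text object, which is the situation described above.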

Adding a couple of lines to handle Tf to text_show_operations fixes this:

+        elif op == b"Tf":
+            state_mgr.set_font(fonts[operands[0]], operands[1])
         else:  # set Tc, Tw, Tz, TL, and Ts if required. ignores all other ops
             state_mgr.set_state_param(op, operands)

With this change, the output text looks about right.

However, I'm not entirely sure whether this is an appropriate fix. If it seems right, I'm happy to open a PR for it.
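
To illustrate why handling Tf at the top level avoids the error, here is a toy model of the behaviour described above. It is a deliberately simplified sketch under stated assumptions, not pypdf's implementation: `ToyTextState`, `walk`, and the `handle_tf_anywhere` flag are all hypothetical names, and the font table maps resource names to plain strings.

```python
from typing import Any, Dict, List, Optional, Tuple

Operation = Tuple[List[Any], bytes]  # assumed (operands, operator) shape


class ToyTextState:
    """Hypothetical, stripped-down stand-in for a text state manager."""

    def __init__(self) -> None:
        self.font: Optional[str] = None
        self.size: Optional[float] = None

    def set_font(self, font: str, size: float) -> None:
        self.font, self.size = font, size

    def params(self) -> Tuple[str, float]:
        if self.font is None or self.size is None:
            raise ValueError("font not set: is PDF missing a Tf operator?")
        return self.font, self.size


def walk(
    operations: List[Operation],
    fonts: Dict[str, str],
    handle_tf_anywhere: bool,
) -> List[Tuple[str, str, float]]:
    """Collect (text, font, size) for each Tj, optionally honouring Tf outside BT/ET."""
    state = ToyTextState()
    in_text = False
    shown: List[Tuple[str, str, float]] = []
    for operands, op in operations:
        if op == b"BT":
            in_text = True
        elif op == b"ET":
            in_text = False
        elif op == b"Tf" and (in_text or handle_tf_anywhere):
            state.set_font(fonts[operands[0]], operands[1])
        elif op == b"Tj":
            font, size = state.params()  # raises if no font was ever set
            shown.append((operands[0], font, size))
    return shown


# Illustrative data: Tf appears before BT, as in the log.
ops: List[Operation] = [(["/R1", 22], b"Tf"), ([], b"BT"), (["hello"], b"Tj"), ([], b"ET")]
fonts = {"/R1": "Helvetica"}

try:
    walk(ops, fonts, handle_tf_anywhere=False)
except ValueError as exc:
    print("without the fix:", exc)

print("with the fix:", walk(ops, fonts, handle_tf_anywhere=True))
```

In this model the only difference between the two runs is whether a Tf seen outside BT/ET updates the font state, which mirrors what the two added lines do.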

Traceback

Traceback (most recent call last):
  File "/home/mark/dev/vendeq-pdf-api/test.py", line 5, in <module>
    print(page.extract_text(extraction_mode="layout") + "\n")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2368, in extract_text
    return self._layout_mode_text(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2261, in _layout_mode_text
    bt_groups = _layout_mode.text_show_operations(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 279, in text_show_operations
    bts, tjs = recurs_to_target_op(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 171, in recurs_to_target_op
    _tj = text_state_mgr.text_state_params()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_text_state_manager.py", line 92, in text_state_params
    raise PdfReadError(
pypdf.errors.PdfReadError: font not set: is PDF missing a Tf operator?

Labels: workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow)