pypdf.errors.PdfReadError: font not set: is PDF missing a Tf operator? #3060

## Environment

Which environment were you using when you encountered the problem?

```bash
$ python -m platform
Linux-6.8.0-51-generic-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.3'), PIL=none
```

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
from pypdf import PdfReader
temp_file = "/home/mark/test.pdf"
reader = PdfReader(temp_file)
for page in reader.pages:
    print(page.extract_text(extraction_mode="layout") + "\n")
```

I'm sorry, but I can't share this PDF: it's sensitive and contains customer data. I can provide debug traces if those would help. If there's a good way to anonymize the data in the PDF without rewriting the file, please let me know and I'll try it.

According to its metadata, the PDF appears to have been generated by this library: https://bfo.com/products/report/?version=work-20200610T1518-r36819M

My reading of the problem is that `extract_text` in layout mode calls down into `text_show_operations`, and in this particular PDF the `Tf` operator is ignored. The code seems to expect `Tf` to appear inside a `q`/`Q` or `BT`/`ET` pair, which it doesn't in this PDF. (The PDF may well be malformed, but Okular and Mozilla's built-in PDF viewer both handle it fine.)

Output from logging I added:

```
text_show_operations b'K', [0, 0, 0, 1]
text_show_operations b'rg', [0.341176, 0.392157, 0.462745]
text_show_operations b'Tf', ['/R1', 22] <--- ignored
text_show_operations b'BT', [] <-- some text ops with no font = boom
```

Adding a couple of lines to `text_show_operations` to handle `Tf` fixes this:

```diff
+            elif op == b"Tf":
+                state_mgr.set_font(fonts[operands[0]], operands[1])
             else:  # set Tc, Tw, Tz, TL, and Ts if required. ignores all other ops
                 state_mgr.set_state_param(op, operands)
```

With this change, the extracted text looks about right.

However, I'm not entirely sure this is an appropriate fix. If it looks right, I'm happy to open a PR for it.
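To make the failure mode concrete without the PDF, here is a toy model of the behaviour as I understand it. It is not pypdf's code: `collect_text`, `FontNotSetError`, and the `tf_outside_bt` flag are names made up for this sketch, and the `Tj` operand is invented because my log doesn't show the actual text ops. The first three operations are the ones from the log above.

```python
from typing import Any

# Toy model only: hypothetical names, not pypdf internals.
Operation = tuple[bytes, list[Any]]


class FontNotSetError(Exception):
    """Stand-in for pypdf.errors.PdfReadError in this sketch."""


def collect_text(ops: list[Operation], tf_outside_bt: bool) -> list[tuple[str, float, str]]:
    """Walk a flat operator list and return (font, size, text) per Tj.

    tf_outside_bt=False models the reported behaviour: a Tf that appears
    before BT is dropped. tf_outside_bt=True models the proposed fix.
    """
    font = None
    in_text = False
    shown = []
    for op, operands in ops:
        if op == b"BT":
            in_text = True
        elif op == b"ET":
            in_text = False
        elif op == b"Tf":
            if in_text or tf_outside_bt:
                font = (operands[0], operands[1])
        elif op == b"Tj":
            if font is None:
                raise FontNotSetError("font not set: is PDF missing a Tf operator?")
            shown.append((font[0], font[1], operands[0]))
        # every other operator (K, rg, ...) is irrelevant to this sketch
    return shown


OPS: list[Operation] = [
    (b"K", [0, 0, 0, 1]),
    (b"rg", [0.341176, 0.392157, 0.462745]),
    (b"Tf", ["/R1", 22]),   # font selected before BT, as in my log
    (b"BT", []),
    (b"Tj", ["Hello"]),     # invented text operand
    (b"ET", []),
]

if __name__ == "__main__":
    print(collect_text(OPS, tf_outside_bt=True))
    try:
        collect_text(OPS, tf_outside_bt=False)
    except FontNotSetError as exc:
        print("fails as reported:", exc)
```

If the real code path differs from this model (for example, if `Tf` is tracked in some other scope I missed), the sketch is wrong in the same way my reading is.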

## Traceback
```
Traceback (most recent call last):
  File "/home/mark/dev/vendeq-pdf-api/test.py", line 5, in <module>
    print(page.extract_text(extraction_mode="layout") + "\n")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2368, in extract_text
    return self._layout_mode_text(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_page.py", line 2261, in _layout_mode_text
    bt_groups = _layout_mode.text_show_operations(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 279, in text_show_operations
    bts, tjs = recurs_to_target_op(
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_fixed_width_page.py", line 171, in recurs_to_target_op
    _tj = text_state_mgr.text_state_params()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mark/dev/vendeq-pdf-api/.venv/lib/pypy3.10/site-packages/pypdf/_text_extraction/_layout_mode/_text_state_manager.py", line 92, in text_state_params
    raise PdfReadError(
pypdf.errors.PdfReadError: font not set: is PDF missing a Tf operator?
```
