PDF text extraction fails for subsetted/custom fonts — shows GLYPH placeholders #2170

When a PDF is passed to the converter as a `DocumentStream` via the `stream` parameter (the PDF's raw bytes), some PDFs produce garbled text: either `GLYPH<c=...>` placeholders or letters that appear shifted (e.g., `7KH` instead of `THE`). This occurs only for PDFs that use subsetted fonts or custom encodings without a proper ToUnicode map. PDFs created from Word, LaTeX, or InDesign usually work fine, but certain exported or scanned PDFs fail.

**Steps to reproduce:**

```python
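from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Read the raw PDF bytes; DocumentStream expects a file-like object, so wrap them in BytesIO
with open("problematic.pdf", "rb") as f:
    pdf_bytes = f.read()

source = DocumentStream(name="problematic.pdf", stream=BytesIO(pdf_bytes))
conv_result = DocumentConverter().convert(source)

# Inspect the extracted text for GLYPH<...> placeholders or shifted letters
print(conv_result.document.export_to_markdown())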
```

**Observed behavior:**

* Output text contains `GLYPH<c=...>` sequences.
* Letters may appear shifted, as if Caesar-ciphered.
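
Both symptoms can be checked on the exported text with a small helper. This is a minimal sketch, not part of Docling: `count_glyph_placeholders` and `unshift` are hypothetical names, and the sample strings are made up for illustration. The offset of 29 is only an observation that it turns `7KH` into `The`, which would be consistent with glyph indices being emitted in place of character codes; it is not confirmed for any particular file.

```python
import re

# Matches placeholders of the form GLYPH<...> as they appear in the garbled output
GLYPH_RE = re.compile(r"GLYPH<[^>]*>")


def count_glyph_placeholders(text: str) -> int:
    """Return how many GLYPH<...> placeholders occur in the text."""
    return len(GLYPH_RE.findall(text))


def unshift(text: str, offset: int) -> str:
    """Add a constant offset to the code point of every non-whitespace character."""
    return "".join(ch if ch.isspace() else chr(ord(ch) + offset) for ch in text)


# Made-up sample strings, for illustration only
print(count_glyph_placeholders("a GLYPH<c=3> b GLYPH<c=12>"))
print(unshift("7KH", 29))
```

If a constant offset recovers readable words across a whole page, the text layer is probably intact and only the character mapping is wrong. That narrows the problem to the font encoding rather than to layout or reading order.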

**Expected behavior:**

* Extracted text should reflect the real content of the PDF.

**Environment:**

* Docling version: `2.40.0`
* Python version: `3.12.3`

