PDF text extraction fails for subsetted/custom fonts — shows GLYPH placeholders #2170

@manikrishna-m

Description

When a PDF is passed to the converter as a DocumentStream (raw bytes supplied through the stream parameter), some files produce garbled text: GLYPH<c=...> placeholders, or letters that appear shifted (e.g., 7KH instead of THE). This happens only with PDFs that use subsetted fonts or custom encodings without a proper ToUnicode map. PDFs created from Word, LaTeX, or InDesign usually work fine, but certain exported or scanned PDFs fail.

Steps to reproduce:

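from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Read the raw PDF bytes
with open("problematic.pdf", "rb") as f:
    pdf_bytes = f.read()

# DocumentStream expects a binary stream, so wrap the bytes in BytesIO
source = DocumentStream(name="problematic.pdf", stream=BytesIO(pdf_bytes))

# doc_converter is a DocumentConverter instance (self.doc_converter in the application code)
doc_converter = DocumentConverter()
conv_result = doc_converter.convert(source)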

Observed behavior:

  • Output text contains GLYPH<c=...> sequences.
  • Letters may appear shifted, as if Caesar-ciphered.
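To help triage affected files, here is a minimal diagnostic sketch using only the standard library. It is not part of Docling. The helper names (count_glyph_placeholders, shift_codepoints, find_offsets) are hypothetical, and the font=/F1 part of the sample placeholder is an assumed format, since the report only shows GLYPH<c=...>. For the sample above, adding 29 to each code point turns 7KH into The, which suggests the extractor is emitting raw glyph codes at a fixed offset rather than Unicode. The offset may differ for other fonts.

```python
import re

# Matches placeholders such as GLYPH<c=3,font=/F1>.
# The exact fields inside <...> are an assumption; the pattern accepts anything up to ">".
GLYPH_RE = re.compile(r"GLYPH<[^>]*>")


def count_glyph_placeholders(text: str) -> int:
    """Return how many GLYPH<...> placeholders appear in the extracted text."""
    return len(GLYPH_RE.findall(text))


def shift_codepoints(text: str, offset: int) -> str:
    """Shift each character by a fixed code point offset.

    Whitespace and characters whose shifted value would fall outside
    the valid Unicode range are left unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch) + offset
        if ch.isspace() or not 0 <= code <= 0x10FFFF:
            out.append(ch)
        else:
            out.append(chr(code))
    return "".join(out)


def find_offsets(garbled: str, expected: str, max_offset: int = 64) -> list[int]:
    """Return every offset in [-max_offset, max_offset] that maps garbled to expected."""
    return [
        k
        for k in range(-max_offset, max_offset + 1)
        if shift_codepoints(garbled, k) == expected
    ]


if __name__ == "__main__":
    print(find_offsets("7KH", "The"))  # → [29]
    print(count_glyph_placeholders("GLYPH<c=1,font=/F1> GLYPH<c=2,font=/F1>"))  # → 2
```

A non-zero placeholder count, or a single consistent offset across several known words, points to a missing or unusable ToUnicode map rather than a problem with how the bytes were passed in.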

Expected behavior:

  • Extracted text should reflect the real content of the PDF.

Environment:

  • Docling version: 2.40.0
  • Python version: 3.12.3

Labels: bug