-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Open
Labels
bugSomething isn't workingSomething isn't working
Description
When passing a PDF to DocumentStream using the stream parameter (raw bytes), some PDFs produce garbled text with GLYPH<c=...> placeholders or letters that appear shifted (e.g., 7KH instead of THE). This occurs only for PDFs that use subsetted fonts or custom encodings without a proper ToUnicode map. PDFs created from Word, LaTeX, or InDesign usually work fine, but certain exported PDFs or scanned PDFs fail.
Steps to reproduce:
from docling import DocumentStream
# Using raw PDF bytes
with open("problematic.pdf", "rb") as f:
doc = f.read()
source = DocumentStream(name="problematic.pdf", stream=doc)
conv_result = self.doc_converter.convert(source)Observed behavior:
- Output text contains
GLYPH<c=...>sequences. - Letters may appear shifted, as if Caesar-ciphered.
Expected behavior:
- Extracted text should reflect the real content of the PDF.
Environment:
- Docling version:
2.40.0 - Python version:
3.12.3
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working