fix: preserve OCR text from picture clusters during page assembly by sriharan0804 · Pull Request #3686 · docling-project/docling

sriharan0804 · 2026-06-23T17:02:06Z

Summary

While investigating #3569, I found that OCR text extracted from scanned PDFs can be lost during page assembly when the layout model classifies the page as a picture cluster.

In these cases:

OCR extraction succeeds and produces valid TextCell objects.
OCR cells are correctly stored in parsed_page.textline_cells.
The layout model classifies the region as a picture.
PageAssembleModel creates a FigureElement/picture element and the OCR text is not included in the exported document.
Markdown and text exports therefore contain only <!-- image --> even though OCR text is available.

Changes

This PR adds a fallback in PageAssembleModel:

When a picture cluster contains OCR-generated text cells, a TextElement is created from those OCR cells.
The extracted text is preserved in document assembly and becomes available in Markdown/Text export.
Existing figure handling remains unchanged for picture clusters that do not contain OCR text.
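
The fallback described above can be illustrated with a minimal, self-contained sketch. It does not use docling's actual classes: `Cell`, `Cluster`, `TextElement`, `FigureElement`, and `assemble_cluster` are simplified stand-ins invented for illustration. Emitting the text element in place of the figure (rather than alongside it) is also an assumption, since the description above does not spell that out.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; from_ocr marks OCR-generated cells."""
    text: str
    from_ocr: bool = False


@dataclass
class Cluster:
    """Hypothetical stand-in for a layout cluster with its assigned cells."""
    label: str
    cells: List[Cell] = field(default_factory=list)


@dataclass
class TextElement:
    text: str


@dataclass
class FigureElement:
    pass


def assemble_cluster(cluster: Cluster) -> Union[TextElement, FigureElement]:
    """Turn one cluster into an element, keeping OCR text found in pictures."""
    if cluster.label == "picture":
        # Collect only non-empty OCR-generated cells inside the picture.
        ocr_text = " ".join(
            c.text.strip() for c in cluster.cells if c.from_ocr and c.text.strip()
        )
        if ocr_text:
            # Fallback: preserve the OCR text instead of dropping it.
            return TextElement(text=ocr_text)
        # No OCR text: figure handling is unchanged.
        return FigureElement()
    # Non-picture clusters become text elements as before.
    return TextElement(text=" ".join(c.text for c in cluster.cells))


scanned_page = Cluster("picture", [Cell("Hello", from_ocr=True), Cell("world", from_ocr=True)])
print(assemble_cluster(scanned_page))
```

In this sketch a picture cluster without OCR cells still yields a `FigureElement`, which mirrors the "existing figure handling remains unchanged" point above.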

Reproduction

Before this change:

OCR extraction -> succeeds
OCR cells -> present
Export -> <!-- image -->
Text output -> empty

After this change:

OCR extraction -> succeeds
OCR cells -> present
Export -> OCR text preserved
Text output -> contains extracted text

Notes

This PR addresses one root cause discovered during the investigation of #3569. Additional OCR-quality or language-specific issues may still require separate investigation.

github-actions · 2026-06-23T17:02:24Z

✅ DCO Check Passed

Thanks @sriharan0804, all your commits are properly signed off. 🎉

mergify · 2026-06-23T17:02:43Z

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
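
As a rough local check, the title rule above can be tried out in Python. This is a sketch under the assumption that the `~=` operator performs a regular-expression search comparable to `re.search`; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied from the merge-protection rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return PATTERN.search(title) is not None


title = "fix: preserve OCR text from picture clusters during page assembly"
print(is_conventional(title))
```

The optional `(scope)` and `!` groups mean that a title such as `feat(ocr)!: ...` would also pass, while a title without a recognized type prefix would not.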

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>

…oreply.github.com> I, sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>, hereby add my Signed-off-by to this commit: da9510a Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>

sriharan0804 · 2026-06-23T20:19:41Z

@Yuanyi1104

While investigating this, I found one related root cause: OCR text can be extracted successfully, but if the layout model classifies the scanned page as a picture cluster, the OCR text is dropped during page assembly and the export contains only <!-- image -->.

I opened a PR to preserve OCR text from picture clusters. This does not claim to fully solve all OCR quality/language issues in #3569, but it fixes one concrete text-loss path I reproduced locally.

cau-git · 2026-06-30T13:04:35Z

@sriharan0804 Thanks for this contribution. We are aware that in-picture text from OCR is lost. The same applies to native text from PDF backends. The reason for making this choice is that random text-level fragments in pictures typically add a lot of noise. To address this properly, the text elements inside the picture need to be detected as blocks with the layout detector, but we are not applying it recursively so far. This could be a future pipeline improvement, which should address the original issue as well.

sriharan0804 · 2026-06-30T13:16:04Z

@cau-git Thanks for the explanation! That makes sense. I understand the concern about introducing noisy OCR fragments from pictures. I appreciate the clarification on the intended design and the future direction with recursive layout detection. I'll keep this in mind while investigating related OCR and layout issues.

cau-git · 2026-06-30T13:32:06Z

@sriharan0804 by the way, you should be able to control this aspect already through the layout options.
pipeline_options.layout_options.create_orphan_clusters = True will create text elements for any text un-matched by the layout detector. It should give you what you ask for, but beware of the noise it creates.
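
For reference, a minimal sketch of how that option might be wired into a conversion. The attribute path `layout_options.create_orphan_clusters` is taken from the comment above. The surrounding calls follow docling's usual `DocumentConverter` setup and may differ between versions. The file name `scanned.pdf` and the helper `build_converter` are placeholders.

```python
def build_converter():
    """Build a PDF converter that keeps text not matched by the layout detector."""
    # Imported here so the sketch can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Option named in the comment above: create text elements for orphaned cells.
    pipeline_options.layout_options.create_orphan_clusters = True

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


if __name__ == "__main__":
    converter = build_converter()
    result = converter.convert("scanned.pdf")  # placeholder input path
    print(result.document.export_to_markdown())
```

As noted above, the extra text elements can be noisy, so it is worth comparing the export with and without the option on a few representative pages.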

sriharan0804 · 2026-06-30T13:37:46Z

@cau-git Thanks for the clarification! I wasn't aware of the create_orphan_clusters option. That makes sense, and I can see why it's disabled by default given the amount of noise it can introduce. I'll experiment with that configuration and keep the future recursive layout approach in mind. Thanks for taking the time to explain it!

fix: preserve OCR text from picture clusters

da9510a

sriharan0804 added 3 commits June 23, 2026 22:33

DCO Remediation Commit for <commit_sha>

9faace1

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>

DCO Remediation Commit

ff2f51b

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>

sriharan0804 wants to merge 4 commits into docling-project:main from sriharan0804:fix-ocr-picture-cluster-text-loss
