fix: preserve OCR text from picture clusters during page assembly#3686

Open
sriharan0804 wants to merge 4 commits into
docling-project:mainfrom
sriharan0804:fix-ocr-picture-cluster-text-loss
@sriharan0804

Summary

While investigating #3569, I found that OCR text extracted from scanned PDFs can be lost during page assembly when the layout model classifies the page as a picture cluster.

In these cases:

  • OCR extraction succeeds and produces valid TextCell objects.
  • OCR cells are correctly stored in parsed_page.textline_cells.
  • The layout model classifies the region as a picture.
  • PageAssembleModel creates a FigureElement/picture element and the OCR text is not included in the exported document.
  • Markdown and text exports therefore contain only <!-- image --> even though OCR text is available.

Changes

This PR adds a fallback in PageAssembleModel:

  • When a picture cluster contains OCR-generated text cells, a TextElement is created from those OCR cells.
  • The extracted text is preserved in document assembly and becomes available in Markdown/Text export.
  • Existing figure handling remains unchanged for picture clusters that do not contain OCR text.
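The fallback described above can be illustrated with a small self-contained sketch. This is not the PR's diff or docling's real `PageAssembleModel`: the `Cell`, `Cluster`, and `Element` classes, the `assemble` function, and the `keep_picture_ocr_text` flag are all hypothetical stand-ins. The sketch also assumes the text element is emitted in place of the figure, which the description does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell."""
    text: str
    from_ocr: bool = False


@dataclass
class Cluster:
    """Hypothetical stand-in for a layout cluster and the cells inside it."""
    label: str
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Element:
    """Hypothetical assembled element: kind is "text" or "figure"."""
    kind: str
    text: str = ""


def assemble(clusters: List[Cluster], keep_picture_ocr_text: bool) -> List[Element]:
    """Turn clusters into elements, optionally keeping OCR text found in pictures."""
    elements = []
    for cluster in clusters:
        if cluster.label == "picture":
            # Collect only non-empty cells that came from OCR.
            ocr_text = " ".join(
                c.text for c in cluster.cells if c.from_ocr and c.text.strip()
            )
            if keep_picture_ocr_text and ocr_text:
                # Fallback: emit the OCR text instead of a bare figure (assumption).
                elements.append(Element("text", ocr_text))
            else:
                # Unchanged path: pictures without usable OCR text stay figures.
                elements.append(Element("figure"))
        else:
            elements.append(Element("text", " ".join(c.text for c in cluster.cells)))
    return elements


def to_markdown(elements: List[Element]) -> str:
    """Render text elements as-is and figures as the image placeholder."""
    return "\n\n".join(
        e.text if e.kind == "text" else "<!-- image -->" for e in elements
    )


# A scanned page that the layout model labelled as a single picture.
page = [Cluster("picture", [Cell("Invoice 42", from_ocr=True)])]
print(to_markdown(assemble(page, keep_picture_ocr_text=False)))
print(to_markdown(assemble(page, keep_picture_ocr_text=True)))
```

With the flag off, the sketch reproduces the reported behaviour, where only the image placeholder is exported; with it on, the OCR text reaches the export.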

Reproduction

Before this change:

OCR extraction -> succeeds
OCR cells -> present
Export -> <!-- image -->
Text output -> empty

After this change:

OCR extraction -> succeeds
OCR cells -> present
Export -> OCR text preserved
Text output -> contains extracted text

Notes

This PR addresses one root cause discovered during the investigation of #3569. Additional OCR-quality or language-specific issues may still require separate investigation.

@github-actions

github-actions Bot commented Jun 23, 2026

Contributor

DCO Check Passed

Thanks @sriharan0804, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 23, 2026

Contributor

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
…oreply.github.com>

I, sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>, hereby add my Signed-off-by to this commit: da9510a

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
@sriharan0804

Author

@Yuanyi1104

While investigating this, I found one related root cause: OCR text can be extracted successfully, but if the layout model classifies the scanned page as a picture cluster, the OCR text is dropped during page assembly and the export becomes only <!-- image -->.

I opened a PR to preserve OCR text from picture clusters. This does not claim to fully solve all OCR quality/language issues in #3569, but it fixes one concrete text-loss path I reproduced locally.

@cau-git

cau-git commented Jun 30, 2026

Member

@sriharan0804 Thanks for this contribution. We are aware that in-picture text from OCR is lost. The same applies to native text from PDF backends. The reason for making this choice is that random text-level fragments in pictures typically add a lot of noise. To address this properly, the text elements inside the picture need to be detected as blocks with the layout detector, but we are not applying it recursively so far. This could be a future pipeline improvement, which should address the original issue as well.

@sriharan0804

Author

@cau-git Thanks for the explanation! That makes sense. I understand the concern about introducing noisy OCR fragments from pictures. I appreciate the clarification on the intended design and the future direction with recursive layout detection. I'll keep this in mind while investigating related OCR and layout issues.

@cau-git

cau-git commented Jun 30, 2026

Member

@sriharan0804 By the way, you should already be able to control this through the layout options. Setting pipeline_options.layout_options.create_orphan_clusters = True creates text elements for any text left unmatched by the layout detector. It should give you what you are asking for, but beware of the noise it creates.
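For readers who want to try this setting, a minimal sketch of wiring it into a converter follows. The `build_converter` helper name is hypothetical, the snippet assumes a docling installation whose `PdfPipelineOptions` exposes `layout_options` as described in the comment above, and the option's default value may differ between versions, so it is set explicitly here.

```python
def build_converter():
    """Build a docling converter that keeps text left unmatched by the layout model."""
    # Imports are deferred so that defining this helper does not load docling itself.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # Run OCR so that scanned pages produce text cells at all.
    pipeline_options.do_ocr = True
    # Option named in the comment above: create text elements for unmatched text.
    pipeline_options.layout_options.create_orphan_clusters = True

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Calling `build_converter().convert("scanned.pdf").document.export_to_markdown()`, where `scanned.pdf` is a placeholder path, should then show whether the extra text elements are useful or too noisy for a given document.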

@sriharan0804

Author

@cau-git Thanks for the clarification! I wasn't aware of the create_orphan_clusters option. That makes sense, and I can see why it's disabled by default given the amount of noise it can introduce. I'll experiment with that configuration and keep the future recursive layout approach in mind. Thanks for taking the time to explain it!


2 participants