fix: preserve OCR text from picture clusters during page assembly #3686
sriharan0804 wants to merge 4 commits
While investigating this, I found one related root cause: OCR text can be extracted successfully, but if the layout model classifies the scanned page as a picture cluster, the OCR text is dropped during page assembly and the export becomes only `<!-- image -->`. I opened a PR to preserve OCR text from picture clusters. This does not claim to fully solve all OCR quality/language issues in #3569, but it fixes one concrete text-loss path I reproduced locally.
@sriharan0804 Thanks for this contribution. We are aware that in-picture text from OCR is lost. The same applies to native text from PDF backends. The reason for this choice is that random text-level fragments in pictures typically add a lot of noise. To address this properly, the text elements inside the picture need to be detected as blocks by the layout detector, but we are not applying it recursively so far. This could be a future pipeline improvement, which should address the original issue as well.
@cau-git Thanks for the explanation! That makes sense. I understand the concern about introducing noisy OCR fragments from pictures. I appreciate the clarification on the intended design and the future direction with recursive layout detection. I'll keep this in mind while investigating related OCR and layout issues.
@sriharan0804 By the way, you should already be able to control this aspect through the layout options.
@cau-git Thanks for the clarification! I wasn't aware of the `create_orphan_clusters` option. That makes sense, and I can see why it's disabled by default given the amount of noise it can introduce. I'll experiment with that configuration and keep the future recursive layout approach in mind. Thanks for taking the time to explain it!
Summary
While investigating #3569, I found that OCR text extracted from scanned PDFs can be lost during page assembly when the layout model classifies the page as a `picture` cluster.
In these cases:
- OCR produces `TextCell` objects.
- The OCR cells are present in `parsed_page.textline_cells`.
- The layout model labels the page as `picture`.
- `PageAssembleModel` creates a `FigureElement`/picture element and the OCR text is not included in the exported document.
- The export contains only `<!-- image -->` even though OCR text is available.
Changes
This PR adds a fallback in `PageAssembleModel`: when a cluster labeled `picture` contains OCR text cells, a `TextElement` is created from those OCR cells.
Reproduction
Before this change:
After this change:
Notes
This PR addresses one root cause discovered during the investigation of #3569. Additional OCR-quality or language-specific issues may still require separate investigation.
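To make the fallback described under Changes easier to follow, here is a minimal, self-contained sketch of the idea. It is not the PR's diff and does not use docling's real classes: `Cell`, `Cluster`, `Figure`, `Text`, `assemble`, and the `keep_picture_text` flag are hypothetical stand-ins. Whether the real change keeps the picture element alongside the new text element is also an assumption here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Cell:
    """Hypothetical stand-in for an OCR text cell."""
    text: str


@dataclass
class Cluster:
    """Hypothetical stand-in for a layout cluster and its assigned cells."""
    label: str
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Figure:
    """Hypothetical picture element; exports as an image placeholder."""
    def export(self) -> str:
        return "<!-- image -->"


@dataclass
class Text:
    """Hypothetical text element."""
    text: str

    def export(self) -> str:
        return self.text


def join_cells(cells: List[Cell]) -> str:
    # Join non-empty cell texts with single spaces.
    return " ".join(c.text.strip() for c in cells if c.text.strip())


def assemble(clusters: List[Cluster],
             keep_picture_text: bool = False) -> List[Union[Figure, Text]]:
    elements: List[Union[Figure, Text]] = []
    for cluster in clusters:
        if cluster.label == "picture":
            elements.append(Figure())
            text = join_cells(cluster.cells)
            # Fallback (assumed behavior): keep OCR text found inside the picture.
            if keep_picture_text and text:
                elements.append(Text(text))
        else:
            elements.append(Text(join_cells(cluster.cells)))
    return elements


def export(elements: List[Union[Figure, Text]]) -> str:
    return "\n\n".join(e.export() for e in elements)


# A scanned page that the layout model labeled as a single picture cluster.
page = [Cluster("picture", [Cell("Invoice 2024"), Cell("Total: 42 EUR")])]
before = export(assemble(page))                         # OCR text is dropped
after = export(assemble(page, keep_picture_text=True))  # OCR text is kept
print(before)
print(after)
```

Making the behavior opt-in in this sketch reflects the concern raised in the conversation that text fragments inside pictures often add noise.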