ocr metadata #2568

hakankaraoguz · 2024-02-21T08:29:59Z

Hi,

When using auto partitioning to partition PDFs, is it possible to get OCR metadata (quality, whether OCR was used, etc.) when the PDF parser falls back to the OCR strategy?

christinestraub · 2024-02-21T20:15:33Z

Hi @hakankaraoguz

Can you please share the code you're trying and more details about the OCR metadata you want to get?

hakankaraoguz · 2024-02-21T23:11:30Z

Hi @christinestraub
According to the documentation, when the auto strategy is used there is no indicator in the element metadata that unstructured fell back to the OCR strategy. However, here I can see that OCR confidence is extracted in pytesseract. I would like the OCR confidence to be present in the element metadata, along with a strategy flag, so that I can filter out low-quality text after the parsing stage.
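
To make the request concrete, here is a rough sketch of the kind of post-parsing filter described above. It is not part of unstructured. It assumes the word-level results have the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, i.e. parallel `"text"` and `"conf"` lists where `conf` is -1 for rows that are not recognized words. The helper names `mean_ocr_confidence` and `keep_page` and the threshold of 60 are hypothetical, chosen only for illustration.

```python
from statistics import mean


def mean_ocr_confidence(ocr_data, min_conf=0.0):
    """Average word-level confidence from a pytesseract-style result dict.

    Assumption: `ocr_data` looks like the output of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    Rows with conf == -1 (layout rows, not words) and blank text are skipped.
    Returns None when no usable words are found.
    """
    confs = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # conf may arrive as str, int, or float
        if conf >= min_conf and str(text).strip():
            confs.append(conf)
    return mean(confs) if confs else None


def keep_page(ocr_data, threshold=60.0):
    """Hypothetical filter: keep a page only if its mean confidence is high enough."""
    score = mean_ocr_confidence(ocr_data)
    return score is not None and score >= threshold
```

If something like this score were attached to each element's metadata together with a flag for the strategy actually used, the low-quality text could be dropped in a single pass after partitioning.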

hakankaraoguz · 2024-03-05T08:23:18Z

Any updates on this?

christinestraub · 2024-03-05T18:34:22Z

@hakankaraoguz Did you try the hi_res strategy? Is the detection_class_prob metadata field not working for your case?
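
For reference, a minimal sketch of filtering on that field. It assumes the elements come from `partition_pdf(filename=..., strategy="hi_res")` and that each element exposes `metadata.detection_class_prob` (which may be missing or None for some elements). The helper name `filter_by_detection_prob` and the 0.5 cutoff are hypothetical.

```python
def filter_by_detection_prob(elements, min_prob=0.5):
    """Keep elements whose layout-detection probability is at least min_prob.

    Elements without a detection_class_prob value are kept, since the
    field is only populated when a layout model classified the region.
    """
    kept = []
    for el in elements:
        prob = getattr(el.metadata, "detection_class_prob", None)
        if prob is None or prob >= min_prob:
            kept.append(el)
    return kept


# Usage sketch (requires unstructured with the hi_res extras installed):
# from unstructured.partition.pdf import partition_pdf
# elements = partition_pdf(filename="example.pdf", strategy="hi_res")
# confident = filter_by_detection_prob(elements, min_prob=0.8)
```

Note that this probability describes how confident the layout model is about the element type, not how well the characters were recognized.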

hkaraoguz · 2024-03-08T10:10:43Z

I will try it out, but according to this article, detection_class_prob is the class confidence of the extracted section (Table, Header, etc.) in the PDF. I am more interested in getting the OCR quality result when the algorithm falls back to OCR. Thank you @christinestraub

MthwRobinson added the ocr label ("Related to optical character recognition (OCR).") on May 23, 2024

christinestraub added the enhancement label ("New feature or request") on Jun 21, 2024
