ocr metadata #2568

Open
hakankaraoguz opened this issue Feb 21, 2024 · 5 comments
Labels
enhancement: New feature or request
ocr: Related to optical character recognition (OCR).

Comments

@hakankaraoguz

Hi,

When partitioning PDFs with the auto strategy, is it possible to get OCR metadata (e.g., quality, whether OCR was used) when the PDF parser falls back to the OCR strategy?

@christinestraub
Collaborator

Hi @hakankaraoguz

Can you please share the code you're trying and more details about the OCR metadata you want to get?

@hakankaraoguz
Author

Hi @christinestraub
According to the documentation, when the auto strategy is used, the element metadata gives no indication that unstructured fell back to the OCR strategy. However, here I can see that an OCR confidence is extracted in pytesseract. I would like the element metadata to include this OCR confidence along with a strategy flag, so that I can filter out low-quality text after the parsing stage.
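
To make the request concrete, here is a rough sketch of the filtering in question. It assumes the word-level dictionary that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns, with parallel `text` and `conf` lists where `conf` is -1 for rows that are not words. The helper names `mean_word_confidence` and `keep_confident` and the `ocr_confidence` key are hypothetical and are not part of unstructured's metadata today.

```python
from typing import Optional


def mean_word_confidence(ocr_data: dict) -> Optional[float]:
    """Average Tesseract confidence (0-100) over recognized words.

    `ocr_data` is assumed to have the shape of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    confidences = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # normalize in case conf is not already numeric
        # Skip structural rows (conf == -1) and empty or whitespace-only tokens.
        if conf < 0 or not str(text).strip():
            continue
        confidences.append(conf)
    if not confidences:
        return None  # no words recognized, so no score
    return sum(confidences) / len(confidences)


def keep_confident(chunks, threshold=60.0):
    """Keep chunks whose hypothetical `ocr_confidence` meets the threshold.

    Chunks without a score (OCR was not used) are kept as they are.
    """
    return [
        c for c in chunks
        if c.get("ocr_confidence") is None or c["ocr_confidence"] >= threshold
    ]


# Hand-written sample in the shape of image_to_data output (not real OCR results).
sample = {"text": ["", "Hello", "wor1d", " "], "conf": ["-1", "96", "42", "-1"]}
score = mean_word_confidence(sample)
```

The threshold of 60 is arbitrary and only for illustration. A per-element score like this, plus a flag saying which strategy produced the text, is what I would like to see in the metadata.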

@hakankaraoguz
Author

Any updates on this?

@christinestraub
Collaborator

@hakankaraoguz Did you try the hi_res strategy? Does the detection_class_prob metadata field not work for your case?

@hkaraoguz

I will try it out, but according to this article, detection_class_prob is the class confidence of the extracted section (Table, Header, etc.) in the PDF. I am more interested in the OCR quality result when the algorithm falls back to OCR. Thank you @christinestraub
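
As a small illustration of the difference, the sketch below reads `detection_class_prob` from element dictionaries. It assumes each dictionary has the shape that `element.to_dict()` produces (a `type`, a `text`, and a `metadata` dict); the helper `layout_confidences` and the sample values are made up for the example.

```python
def layout_confidences(element_dicts):
    """Return (element type, detection_class_prob or None) for each element."""
    return [
        (el.get("type"), el.get("metadata", {}).get("detection_class_prob"))
        for el in element_dicts
    ]


# Hand-written sample elements, not output from a real document.
sample_elements = [
    {"type": "Title", "text": "Report", "metadata": {"detection_class_prob": 0.93}},
    {"type": "NarrativeText", "text": "Sc4nned t3xt", "metadata": {}},
]
pairs = layout_confidences(sample_elements)
```

Even where the value is present, it describes how confident the layout model is in the element's class, so it cannot flag text like the second sample, where the characters themselves were misrecognized.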

@MthwRobinson MthwRobinson added the ocr Related to optical character recognition (OCR). label May 23, 2024
@christinestraub christinestraub added the enhancement New feature or request label Jun 21, 2024

4 participants