
Add evaluation benchmarks #43

Open
rth opened this issue Mar 30, 2024 · 5 comments

@rth

rth commented Mar 30, 2024

Thanks for creating this package!

As discussed in #14 it would be nice to add some evaluation benchmarks. And maybe optionally compare with tesseract or some other reference open source OCR.

What datasets were you considering?

There is for instance the SROIE dataset of scanned receipts. The dataset can be found here (I couldn't find a more official source). In particular, there are two tasks described in their paper:

  • Task 1 - Scanned Receipt Text Localisation. Though after skimming their paper, I didn't fully understand how the evaluation works.
  • Task 2 - Scanned Receipt OCR. As far as I understand, this computes precision, recall and F1 score over all words (space-tokenized) extracted from the document.
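A minimal sketch of how that word-level scoring might look, matching space-tokenized words as multisets. The function name is my own and this is only my reading of the metric, not the official SROIE evaluation code:

```python
from collections import Counter

def word_prf(predicted: str, ground_truth: str):
    """Word-level precision/recall/F1 over space-tokenized text.

    Words are matched as multisets, so order is ignored and each
    occurrence of a duplicate word counts separately.
    """
    pred = Counter(predicted.split())
    gt = Counter(ground_truth.split())
    matched = sum((pred & gt).values())  # multiset intersection
    precision = matched / max(sum(pred.values()), 1)
    recall = matched / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# An extra hallucinated word lowers precision but not recall:
p, r, f = word_prf("TOTAL 12.50 CASH extra", "TOTAL 12.50 CASH")
```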
@robertknight
Owner

What datasets were you considering?

I'm currently working on a synthetic data generator. This has the advantage that it can provide coverage of many languages and domains, as long as a suitable source of text samples (e.g. Wikipedia) is available.

Suggestions for additional datasets are welcome in the ocrs-models repo. The main requirement is that they be openly licensed for any use (requiring attribution, à la CC-BY-SA, is OK).

As discussed in #14 it would be nice to add some evaluation benchmarks. And maybe optionally compare with tesseract or some other reference open source OCR.

I agree. I plan to publish metrics for the HierText dataset, which is the main dataset on which the models are trained. Additional benchmarks (for whatever datasets people are interested in) are an area where contributions are welcome.
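For published recognition metrics, character error rate (CER) is the usual headline number alongside word-level scores. A minimal sketch using a plain dynamic-programming Levenshtein distance (an illustration, not the actual HierText evaluation code):

```python
def cer(predicted: str, reference: str) -> float:
    """Character error rate: Levenshtein edit distance between the
    predicted and reference strings, divided by the reference length."""
    m, n = len(predicted), len(reference)
    # dp[j] holds the edit distance between predicted[:i] and reference[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # delete predicted[i-1]
                dp[j - 1] + 1,  # insert reference[j-1]
                prev + (predicted[i - 1] != reference[j - 1]),  # substitute
            )
            prev = cur
    return dp[n] / max(n, 1)
```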

@rth
Author

rth commented Mar 30, 2024

I'm currently working on a synthetic data generator.

https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

@robertknight
Owner

https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this.

I started with this project. I found that a recognition model trained on output from an unmodified version of it achieves a very low error rate in training, but doesn't generalize well when used with Ocrs. So I'm exploring changes to improve this (preprocessing adjustments, more varied backgrounds, more varied fonts, etc.).

One reason is that Ocrs's preprocessing cuts each line out of the surrounding image to avoid ambiguity over which text should be recognized, since a simple rectangular cut-out might include other text. Example (produced by ocrs image.jpeg --text-line-images):

[example text-line image]

Synthetic generators need to be modified to apply similar masking.
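That masking step can be sketched as: take the (possibly rotated) quadrilateral around the line, blank everything outside it, then crop to the quad's bounding box. A rough numpy-only version for a convex quad given as four (x, y) corners; the function name and details are my own illustration, not Ocrs's actual preprocessing code:

```python
import numpy as np

def mask_line(image: np.ndarray, quad: np.ndarray, fill: int = 255) -> np.ndarray:
    """Blank everything outside a convex quadrilateral, then crop.

    image: (H, W) grayscale array; quad: (4, 2) array of corners in
    (x, y) order, wound consistently (clockwise or counter-clockwise).
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sides = []
    for i in range(4):
        x0, y0 = quad[i]
        x1, y1 = quad[(i + 1) % 4]
        # Sign of the cross product says which side of edge i each pixel is on.
        sides.append((x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0))
    sides = np.stack(sides)
    # Inside if on the same side of all four edges (handles either winding).
    inside = np.all(sides >= 0, axis=0) | np.all(sides <= 0, axis=0)
    masked = np.where(inside, image, fill)
    # Crop to the quad's bounding box, clamped to the image.
    x_lo, y_lo = np.floor(quad.min(axis=0)).astype(int)
    x_hi, y_hi = np.ceil(quad.max(axis=0)).astype(int)
    x_lo, y_lo = max(x_lo, 0), max(y_lo, 0)
    x_hi, y_hi = min(x_hi, w - 1), min(y_hi, h - 1)
    return masked[y_lo:y_hi + 1, x_lo:x_hi + 1]
```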

@louis030195

Also curious about the performance of Ocrs compared to alternatives. I am using the Apple and Windows native OCR in
https://github.com/louis030195/screen-pipe

but I'm looking for a solution on Linux to replace Tesseract, whose results are very poor for my use case.

@robertknight
Owner

There are a few dimensions to consider:

  • Model size: Larger models can store more knowledge/patterns, but are slower to execute and use more memory.
  • Functionality: Some models can both detect and read text in an image; others only recognize text in a pre-cropped line image.
  • Linguistic and world knowledge: If the model is multimodal, it might be able to use that knowledge to disambiguate (e.g. to understand text by looking at the context in a photo).
  • Training data: Is the model trained to recognize printed text, handwritten text, etc.?

For model size, I consider OCR models "small" if they have a few million parameters, and large if they have hundreds of millions or more.

On those axes:

  • Apple's native OCR: Small model. Does detection + recognition. Very good accuracy in my experience.
  • Windows native OCR: I'm not familiar with it, but I would expect it to be in a similar class to Apple's solution.
  • Tesseract: Small model. Does detection + recognition, although the detection is crude. Accuracy can be good for clean document images (dark text, light background, straight lines, low background clutter), but results can be poor if the image has inverted colors, complex backgrounds, etc.
  • Ocrs: Small model. Does detection + recognition. Accuracy is not as good as Apple's OCR. More tolerant of variation in background, colors, etc. than Tesseract; accuracy vs. Tesseract varies depending on the image. Layout analysis (understanding of reading order) is quite basic.
  • TrOCR: Large model (330M params for the base variant); there is also a medium-sized (66M params) "small" variant. Does recognition only. Comes in printed and handwritten fine-tunes. Better accuracy than Tesseract and Ocrs, but more expensive to run.
  • Multimodal LLMs: Large models (parameter counts usually measured in billions). Can take advantage of their linguistic and world knowledge to understand text, but are also much more expensive to run.
