-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add evaluation benchmarks #43
Comments
I'm currently working on a synthetic data generator. This has the advantage that it can provide coverage of many languages and domains, as long as a suitable source of text samples (eg. Wikipedia) is available. Suggestions for additional datasets are welcome in the ocrs-models repo. The main requirement is that they be openly licensed for any use (requiring attribution ala. CC-BY-SA is OK).
I agree. I plan to publish metrics for the HierText dataset, which is the main dataset on which the models are trained. Additional benchmarks (for whatever datasets people are interested in) are an area where contributions are welcome. |
https://github.com/Belval/TextRecognitionDataGenerator also sounds interesting for this. |
I started with this project. I found that a recognition model trained on output from an unmodified version of it achieves very low error rate in training, but doesn't generalize well when used with Ocrs. So I'm exploring changes to improve this (preprocessing adjustments, more varied backgrounds, more varied fonts etc). One reason is that Ocrs's preprocessing cuts lines out of the surrounding image, to avoid ambiguity over which text should be recognized, as a simple rectangular cut-out might include other text. Example (produced by Synthetic generators need to be modified to apply similar masking. |
also curious about performance of OCRs compared to things like:
i am using apple & windows native OCR in but looking for solution for linux to replace tesseract which is complete garbage |
There are a few dimensions to consider:
For model size, I consider OCR models "small" if they have a few million parameters, and large if they have hundreds or more. On those axes:
|
Thanks for creating this package!
As discussed in #14 it would be nice to add some evaluation benchmarks. And maybe optionally compare with tesseract or some other reference open source OCR.
What datasets were you considering?
There is for instance the SROIE dataset of scanned recipes. The dataset can be found here (couldn't find a more official source). In particular there are two task described in their paper,
The text was updated successfully, but these errors were encountered: