-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RowAttributes getter to PageIterator #2971
Conversation
tesseract/src/ccmain/ltrresultiterator.cpp Lines 148 to 154 in 1df0970
|
@amitdo I moved When calling Lines 449 to 452 in 79c3ebb
Still, it seems to me that EDIT So here's the problem: |
Maybe we should add |
I don't think this change is necessary at all. The iterators are merely (different kinds of) high-abstraction interfaces to the internal result structure That result structure gets written to by either
(Both can of course also get called by The results are cleared by Since all you want is to avoid the
This only happens when either I am curious though: What do you use these row attributes for? IIUC there is an internal spline representation of the baseline which you can query via So if you want to make a calculation of the sort |
@bertsky Hey Robert, (1) I still believe the proposed moving of In detail: As mentioned earlier, I observed subtle differences between using PageIterator and LTRIterator. For any person not deeply familiar with the API, like me, this can be quite disturbing and time-consuming to debug, as it’s not clear whether you actually lose some data in the process. I isolated one case (see attached zip file below for code and demo image, that actually does not contain any text) that produces this behavior:
While the Also, I am not sure if there might not be other subtle differences hidden there. One difference between the two code paths is the call to Lines 813 to 825 in 79c3ebb
Lines 1339 to 1346 in 79c3ebb
Also the tesseract/src/ccmain/resultiterator.cpp Lines 449 to 465 in 79c3ebb
tesseract/src/ccmain/pageiterator.cpp Lines 147 to 162 in 79c3ebb
So, to sum up, just being able to use EDIT note though that your answer assured me so far that I now added a check for a (2) Use case. Yes, I use row attributes to compute line polygons. For my use, straight lines are good enough. An interface to the underlying baseline polynomial would be great though of course on the other hand, especially for future users. I believe Tesseract’s functionality in this area is quite unique. I imagine this would be a larger change? |
@amitdo I like the pointsize idea. |
I have not tried it out, but from the looks of it, the following might actually work already:
|
Ok, let's try to get to the bottom of this.
That's probably because tesserocr's Incidentally, I also stopped using
It's perhaps not so disturbing anymore when you know that the LSTM recognition is allowed to inject or replace words/lines into the It's hard to avoid this kind of subtlety with an API as complex as this. But this still does not mean in itself that there's a conceptual difference between result iterators instantiated as
Yes, this ensures that
That's strange indeed. So you get the latter implementation when calling the iterator straight from So you are right, there are significant differences. One should probably always use
It's also badly documented. We should definitely wrap this up and at least fix the API documentation. But I am also more inclined to support your original proposal of moving the row attributes from |
@amitdo could you please elaborate on this? |
tesseract/src/ccmain/ltrresultiterator.cpp Lines 163 to 166 in 1df0970
tesseract/src/ccmain/ltrresultiterator.cpp Lines 177 to 181 in 1df0970
Inside => The pointsize parameter belongs to |
I said it before... |
Oh, I see. Absolutely! This is worth returning, both here and in the more high-level But there is another difficulty with that: point size (as a printing measure) is IIUC a physical dimension, so you would have to relate it to a (reliable) measure of DPI / pixel density. Tesseract's DPI estimation (when the input does not have the meta-data) is based on a very simple heuristic: take the average (median) line height, and multiply by 10. |
Wait, this is already happening in current master: tesseract/src/ccmain/ltrresultiterator.cpp Lines 173 to 180 in 79c3ebb
|
If there is some use for those functions, they can easily be added again. But then we should also document how to use them and perhaps add a Python interface in tesserocr, too. |
No, it's fine. I've got enough to play with. We should focus on getting a proper function for what we actually want, without all the clutter (exposing Anyway, all this was just a minor aspect. We should discuss it in a separate issue. This PR is still about moving |
What's the bottom line? I will try to summarize:
Did I get it right? |
No – at least not AFAIC. My current status is still:
Unless you want something different which I did not grasp yet, @amitdo?
|
I will try again. The |
Oh, that. Sure, that would not hurt either! |
So, is it ok to move the pointsize info from WordFontAttributes() to RowAttributes() and break backward compatibility (5.0 already breaks 4.x compatibility), or is it better to just copy that info? I prefer the move option. |
I am very much in favour of merely copying, because the pointsize estimation has been in there since forever (discounting your fix from 3 years ago to make it also work with LSTMs). As long as we keep up the legacy option, we should not cripple it. |
OK, we will go with the copy option. |
I'd say: On the contrary, improve documentation with our above analysis, and then merge this!
Yes, let's do that by copying the 3 lines from |
|
No, don't close! We agreed that this should in fact be merged, but accompanied by better documentation, so at least future developers (including ourselves) will know what led up to this, and to fill the missing gaps that we identified above. So it should be worked on a little before it is "ready" IMHO. Regarding the other aspect, copying the |
@stweil, what do you think? If you agree, please merge. You can add a comment above the method: |
The pull request was against the 4.1 branch, so we still need that change for git master. |
Can you port it to 5.0? |
This patch was not applied to 5.0.0. |
That's bad and should be fixed. |
Allows row attributes to be accessed via PageIterator without runnning full recognition - in https://github.com/sirfz/tesserocr terms, by running only
AnalyseLayout
and notRecognize
.If one is only interested in baselines and row attributes, but not text detection, this reduces the runtime by about a factor of 10, which is a big deal for processing big collections of 100 000s of pages.
Here's some results for extracting baselines and row attributes from the same page, v1 is with this PR (hash codes in the second line are on extracted baseline and row data which show that the results are identical).
`
v1: using AnalyseLayout:
bbe705a10940601631c9a5f26f3f421a
time elapsed 2.9205799102783203
v2: using Recognize:
bbe705a10940601631c9a5f26f3f421a
time elapsed 25.913426399230957
`
Test code used:
demo.py.zip