Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RowAttributes getter to PageIterator #2971

Merged
merged 1 commit into from
Apr 12, 2021
Merged

Conversation

poke1024
Copy link
Contributor

@poke1024 poke1024 commented May 6, 2020

Allows row attributes to be accessed via PageIterator without runnning full recognition - in https://github.com/sirfz/tesserocr terms, by running only AnalyseLayout and not Recognize.

If one is only interested in baselines and row attributes, but not text detection, this reduces the runtime by about a factor of 10, which is a big deal for processing big collections of 100 000s of pages.

Here's some results for extracting baselines and row attributes from the same page, v1 is with this PR (hash codes in the second line are on extracted baseline and row data which show that the results are identical).

`
v1: using AnalyseLayout:
bbe705a10940601631c9a5f26f3f421a
time elapsed 2.9205799102783203

v2: using Recognize:
bbe705a10940601631c9a5f26f3f421a
time elapsed 25.913426399230957
`

Test code used:
demo.py.zip

@amitdo
Copy link
Collaborator

amitdo commented May 6, 2020

void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
float* ascenders) const {
*row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
it_->row()->row->descenders();
*descenders = it_->row()->row->descenders();
*ascenders = it_->row()->row->ascenders();
}

@poke1024
Copy link
Contributor Author

poke1024 commented May 7, 2020

@amitdo I moved RowAttributes from LTRResultIterator to PageIterator now, which will not break functionality, as LTRResultIterator derives from PageIterator.

When calling TessBaseAPIAnalyseLayout, one only gets a PageIterator, and it should be possible to extract row attributes from it (without calling a full TessBaseAPIRecognize).

tesseract/src/api/capi.cpp

Lines 449 to 452 in 79c3ebb

TessPageIterator*
TessBaseAPIAnalyseLayout(TessBaseAPI* handle) {
return handle->AnalyseLayout();
}

OK, I now found that you can call TessBaseAPIAnalyseLayout and then obtain row attributes via TessBaseAPIGetIterator's TessResultIterator. (see below).

Still, it seems to me that RowAttributes might belong more in PageIterator than in LTRResultIterator, as it's something that's computed in the layout analysis phase.

EDIT So here's the problem: TessBaseAPIGetIterator sometimes returns NULL after calling TessBaseAPIAnalyseLayout, even though there is a valid PageIterator, and that is why this PR is actually important.

@amitdo
Copy link
Collaborator

amitdo commented May 7, 2020

Looks fine to me.

@zdenop, @stweil, @bertsky, your opinion?

@amitdo
Copy link
Collaborator

amitdo commented May 7, 2020

Maybe we should add pointsize as in-out parameter to RowAttributes.

@bertsky
Copy link
Contributor

bertsky commented May 7, 2020

I don't think this change is necessary at all. The iterators are merely (different kinds of) high-abstraction interfaces to the internal result structure PAGE_RES TessBaseAPI::page_res_.

That result structure gets written to by either

  • AnalyseLayout(), which itself gets called by
    • GetComponentImages(),
    • GetRegions(),
    • GetTextlines()
    • GetStrips()
    • GetWords()
    • GetConnectedComponents()
      via API, or
  • Recognize(), which of course also gets called by
    • GetUTF8Text()
    • GetAllWordConfidences() etc.

(Both can of course also get called by ProcessPage(), depending on PSM.)

The results are cleared by End() or ClearResults() (which in turn are called by Clear() or subsequent SetImage() etc). As long as they live though, you can instantiate any number of iterators you want (each with distinct state).

Since all you want is to avoid the Recognize() line of events (with recog_all_words()), while accessing the row attributes which are in page_res_ after AnalyseLayout(), why don't you just use TessBaseAPI::GetLTRIterator(), which will give you the right iterator flavour?

So here's the problem: TessBaseAPIGetIterator sometimes returns NULL after calling TessBaseAPIAnalyseLayout, even though there is a valid PageIterator, and that is why this PR is actually important.

This only happens when either AnalyseLayout() already had a problem the first time (so you might have to change parameters or image rectangle or whatnot), or you called something in between which invalidates the page_res_ or tesseract_ state (which you can surely avoid).

I am curious though: What do you use these row attributes for? IIUC there is an internal spline representation of the baseline which you can query via y = ROW::base_line(x) – but only if you get access to the ROW objects themselves. But the API does not expose them. It only gives you PageIterator::Baseline() which unfortunately only is a straight line (from left of left-most word bbox to right of the right-most word bbox), not a line string.

So if you want to make a calculation of the sort polygon = baseline_points - descenders + x_height + ascenders then we would indeed need a change in the API, maybe a second polymorphic form of PageIterator::Baseline() which takes a tuple vector reference, not just start and end points.

@poke1024
Copy link
Contributor Author

poke1024 commented May 8, 2020

@bertsky Hey Robert,

(1) I still believe the proposed moving of RowAttributes to PageIterator makes sense conceptually and will fix some nasty border cases that are, at least to me, unclear in terms of whether they work as intended or actually produce wrong data.

In detail:

As mentioned earlier, I observed subtle differences between using PageIterator and LTRIterator. For any person not deeply familiar with the API, like me, this can be quite disturbing and time-consuming to debug, as it’s not clear whether you actually lose some data in the process.

I isolated one case (see attached zip file below for code and demo image, that actually does not contain any text) that produces this behavior:

demo.zip

================================================================================
using AnalyseLayout
================================================================================
ri: None
found: None
Traceback (most recent call last):
  File "demo.py", line 24, in <module>
    for r in tesserocr.iterate_level(ri, tesserocr.RIL.TEXTLINE):
  File "tesserocr.pyx", line 1119, in iterate_level
AttributeError: 'NoneType' object has no attribute 'Next'

================================================================================
using Recognize
================================================================================
ri: <tesserocr.PyResultIterator object at 0x10ece4210>
found: <tesserocr.PyResultIterator object at 0x10ece4210>

While the Recognize case works fine and as expected, calling AnalyseLayout and then getting the LTRIterator first produces a None value, then breaks tesserocr; this boils down to TessBaseAPI::GetIterator() returning NULL in the first case; for this example image, this is correct, since the image contains no text, but the behavior is different from Recognize, and that is the disturbing bit for any user not familiar with the API. Without extensive debugging it is not clear why this should be the case and if this is a bigger underlying issue that also breaks correct images (even if it does not in fact).

Also, I am not sure if there might not be other subtle differences hidden there. One difference between the two code paths is the call to ResultIterator::StartOfParagraph in the LTRIterator case (yes it's only a copy constructor, but still):

PageIterator* TessBaseAPI::AnalyseLayout(bool merge_similar_words) {
if (FindLines() == 0) {
if (block_list_->empty())
return nullptr; // The page was empty.
page_res_ = new PAGE_RES(merge_similar_words, block_list_, nullptr);
DetectParagraphs(false);
return new PageIterator(
page_res_, tesseract_, thresholder_->GetScaleFactor(),
thresholder_->GetScaledYResolution(),
rect_left_, rect_top_, rect_width_, rect_height_);
}
return nullptr;
}

tesseract/src/api/baseapi.cpp

Lines 1339 to 1346 in 79c3ebb

ResultIterator* TessBaseAPI::GetIterator() {
if (tesseract_ == nullptr || page_res_ == nullptr)
return nullptr;
return ResultIterator::StartOfParagraph(LTRResultIterator(
page_res_, tesseract_,
thresholder_->GetScaleFactor(), thresholder_->GetScaledYResolution(),
rect_left_, rect_top_, rect_width_, rect_height_));
}

Also the ResultIterator::Next does things differently than PageIterator::Next. It’s not obvious to me, as an API outsider, that all this does not make any difference:

bool ResultIterator::Next(PageIteratorLevel level) {
if (it_->block() == nullptr)
return false; // already at end!
switch (level) {
case RIL_BLOCK: // explicit fall-through
case RIL_PARA: // explicit fall-through
case RIL_TEXTLINE:
if (!PageIterator::Next(level))
return false;
if (IsWithinFirstTextlineOfParagraph()) {
// if we've advanced to a new paragraph,
// recalculate current_paragraph_is_ltr_
current_paragraph_is_ltr_ = CurrentParagraphIsLtr();
}
in_minor_direction_ = false;
MoveToLogicalStartOfTextline();
return it_->block() != nullptr;

bool PageIterator::Next(PageIteratorLevel level) {
if (it_->block() == nullptr) return false; // Already at the end!
if (it_->word() == nullptr)
level = RIL_BLOCK;
switch (level) {
case RIL_BLOCK:
it_->forward_block();
break;
case RIL_PARA:
it_->forward_paragraph();
break;
case RIL_TEXTLINE:
for (it_->forward_with_empties(); it_->row() == it_->prev_row();
it_->forward_with_empties());
break;

So, to sum up, just being able to use PageIterator in the first place without having to resort to an LTRIterator would give me peace of mind, I guess.

EDIT note though that your answer assured me so far that I now added a check for a None iterator in my productive batch code and will use that for the time being. So we're more talking about future users now.

(2) Use case. Yes, I use row attributes to compute line polygons. For my use, straight lines are good enough. An interface to the underlying baseline polynomial would be great though of course on the other hand, especially for future users. I believe Tesseract’s functionality in this area is quite unique. I imagine this would be a larger change?

@poke1024
Copy link
Contributor Author

poke1024 commented May 8, 2020

@amitdo I like the pointsize idea.

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

but only if you get access to the ROW objects themselves. But the API does not expose them.

I have not tried it out, but from the looks of it, the following might actually work already:

  • with FindLinesCreateBlockList() (which does the same FindLines() as AnalyseLayout() does, only without a page iterator result) you get a BLOCK_LIST*; beware though that this also deletes block_list_ so you won't be able to get/use a page iterator afterwards (except if you call AnalyseLayout() again)
  • pass that to FindRowForBox() along with the coordinates of what you already know belongs to the textline; this should get you a ROW*
  • use bounding_box() to get a left and right x, then iterate over this range in a number of points with base_line(); then subtract descenders() for lower margin and x_height() plus ascenders() for upper margin
  • convert everything from internal to external coords (unscale, unrotate, offset from rectangle, clip to rectangle)
    (I know, both functions are not exposed by tesserocr yet.)

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

Ok, let's try to get to the bottom of this.

While the Recognize case works fine and as expected, calling AnalyseLayout and then getting the LTRIterator first produces a None value, then breaks tesserocr; this boils down to TessBaseAPI::GetIterator() returning NULL in the first case;

That's probably because tesserocr's iterate_level is not correct itself: it never checks the Empty() status of the iterator – as nearly every usage I have seen in C++ does.

Incidentally, I also stopped using iterate_level in all of ocrd_tesserocr at some point – maybe through similar painful experiences as you, but not following up on them.

for this example image, this is correct, since the image contains no text, but the behavior is different from Recognize, and that is the disturbing bit for any user not familiar with the API.

It's perhaps not so disturbing anymore when you know that the LSTM recognition is allowed to inject or replace words/lines into the page_res_ tree. So the rule-based segmenter might not have been able to find any text, but then the recognition was confident it saw any when asked.

It's hard to avoid this kind of subtlety with an API as complex as this.

But this still does not mean in itself that there's a conceptual difference between result iterators instantiated as new PageIterator or as new LTRResultIterator. But let's see...

Also, I am not sure if there might not be other subtle differences hidden there. One difference between the two code paths is the call to ResultIterator::StartOfParagraph in the LTRIterator case (yes it's only a copy constructor, but still):

Yes, this ensures that GetIterator will behave like ResultIterator where it deviates from its parents LTRResultIterator and PageIterator. Learning more by the minute!

Also the ResultIterator::Next does things differently than PageIterator::Next.

That's strange indeed. So you get the latter implementation when calling the iterator straight from AnalyseLayout/Recognize (or from GetLTRIterator which inherits it), but the former when using GetIterator (or any of its dependents). The former also looks more sophisticated, watching over textline order within paragraphs.

So you are right, there are significant differences. One should probably always use GetIterator when the logical (reading) ordering/direction is important.

It’s not obvious to me, as an API outsider, that all this does not make any difference

It's also badly documented. We should definitely wrap this up and at least fix the API documentation.

But I am also more inclined to support your original proposal of moving the row attributes from LTRResultIterator (which you cannot access without loosing the sophisticatino of ResultIterator::Next) to PageIterator.

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

Maybe we should add pointsize as in-out parameter to RowAttributes.

@amitdo could you please elaborate on this?

@amitdo
Copy link
Collaborator

amitdo commented May 8, 2020

// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(
bool* is_bold, bool* is_italic, bool* is_underlined, bool* is_monospace,
bool* is_serif, bool* is_smallcaps, int* pointsize, int* font_id) const {

*pointsize =
scaled_yres_ > 0
? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
: 0;

Inside WordFontAttributes, pointsize is calculated from row height

=> The pointsize parameter belongs to RowAttributes.

@amitdo
Copy link
Collaborator

amitdo commented May 8, 2020

I said it before...
#1074 (comment)
(Sep 2017)

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

Inside WordFontAttributes, pointsize is calculated from row height

=> The pointsize parameter belongs to RowAttributes.

Oh, I see. Absolutely! This is worth returning, both here and in the more high-level WordFontAttributes. (And I don't think word is a total misfit here, because LSTMs are applied on that hierarchy level, even if they then may segment the "word" further.)

But there is another difficulty with that: point size (as a printing measure) is IIUC a physical dimension, so you would have to relate it to a (reliable) measure of DPI / pixel density. Tesseract's DPI estimation (when the input does not have the meta-data) is based on a very simple heuristic: take the average (median) line height, and multiply by 10.

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

Oh, I see. Absolutely! This is worth returning, both here and in the more high-level WordFontAttributes. (And I don't think word is a total misfit here, because LSTMs are applied on that hierarchy level, even if they then may segment the "word" further.)

Wait, this is already happening in current master:

float row_height = it_->row()->row->x_height() +
it_->row()->row->ascenders() -
it_->row()->row->descenders();
// Convert from pixels to printers points.
*pointsize =
scaled_yres_ > 0
? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
: 0;

@amitdo
Copy link
Collaborator

amitdo commented May 8, 2020

@bertsky,

FindLinesCreateBlockList() and FindRowForBox() were removed in #2957.

@stweil
Copy link
Member

stweil commented May 8, 2020

FindLinesCreateBlockList() and FindRowForBox() were removed in #2957.

If there is some use for those functions, they can easily be added again. But then we should also document how to use them and perhaps add a Python interface in tesserocr, too.

@bertsky
Copy link
Contributor

bertsky commented May 8, 2020

If there is some use for those functions, they can easily be added again. But then we should also document how to use them and perhaps add a Python interface in tesserocr, too.

No, it's fine. I've got enough to play with. We should focus on getting a proper function for what we actually want, without all the clutter (exposing BLOCK_LIST and ROW, invalidating block_list_). Maybe we can even use the existing iterator API for the baseline-based polygon retrieval – e.g. by allowing BlockPolygon on RIL_TEXTLINE`.

Anyway, all this was just a minor aspect. We should discuss it in a separate issue. This PR is still about moving RowAttributes.

@amitdo
Copy link
Collaborator

amitdo commented May 9, 2020

@bertsky,

What's the bottom line?

I will try to summarize:

  1. Keep RowAttributes() where it is now (Close PR without merging).
  2. Add pointsize as in-out parameter to RowAttributes() in a new PR.

Did I get it right?

@bertsky
Copy link
Contributor

bertsky commented May 9, 2020

I will try to summarize:

1. Keep `RowAttributes()` where it is now (Close PR without merging).

2. Add `pointsize` as in-out parameter to `RowAttributes()` in a new PR.

Did I get it right?

No – at least not AFAIC.

My current status is still:

Also the ResultIterator::Next does things differently than PageIterator::Next.

So you are right, there are significant differences. One should probably always use GetIterator when the logical (reading) ordering/direction is important.

It’s not obvious to me, as an API outsider, that all this does not make any difference

It's also badly documented. We should definitely wrap this up and at least fix the API documentation.

But I am also more inclined to support your original proposal of moving the row attributes from LTRResultIterator (which you cannot access without loosing the sophistication of ResultIterator::Next) to PageIterator.

Maybe we should add pointsize as in-out parameter to RowAttributes.

Wait, this is already happening in current master:

Unless you want something different which I did not grasp yet, @amitdo?

So if you want to make a calculation of the sort polygon = baseline_points - descenders + x_height + ascenders then we would indeed need a change in the API, maybe a second polymorphic form of PageIterator::Baseline() which takes a tuple vector reference, not just start and end points.

Maybe we can even use the existing iterator API for the baseline-based polygon retrieval – e.g. by allowing BlockPolygon on RIL_TEXTLINE`.

@amitdo
Copy link
Collaborator

amitdo commented May 9, 2020

Wait, this is already happening in current master:

Unless you want something different which I did not grasp yet, @amitdo?

I will try again.

The pointsize is inside WordFontAttributes().
It is not logical to put it there - you get the same results for every word in the row. The right place for it is RowAttributes.

@bertsky
Copy link
Contributor

bertsky commented May 9, 2020

The pointsize is inside WordFontAttributes().
It is not logical to put it there - you get the same results for every word in the row. The right place for it is RowAttributes.

Oh, that. Sure, that would not hurt either!

@amitdo
Copy link
Collaborator

amitdo commented May 9, 2020

Sure, that would not hurt either!

So, is it ok to move the pointsize info from WordFontAttributes() to RowAttributes() and break backward compatibility (5.0 already breaks 4.x compatibility), or is it better to just copy that info?

I prefer the move option.

@bertsky
Copy link
Contributor

bertsky commented May 9, 2020

So, is it ok to move the pointsize info from WordFontAttributes() to RowAttributes() and break backward compatibility (5.0 already breaks 4.x compatibility), or is it better to just copy that info?

I am very much in favour of merely copying, because the pointsize estimation has been in there since forever (discounting your fix from 3 years ago to make it also work with LSTMs). As long as we keep up the legacy option, we should not cripple it.

@amitdo
Copy link
Collaborator

amitdo commented May 9, 2020

OK, we will go with the copy option.

@bertsky
Copy link
Contributor

bertsky commented Aug 26, 2020

What's the bottom line?

I will try to summarize:

1. Keep `RowAttributes()` where it is now (Close PR without merging).

I'd say: On the contrary, improve documentation with our above analysis, and then merge this!

2. Add `pointsize` as in-out parameter to `RowAttributes()` in a new PR.

Yes, let's do that by copying the 3 lines from WordFontAttributes to RowAttributes.

@zdenop
Copy link
Contributor

zdenop commented Oct 10, 2020

Yes, let's do that by copying the 3 lines from WordFontAttributes to RowAttributes
Did it happens? Can I close this PR?

@bertsky
Copy link
Contributor

bertsky commented Oct 10, 2020

Did it happens? Can I close this PR?

No, don't close!

We agreed that this should in fact be merged, but accompanied by better documentation, so at least future developers (including ourselves) will know what led up to this, and to fill the missing gaps that we identified above. So it should be worked on a little before it is "ready" IMHO.

Regarding the other aspect, copying the pointsize calculation from WordFontAttributes to RowAttributes (and exposing it with an additional pointer), no this has not been done yet. It could be done here or in a separate PR.

@amitdo
Copy link
Collaborator

amitdo commented Apr 12, 2021

We agreed that this should in fact be merged

@stweil, what do you think? If you agree, please merge.

You can add a comment above the method:
// NOTE: See #2971.

@stweil stweil merged commit 3cedf19 into tesseract-ocr:4.1 Apr 12, 2021
@stweil
Copy link
Member

stweil commented Apr 12, 2021

The pull request was against the 4.1 branch, so we still need that change for git master.

@amitdo
Copy link
Collaborator

amitdo commented Apr 30, 2021

@stweil,

Can you port it to 5.0?

@amitdo
Copy link
Collaborator

amitdo commented Nov 19, 2021

This patch was not applied to 5.0.0.

@stweil stweil added this to the 5.0.0 milestone Nov 19, 2021
@stweil
Copy link
Member

stweil commented Nov 19, 2021

That's bad and should be fixed.

@stweil
Copy link
Member

stweil commented Nov 19, 2021

@poke1024, do you already use the latest code? Then maybe you could test your patch with Tesseract 5.0.0 in pull request #3651.

@amitdo amitdo added the API label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants