Method WordFontAttributes does not work #1074

Closed
zikcheng opened this issue Aug 11, 2017 · 41 comments

Comments

@zikcheng

Environment

  • Tesseract Version: tesseract 4.00.00alpha
  • Commit Number: 8e55e52
  • Platform: Ubuntu 16.04.1

Current Behavior:

The WordFontAttributes method returns null when tesseract 4.00.00alpha is used with the 4.00 tessdata, but it returns the font name when tesseract 4.00.00alpha is used with the 3.04.00 tessdata. The test image is eurotext.tif.
I first ran into this problem when using tesserocr (sirfz/tesserocr#68).

Expected Behavior:

The WordFontAttributes method should return the correct font attributes of recognized words.

@amitdo
Collaborator

amitdo commented Aug 11, 2017

The new LSTM engine does not support this feature and probably won't support it any time soon.

@phildrip

Is there an alternative way to get font sizing etc? Do you mean that just this method won't be supported, or the feature in general?

@amitdo
Collaborator

amitdo commented Aug 31, 2017

Is there an alternative way to get font sizing etc?

You can still use --oem 0 with traineddata from here: https://github.com/tesseract-ocr/tessdata.
Note that the traineddata in the 'best' folder won't work with --oem 0.

@amitdo
Collaborator

amitdo commented Aug 31, 2017

Do you mean that just this method won't be supported, or the feature in general?

I have reasons to believe that the new LSTM engine is unlikely to have a feature that includes font identification (name and properties like is_bold) in the near future.

Important note: I'm a contributor from the community, and the main developer does not always share all his plans for upcoming release(s) with the community.

@phildrip

phildrip commented Sep 1, 2017

Thanks for the reply! It looks like the old ocr engine is going to be removed, though (issue #707)... And does using OcrEngineMode 0 mean the behaviour is the same as v3?

What I'm getting to is:

  1. I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?
  2. If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?
  3. Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

Thanks again for the help!

@amitdo
Collaborator

amitdo commented Sep 1, 2017

It looks like the old ocr engine is going to be removed, though (issue #707)...

It's not known when exactly it will be removed. Until then you can still use it.

And does using OcrEngineMode 0 mean the behaviour is the same as v3?

It's basically the same as 3.05.01.

I need to be able to extract font size information (font names aren't so useful) - is there any way at all of doing so with LSTM/v4?

There is no method in the API to get font sizes for the lstm engine.

If I use OcrEngineMode 0 to be able to get this info, will that be removed from v4 at a later date?

Probably yes.

Is there any advantage to using v4 with OcrEngineMode 0 vs v3.05?

The accuracy should be the same.

@amitdo
Collaborator

amitdo commented Sep 1, 2017

The relative font size of a textline can be estimated by calculating the x-height of the line and comparing it
to the median x-height of the other textlines on the page.
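
A minimal sketch of that estimate, assuming the x-heights (in pixels) of the textlines have already been collected by the caller. The function names MedianXHeight and RelativeFontSize are hypothetical and are not part of the Tesseract API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a non-empty list of x-heights in pixels. The vector is taken by
// value because std::nth_element reorders its input.
double MedianXHeight(std::vector<double> xheights) {
  const std::size_t mid = xheights.size() / 2;
  std::nth_element(xheights.begin(), xheights.begin() + mid, xheights.end());
  const double upper = xheights[mid];
  if (xheights.size() % 2 == 1) return upper;
  // Even count: average the two middle values.
  const double lower =
      *std::max_element(xheights.begin(), xheights.begin() + mid);
  return (lower + upper) / 2.0;
}

// Ratio of one line's x-height to the median x-height of the other lines.
// A value above 1 suggests the line is set larger than typical body text.
// Returns 1.0 when there is nothing to compare against (an assumption made
// for this sketch, not a Tesseract convention).
double RelativeFontSize(double line_xheight,
                        const std::vector<double>& other_xheights) {
  if (other_xheights.empty()) return 1.0;
  const double median = MedianXHeight(other_xheights);
  return median > 0.0 ? line_xheight / median : 1.0;
}
```

For example, a line with an x-height of 40 pixels on a page whose other lines measure 20, 20 and 22 pixels comes out at twice the typical size.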

@phildrip

phildrip commented Sep 1, 2017

Ok, thanks for the info 👍

@amitdo
Collaborator

amitdo commented Sep 1, 2017

@phildrip,

I looked at the relevant code again, and I think the font size functionality (but not font name and properties like is_bold) can be restored when using the lstm engine.

I will provide further details (and probably send a PR) in the upcoming days.

@phildrip

phildrip commented Sep 1, 2017

That's great news, thanks!

@amitdo
Collaborator

amitdo commented Sep 2, 2017

// Returns the font attributes of the current word. If iterating at a higher
// level object than words, eg textlines, then this will return the
// attributes of the first word in that textline.
// The actual return value is a string representing a font name. It points
// to an internal table and SHOULD NOT BE DELETED. Lifespan is the same as
// the iterator itself, ie rendered invalid by various members of
// TessBaseAPI, including Init, SetImage, End or deleting the TessBaseAPI.
// Pointsize is returned in printers points (1/72 inch.)
const char* LTRResultIterator::WordFontAttributes(bool* is_bold,
                                                  bool* is_italic,
                                                  bool* is_underlined,
                                                  bool* is_monospace,
                                                  bool* is_serif,
                                                  bool* is_smallcaps,
                                                  int* pointsize,
                                                  int* font_id) const {
  if (it_->word() == NULL) return NULL;  // Already at the end!
  if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
  }
  const FontInfo& font_info = *it_->word()->fontinfo;
  *font_id = font_info.universal_id;
  *is_bold = font_info.is_bold();
  *is_italic = font_info.is_italic();
  *is_underlined = false;  // TODO(rays) fix this!
  *is_monospace = font_info.is_fixed_pitch();
  *is_serif = font_info.is_serif();
  *is_smallcaps = it_->word()->small_caps;
  float row_height = it_->row()->row->x_height() +
      it_->row()->row->ascenders() - it_->row()->row->descenders();
  // Convert from pixels to printers points.
  *pointsize = scaled_yres_ > 0
      ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
      : 0;

  return font_info.name;
}

The problem:

if (it_->word()->fontinfo == NULL) {
    *font_id = -1;
    return NULL;  // No font information.
}

With the LSTM engine the it_->word()->fontinfo will always be NULL.
So pointsize has no chance to be calculated.

pointsize is calculated based on row (=line) height. pointsize is the font size in points of the line, so it should not be in WordFontAttributes().

There is another function where you can get row height.

void LTRResultIterator::RowAttributes(float* row_height, float* descenders,
                                      float* ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
                it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}

I think pointsize calculation should be moved into this function.
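
For illustration, the conversion itself can be isolated as a small helper that needs only the row height and the vertical resolution, and no font information. This is a sketch: the name RowHeightToPointSize is hypothetical, and kPointsPerInch is redefined here as 72 to match the "1/72 inch" note in the comment quoted above.

```cpp
// Printer's points per inch, as described in the quoted comment.
constexpr int kPointsPerInch = 72;

// Converts a row height in pixels to printer's points, rounding to the
// nearest integer. Returns 0 when the resolution is unknown, mirroring the
// fallback in the quoted WordFontAttributes code.
int RowHeightToPointSize(float row_height_px, int yres_dpi) {
  return yres_dpi > 0
      ? static_cast<int>(row_height_px * kPointsPerInch / yres_dpi + 0.5)
      : 0;
}
```

With a 300 dpi image, a row height of 50 pixels corresponds to 12 points.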

@amitdo
Collaborator

amitdo commented Sep 3, 2017

@zdenop, @stweil
Do you have any comment?

@zdenop
Contributor

zdenop commented Sep 5, 2017

At the moment I have limited internet access. If you make a pull request I can merge it ;-)

@stweil
Member

stweil commented Sep 6, 2017

Although my current main focus is getting the text from images, there are also important use cases where text attributes are important as well. As I understand your comments, currently the new LSTM recognizer does not support the method WordFontAttributes, so it is not possible to get text attributes with that recognizer. Adding support for the font size recognition with LSTM seems to be feasible, but other text attributes like for example bold or italic are desirable, too.

@theraysmith
Contributor

It would be feasible to add bold and italic attributes by making them a separate output from the model.
Underline would also be possible.
All these attributes would require changes to the rendering pipeline, and datapath for the ground truth.
Fixed-pitch (monospace), serif and smallcaps would be much more difficult, due to the lack of reliable data available for the fonts. It could be possible to re-use the existing fontinfo table for that.
I wouldn't rule it out as impossible, but I will add this request to my list of stoppers for obsoleting the old engine.
I have a bunch of updates to push, which I didn't quite get to before my office move...

@stweil
Member

stweil commented Sep 7, 2017

Thank you for this clarification, Ray.

@amitdo
Collaborator

amitdo commented Sep 7, 2017

Thank you for this clarification, Ray.

+1

Ray,
In the meantime, can I fix the font size issue?
#1074 (comment)

@theraysmith
Contributor

Yes of course. Just re-order the code in WordFontAttributes.
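
A sketch of what such a re-ordering could look like, using hypothetical stand-in types (FontInfoStub, WordStub) rather than Tesseract's internal classes, and a reduced parameter list. It is not the code that was eventually merged; it only shows the point size being computed before the early return for missing font information.

```cpp
// Hypothetical stand-ins for Tesseract's internal types. Only the fields
// needed for this illustration are modeled.
struct FontInfoStub {
  const char* name;
  int universal_id;
  bool bold;
};

struct WordStub {
  const FontInfoStub* fontinfo;  // nullptr when the engine gives no font data
  float row_height_px;           // x-height + ascenders - descenders
};

// Re-ordered variant: point size is derived from row geometry alone, so it
// is set before the fontinfo check and survives the early return.
const char* WordFontAttributesReordered(const WordStub* word, int yres,
                                        bool* is_bold, int* pointsize,
                                        int* font_id) {
  if (word == nullptr) return nullptr;  // Already at the end.
  *pointsize = yres > 0
      ? static_cast<int>(word->row_height_px * 72 / yres + 0.5)
      : 0;
  if (word->fontinfo == nullptr) {
    *font_id = -1;
    return nullptr;  // No font information, but pointsize is already set.
  }
  *font_id = word->fontinfo->universal_id;
  *is_bold = word->fontinfo->bold;
  return word->fontinfo->name;
}
```

In this sketch a word without font information still yields a usable pointsize, while font_id is -1 and is_bold is left untouched.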

@amitdo
Collaborator

amitdo commented Sep 7, 2017

Yes of course. Just re-order the code in WordFontAttributes.

That was my first thought, but it seems to give you the font size at the line level, while the name of the method implies otherwise (WordFontAttributes), so I suggested moving pointsize to the RowAttributes() method.

@Shreeshrii
Collaborator

It would be feasible to add bold and italic attributes by making them a separate output from the model. Underline would also be possible.

You could also take bold/italic into account when people use multiple languages for recognition, because the words in the additional language are often emphasized with bold or italics.

For an example, see the image in tesseract-ocr/langdata#4 (comment) where Roman transliteration of Hindi is italicized with English text.

@Shreeshrii
Collaborator

it seems to give you font size in the line level

While that would work in most cases, what about the extreme case of text of different sizes on the same line, e.g. http://www.teach-ict.com/programming/html/intro/step17a.jpg

@theraysmith
Contributor

That has always been a problem.
The old code would often output garbage.
The LSTM engine will split the line at such words and recognize them separately, pasting the results back together. It doesn't give an estimate of the x-height, though. The overall accuracy on such images is better.

@Shreeshrii
Collaborator

@theraysmith Please see related issue #538

regarding recognition problems when an image has many different font sizes in it.

@vtigranv

+1

@troplin

troplin commented Jun 28, 2018

IMO the current state of this method is not very satisfying.
In version 3, it was clear that no information was available if the method returned NULL.

Now in version 4 with LSTM, the method returns NULL, but the font size is still computed. The rest of the properties currently seem to be set to true unconditionally.
It's not possible to find out whether those are actually correct or just garbage.

At the very least, the method should not change the values if the information is not available.

@amitdo
Collaborator

amitdo commented Jun 28, 2018

It's not possible to find out, if those are actually correct or just garbage.

What's the value of font_id?

@troplin

troplin commented Jun 28, 2018

font_id is -1.
I realize that I can probably just assume that the font size is always correct and the rest only if the method returns something != NULL or if font_id != -1.

But that's just implicit knowledge and not at all clear from the signature.
And going forward, if e.g. the bold property is correctly recognized too in a future version, there's no way to recognize that.
I'd very much prefer an API where it is inherently clear which properties are meaningful and which aren't, without relying on implicit knowledge.
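
One way to picture such an API is a result type that carries explicit validity flags next to the values. The sketch below is purely hypothetical (FontAttributes and FromLegacyOutputs do not exist in Tesseract). It wraps the existing out-parameters using the implicit rule described above: point size is trusted whenever it is positive, and the other properties only when a name was returned and font_id is not -1.

```cpp
#include <string>

// Hypothetical result type: each group of values has a flag saying whether
// it is meaningful, so callers do not need implicit knowledge.
struct FontAttributes {
  bool has_pointsize = false;
  int pointsize = 0;
  bool has_style = false;  // covers font_name, is_bold and is_italic
  std::string font_name;
  bool is_bold = false;
  bool is_italic = false;
};

// Builds a FontAttributes from values in the shape WordFontAttributes
// produces today. Style fields are copied only when font data is present.
FontAttributes FromLegacyOutputs(const char* name, bool is_bold,
                                 bool is_italic, int pointsize, int font_id) {
  FontAttributes result;
  if (pointsize > 0) {
    result.has_pointsize = true;
    result.pointsize = pointsize;
  }
  if (name != nullptr && font_id != -1) {
    result.has_style = true;
    result.font_name = name;
    result.is_bold = is_bold;
    result.is_italic = is_italic;
  }
  return result;
}
```

A caller would then check has_style before reading is_bold, so a future engine that recognizes bold text could set the flag without changing the meaning of existing fields.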

@amitdo
Collaborator

amitdo commented Jun 28, 2018

See also #1074 (comment)

@hoangaeye

Do we have a solution for this?

@amitdo
Collaborator

amitdo commented Jan 21, 2020

As you can see the issue is still open.

It's unknown when font name, bold and italic identification will be supported for the LSTM engine.

@hoangaeye

Is there another method or package that can determine font size?

@amitdo
Collaborator

amitdo commented Jan 21, 2020

Font size is supported:

#1173

@amitdo
Collaborator

amitdo commented Jan 21, 2020

@pdiwadkar

Is this issue still open?

@amitdo
Collaborator

amitdo commented Jan 8, 2021

Is this issue still open?

#1074 (comment)

@shubham1206agra

Could you please provide a partial solution to this, such as access to the font size only? I think that is already supported.

@amitdo
Collaborator

amitdo commented May 20, 2021

#1074 (comment)

@coco2121

coco2121 commented Jun 7, 2021

Hello!
Is this issue still open?
I need to get some font properties from a scanned PDF, such as whether text is bold or underlined.
WordFontAttributes is returning None. Any suggestions on what I can use to get these properties?

Thanks!

@kalai2033

kalai2033 commented Nov 10, 2021

@coco2121 Hi, did you manage to find a solution? I am trying to solve exactly the same problem.

@amitdo
Collaborator

amitdo commented Nov 10, 2021

The LSTM engine does not support font attributes other than point size, and as I said 4 years ago, it won't support these attributes any time soon (it is not planned).

However, the legacy engine is still available in versions 4.x and 5.x, and it supports these attributes. You need a model that includes data for the legacy engine, and you need to use --oem 0 (it might also work with --oem 3, not sure).

@tesseract-ocr tesseract-ocr locked and limited conversation to collaborators Nov 10, 2021
@amitdo
Collaborator

amitdo commented Nov 10, 2021

If you still have a question about this topic after reading my previous comment, please use our forum.

I locked this issue because people keep asking the same questions here, and I have already answered them multiple times.
