Rotated texts in GT data set #28

Open
stweil opened this issue Aug 28, 2020 · 6 comments

Comments

@stweil
Member

stweil commented Aug 28, 2020

A number of pages contain text written vertically. This typically occurs in the header rows of tables.

Example page: ONB_ibn_19110701_010.tif

The corresponding line boxes are not rotated, so the extracted line images also contain vertically written text and are not suitable for training or evaluation. In addition, the sample line image contains two text lines.

@stweil
Member Author

stweil commented Aug 28, 2020

A simple workaround would be to ignore all line images whose height is larger than their width. Images with a small height should not be ignored, though (otherwise short lines such as the numbers I or 1 might not be trained).
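A minimal sketch of this workaround, assuming the width and height of each line image are already known. The function name `keep_line_image` and the `min_height` threshold of 40 pixels are made up for illustration and would need tuning on the real data.

```python
def keep_line_image(width, height, min_height=40):
    """Decide whether a line image should be used for training.

    min_height is a hypothetical threshold: images at or below it are
    always kept, so that narrow single glyphs such as "I" or "1" survive.
    """
    if height <= min_height:
        return True
    # Taller than wide: probably vertical text, so skip it.
    return height <= width


# The example line from this issue is roughly 62 px wide and 201 px high.
print(keep_line_image(62, 201))   # vertical text
print(keep_line_image(12, 30))    # small image, e.g. a single digit
```

Nearly square images remain ambiguous with this rule, so it can only be a first filter.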

@wollmers
Contributor

wollmers commented May 29, 2021

Sounds challenging.

Assuming mostly correct transcriptions, an improved heuristic could estimate the expected line proportions from font metrics. It would not work for nearly square images, though.

I just looked into the PAGE XML of the above example, ONB_ibn_19110701_010.xml:

      <TextLine id="line_1547100913156_36" custom="readingOrder {index:2;}">
        <Coords points="2307,2829 2320,2628 2370,2631 2357,2832"/>
        <Baseline points="2352,2832 2365,2631"/>
        <TextEquiv>
          <Unicode>Celsiusgraden</Unicode>
        </TextEquiv>
      </TextLine>

The Baseline tells us:

x1 - x2 = 2352 - 2365 = -13
y1 - y2 = 2832 - 2631 = 201
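As a rough sketch, the angle can be derived from those two differences with `atan2`. The helper name `baseline_angle` is hypothetical, and using only the first and last baseline point is an assumption that ignores any intermediate points. Because the image y axis points down, the sign of the y difference is flipped so that positive angles mean counter-clockwise on the page.

```python
import math


def baseline_angle(points):
    """Angle in degrees of a PAGE XML baseline, from first to last point.

    0 means left-to-right horizontal text; positive values are
    counter-clockwise as seen on the page (image y axis points down).
    """
    coords = [tuple(map(int, p.split(","))) for p in points.split()]
    (x1, y1), (x2, y2) = coords[0], coords[-1]
    return math.degrees(math.atan2(y1 - y2, x2 - x1))


angle = baseline_angle("2352,2832 2365,2631")
print(round(angle, 1))  # about 86.3, i.e. text running bottom to top
```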

We can calculate the skew (and orientation) from these differences, or just OCR the line image with Tesseract:

    <p class='ocr_par' id='par_1_1' lang='ubma/frak2021_0.905_1587027_9141630' title="bbox 0 5 62 201">
     <span class='ocr_line' id='line_1_1' title="bbox 27 5 62 201; baseline -65.333 522.667; x_size 34; x_descenders 8; x_ascenders 8">
      <span class='ocrx_word' id='word_1_1' title='bbox 27 5 62 201; x_wconf 0'>vabsnni</span>
     </span>
     <span class='ocr_line' id='line_1_2' title="bbox 0 52 20 142; baseline -45 0; x_size 35.5; x_descenders 8.5; x_ascenders 8.5">
      <span class='ocrx_word' id='word_1_2' title='bbox 2 52 20 79; x_wconf 40'>11</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 0 100 19 142; x_wconf 27'>an!</span>
     </span>
    </p>

Taking the arctangent of each baseline slope gives

baseline -65.333  => -89.1230877852261 ~ -90 degrees clockwise
baseline -45      => -88.7269699799433 ~ -90 degrees clockwise
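For reference, a small sketch of that conversion. It assumes the first number after `baseline` in the hOCR `title` attribute is the slope of the baseline and takes its arctangent; the function name `hocr_baseline_angle` is made up for this example.

```python
import math
import re

# Matches "baseline <slope> <offset>" inside an hOCR title attribute.
BASELINE_RE = re.compile(r"baseline\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def hocr_baseline_angle(title):
    """Return the baseline angle in degrees from an hOCR title string."""
    match = BASELINE_RE.search(title)
    if match is None:
        raise ValueError("no baseline property in title")
    slope = float(match.group(1))
    return math.degrees(math.atan(slope))


print(hocr_baseline_angle("bbox 27 5 62 201; baseline -65.333 522.667; x_size 34"))
print(hocr_baseline_angle("bbox 0 52 20 142; baseline -45 0; x_size 35.5"))
```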

Result of Tesseract on the rotated image (CER 0.0):

    <p class='ocr_par' id='par_1_1' lang='ubma/frak2021_0.905_1587027_9141630' title="bbox 3 0 199 62">
     <span class='ocr_line' id='line_1_1' title="bbox 62 0 152 20; baseline 0.011 -1; x_size 35.5; x_descenders 8.5; x_ascenders 8.5">
      <span class='ocrx_word' id='word_1_1' title='bbox 62 0 104 19; x_wconf 34'>ur</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 125 2 152 20; x_wconf 82'>in</span>
     </span>
     <span class='ocr_line' id='line_1_2' title="bbox 3 27 199 62; baseline 0.015 -9; x_size 34; x_descenders 8; x_ascenders 8">
      <span class='ocrx_word' id='word_1_3' title='bbox 3 27 199 62; x_wconf 86'>Celſiusgraden</span>
     </span>
    </p>

Now we can cut out the image of the line most similar to the ground truth and update the PAGE XML (keeping the semantics of <Baseline points="2352,2832 2365,2631" />).
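Picking the "most similar line" could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, which is only one possible choice, and it maps the long s (ſ) to s before comparing, which is an assumption about how the ground truth is transcribed. The function name `most_similar_line` is hypothetical.

```python
from difflib import SequenceMatcher


def most_similar_line(ground_truth, ocr_lines):
    """Return the OCR line whose text is closest to the ground truth."""

    def score(text):
        # Assumption: the ground truth uses a plain "s" where OCR gives "ſ".
        normalized = text.replace("ſ", "s")
        return SequenceMatcher(None, ground_truth, normalized).ratio()

    return max(ocr_lines, key=score)


# The two lines Tesseract found in the rotated image above.
print(most_similar_line("Celsiusgraden", ["ur in", "Celſiusgraden"]))
```

The bounding box of the selected line (here `bbox 3 27 199 62`) would then give the crop region inside the rotated image.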

@wollmers
Contributor

Just for information:

There are 273 TextLine entries in the XML files where the skew of the Baseline is larger than (+/-) 10 degrees. Some of them already have rotated line images. Some of the skews are not a multiple of 90 degrees, mainly in advertisements.
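How this count was produced is not stated here. One way to get a comparable list might be the following sketch, which measures the skew between the first and last baseline point and skips lines without a usable baseline. The function names and the exact definition of skew are assumptions, so the resulting number could differ from 273.

```python
import math
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def skewed_line_ids(page_xml, threshold=10.0):
    """Return ids of TextLines whose baseline skew exceeds the threshold."""
    ids = []
    for line in ET.fromstring(page_xml).iter():
        if local_name(line.tag) != "TextLine":
            continue
        baseline = next(
            (c for c in line if local_name(c.tag) == "Baseline"), None
        )
        if baseline is None:
            continue  # some lines have no baseline at all
        points = [
            tuple(map(int, p.split(",")))
            for p in baseline.get("points", "").split()
        ]
        if len(points) < 2:
            continue
        (x1, y1), (x2, y2) = points[0], points[-1]
        # Image y axis points down, hence y1 - y2.
        angle = math.degrees(math.atan2(y1 - y2, x2 - x1))
        if abs(angle) > threshold:
            ids.append(line.get("id"))
    return ids
```

Calling `skewed_line_ids(open(path, encoding="utf-8").read())` for each XML file and summing the list lengths would give the total.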

@stweil
Member Author

stweil commented Apr 30, 2023

@JKamlah, the latest update now has better baselines and bounding coordinates, but still no indicator whether some text is written vertically or otherwise rotated. I am not sure whether PAGE XML has a special indicator for rotated text or whether it only relies on the baseline information. Here is an example of a baseline for vertical text: <Baseline points="2761,4416 2764,4078 2761,3869"/>. Maybe that baseline could be simplified by removing the 2nd point.

What should we do with textlines without a baseline? Such textlines exist in the latest PAGE XML.

@JKamlah
Member

JKamlah commented May 2, 2023

@stweil there are two attributes for the TextRegionType in the PAGE XML format: orientation and readingOrientation. I will see if it is possible to add some information about the text rotation in Transkribus. At least for vertically stacked text the baseline points will not be sufficient without further information.

There should be baseline information for every textline. We will fix this immediately.

@JKamlah
Member

JKamlah commented May 3, 2023

The baselines should now all be in place: 51fd52e
