Rotated texts in GT data set #28
Comments
A simple workaround could ignore all line images with an image height larger than the image width. Images with a small height should not be ignored (otherwise line numbers like |
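A minimal sketch of that workaround, assuming the width and height of each line image are already known. The function name `is_probably_rotated` and the `min_height` threshold of 40 pixels are illustrative assumptions, not values from this project; the threshold would need tuning against the actual data set.

```python
def is_probably_rotated(width: int, height: int, min_height: int = 40) -> bool:
    """Flag a line image as likely rotated if it is taller than it is wide.

    Images below `min_height` (an assumed threshold) are never flagged,
    so that short but narrow lines are kept.
    """
    if height < min_height:
        return False
    return height > width


# Hypothetical (width, height) pairs for three line images.
sizes = [(30, 200), (20, 30), (500, 60)]
kept = [size for size in sizes if not is_probably_rotated(*size)]
```

With these made-up sizes, only the 30x200 image is dropped; the 20x30 image is kept because it is below the height threshold.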
Sounds challenging. An improved heuristic, assuming mostly correct transcriptions, could estimate the expected line proportions from font metrics. But it would not work for nearly square images. I just looked into the PAGE XML of the above example
The
We can calculate the skew (and orientation) from it or just OCR the line image with Tesseract:
This gives
Result of Tesseract on the rotated image (CER 0.0):
Now we can cut out the image of the most similar line and update the PAGE XML (keeping the semantics of |
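One way to pick the "most similar line" is to compare the OCR result against each ground-truth transcription by character error rate (CER). The following is only a sketch under that assumption; the helper names and the sample strings are hypothetical, and the thread does not say which similarity measure was actually used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis relative to reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def most_similar_line(ocr_text: str, gt_lines: list) -> int:
    """Index of the ground-truth line with the lowest CER against ocr_text."""
    return min(range(len(gt_lines)), key=lambda i: cer(gt_lines[i], ocr_text))


# Hypothetical ground-truth lines and OCR output for a rotated table header.
gt_lines = ["Menge", "Gewicht", "Wert"]
best = most_similar_line("Gewicht", gt_lines)
```

A CER of 0.0 for the best match, as reported above for the rotated image, means the OCR output and the transcription are identical.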
Just for information: There are 273 |
@JKamlah, the latest update now has better baselines and bounding coordinates, but still no indicator whether some text is written vertically or otherwise rotated. I am not sure whether PAGE XML has a special indicator for rotated text or whether it only relies on the baseline information. Here is an example of a baseline for vertical text: What should we do with text lines without a baseline? Such text lines exist in the latest PAGE XML.
@stweil there are two attributes for `TextRegionType` in the PAGE XML format: `orientation` and `readingOrientation`. I will see if it is possible to add some information about the text rotation in Transkribus. At least for vertically stacked text, the baseline points will not be sufficient without further information. There should be baseline information for every text line. We will fix this immediately.
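As a rough illustration of how a line's direction could be derived from baseline points alone, the sketch below parses a PAGE XML style `points` string (`"x1,y1 x2,y2 ..."`) and measures the angle between the first and last point. The function names, the 45 degree tolerance, and the sample coordinates are assumptions for illustration. As noted above, this cannot detect vertically stacked text, whose baseline may still look like that of an ordinary line.

```python
import math
from typing import Optional


def baseline_angle(points: Optional[str]) -> Optional[float]:
    """Angle in degrees from the first to the last baseline point.

    Image coordinates are assumed (y grows downward), so a baseline
    running from bottom to top yields a negative angle.
    Returns None if the baseline is missing or has fewer than two points.
    """
    if not points:
        return None
    coords = [tuple(float(v) for v in p.split(",")) for p in points.split()]
    if len(coords) < 2:
        return None
    (x0, y0), (x1, y1) = coords[0], coords[-1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def classify(angle: Optional[float], tolerance: float = 45.0) -> str:
    """Coarse direction label; `tolerance` is an assumed cut-off."""
    if angle is None:
        return "unknown"
    if abs(angle) <= tolerance:
        return "horizontal"
    if abs(angle) >= 180.0 - tolerance:
        return "upside-down"
    return "vertical-down" if angle > 0 else "vertical-up"


# Hypothetical baseline of a table header written from bottom to top.
label = classify(baseline_angle("100,500 100,100"))
```

Lines without a baseline are reported as "unknown" here rather than guessed, which keeps them easy to filter out or inspect manually.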
The baselines should now all be in place: 51fd52e
A certain number of pages contain text written vertically. This is typically used in head rows of tables.
Example page: ONB_ibn_19110701_010.tif
The corresponding line boxes are not rotated, so they also contain text written vertically. They are not suitable for training or evaluation. In addition, the sample line image contains two text lines.