pagecontent example is broken #23
Comments
So it's a good real-world example then ;P
Understood. (Perhaps you want to include that explanation in the schema's documentation.) But that still does not relieve you of avoiding self-intersection along the path (be it inside or center or above-left), does it?
Ligatures are involved, yes, but they are not the cause of the warning. The warning is warranted because the |
Self-intersecting is allowed to enable representing an object as above with a tightly fitting polygon. Otherwise you would have to make the polygon larger than the object. So if an object has a portion that is one pixel wide, we would like to be able to cover that with a polygon that is one pixel wide at that position. Of course other applications can be more strict.
But you said earlier yourself (as a prelude to my PR introducing above quotes) that in PAGE-XML, paths must not self-intersect!?
That's true only under the path-refers-to-inside-polygon convention you described above, but not under the more natural (and much easier to implement) pixel-below-right or pixel-center conventions (which are pervasive across polygon/imaging libraries BTW). I would even argue that your explanation does not actually fit what Aletheia did here. I plotted the image along with the polygon contour (as IMHO what we see here is that:
Not sure about that example, but I'm sure about the interpretation Aletheia uses 🙂
It's not equivalent to either of the others. With pixel-below-right it has in common that it's easy to implement. With path-refers-to-inside-polygon it has in common that zero nominal area can still denote more than zero pixels.
The example is not important in itself, only for diagnostics. (And as I said, I can produce many more.) It is striking that whatever process/function produced polygons in But it does matter what interpretation PAGE-XML itself should have. If it really is to be path-refers-to-inside-polygon, then what about baselines and grid points? They have no inside or outside. And what about directionality (inner=left vs inner=right), would that not matter equally? |
Sorry, yes, they are functionally equivalent. So you are saying (functionally), Aletheia uses pixel-center coordinates? So for you (or Aletheia) the correct path for this polygon example would be |
And to get back to our diagnostic example: This is from Aletheia itself. My interpretation of this apparent error (notice how the red line misses the foreground by 1px to the bottom and right of the glyphs) is that maybe the annotated polygon has been calculated under the pixel-center convention, but the display uses pixel-below-right.
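One possible reading of that 1px miss, as a minimal one-dimensional sketch. The function `covered_columns` and the pixel numbers are hypothetical and only illustrate the two conventions named above; they are not taken from Aletheia or from the example file.

```python
def covered_columns(x_min, x_max, convention):
    """Pixel columns covered by a span whose path coordinates run x_min..x_max."""
    if convention == "pixel-center":
        # Coordinates name pixel centers, so both end pixels belong to the span.
        return list(range(x_min, x_max + 1))
    if convention == "pixel-below-right":
        # Coordinates name a pixel's top-left corner, so the span is half-open.
        return list(range(x_min, x_max))
    raise ValueError(f"unknown convention: {convention}")


# Hypothetical glyph occupying columns 2, 3 and 4, annotated tightly
# under the pixel-center convention.
annotated = (2, 4)
print(covered_columns(*annotated, "pixel-center"))       # [2, 3, 4]
print(covered_columns(*annotated, "pixel-below-right"))  # [2, 3]
```

Reading the same numbers under pixel-below-right drops the last column, which would look like a contour that misses the foreground by one pixel on the right (and, by the same argument in `y`, at the bottom).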
When I run the OCR-D PAGE-XML validator on `pagecontent/examples/aletheiaexamplepage.xml`, activating `--page-textequiv-consistency=strict`, I get this report:

XML ordering

error report...

Looking in the XML, indeed the element ordering of those glyphs is wrong w.r.t. their textequiv ordering (and bounding boxes).

If I then look at `--check-coords`, I get tons of errors. Here's one example of each type:

coordinate self-intersection

error report...

This is too small to be visible (look for the word `to` at the bottom-left of the page), but it's true: there's a self-intersection at `x=238`, because the polygon already has a line `y=4542..4547` (from the implicit ring), so `y=4543` (the second last point) hits it.

This is measured and reported by Shapely.
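
A minimal sketch of how such a case shows up in Shapely. The coordinates below are hypothetical: they only mimic the situation described (a closing segment along `x=238` that the second last point already lies on) and are not copied from the example file.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Hypothetical points, NOT taken from aletheiaexamplepage.xml.
# The ring is closed implicitly from the last point back to the first,
# which adds a segment along x=238 from y=4542 to y=4547.
# The second last point (238, 4543) already lies on that segment.
points = [(238, 4547), (250, 4547), (250, 4543), (238, 4543), (238, 4542)]
poly = Polygon(points)

print(poly.is_valid)           # False: the ring overlaps itself along x=238
print(explain_validity(poly))  # human-readable reason reported by GEOS
```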

Notably, most of those self-intersections go away when you `.simplify(1.0, preserve_topology=True)`, so they are tiny.

→ How precise should we interpret PAGE's `PointsType` non-intersection requirement?

PAGE-XML/pagecontent/schema/pagecontent.xsd, line 479 in 9b4b3c0
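
The tolerance idea above could be sketched roughly like this. The helper `classify` and its return labels are hypothetical, not part of the validator; only `is_valid` and `simplify(tolerance, preserve_topology=True)` are actual Shapely calls.

```python
from shapely.geometry import Polygon


def classify(polygon, tolerance=1.0):
    """Label a polygon as valid, valid only after simplification, or invalid."""
    if polygon.is_valid:
        return "valid"
    # Same call as quoted above: drop vertices that deviate by less than
    # `tolerance` from the simplified outline.
    simplified = polygon.simplify(tolerance, preserve_topology=True)
    if simplified.is_valid and not simplified.is_empty:
        return "valid after simplify"
    return "invalid"


square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
bowtie = Polygon([(0, 0), (10, 10), (10, 0), (0, 10)])  # crosses itself at (5, 5)

print(classify(square))  # valid
print(classify(bowtie))
```

A three-way label like this would let a report separate rounding-sized problems from real ones, which is the distinction the question above is asking about.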

Examples where self-intersection is still due to rounding only, but cannot be solved by just `.simplify()`:

error report...

child not within parent
error report...
Again, this is too small to be visible (last word of the page), but this is true as well: the word element is not properly contained in its parent textline.

In most cases though, we do get parent-child subordination when we keep a more slack rein via `.buffer(1.5)`, so these errors are also tiny.

→ How precise should we interpret PAGE's `PointsType` parent closure requirement?

PAGE-XML/pagecontent/schema/pagecontent.xsd, line 474 in 9b4b3c0
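
As a rough illustration of that slack, here is a sketch with made-up boxes. The helper `within_parent` and all coordinates are hypothetical; `within` and `buffer` are the Shapely calls referred to above.

```python
from shapely.geometry import Polygon


def within_parent(child, parent, slack=0.0):
    """Check containment, optionally growing the parent by `slack` pixels."""
    region = parent.buffer(slack) if slack > 0 else parent
    return child.within(region)


# Hypothetical text line and two words (not from the example file).
line = Polygon([(0, 0), (100, 0), (100, 20), (0, 20)])
word_off_by_one = Polygon([(90, 0), (101, 0), (101, 20), (90, 20)])   # 1px too wide
word_off_by_five = Polygon([(90, 0), (105, 0), (105, 20), (90, 20)])  # 5px too wide

print(within_parent(word_off_by_one, line))              # False
print(within_parent(word_off_by_one, line, slack=1.5))   # True
print(within_parent(word_off_by_five, line, slack=1.5))  # False
```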

Examples of where you cannot fix this with a small tolerance:

- `c761` (last letter in `Document` in title)
- `w672` (word `3.03` near the bottom of the page)

@tboenig @kba I added these 2 workarounds to my current validator-related PR here.