pagecontent example is broken #23

Open
bertsky opened this issue May 10, 2020 · 10 comments

Comments

@bertsky
Contributor

bertsky commented May 10, 2020

When I run the OCR-D PAGE-XML validator on pagecontent/examples/aletheiaexamplepage.xml, activating --page-textequiv-consistency=strict, I get this report:

XML ordering

error report...

18:52:45.293 INFO ocrd.page_validator - Validating input file 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml'
<report valid="false">
  <error>INCONSISTENCY in Word ID 'w410' of file 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': text results 'Typical' != concatenated 'Typicla'</error>
  <error>INCONSISTENCY in Word ID 'w411' of file 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': text results 'Workflows' != concatenated 'Workflosw'</error>
  <error>INCONSISTENCY in Word ID 'w505' of file 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': text results 'PRImA' != concatenated 'PRIAm'</error>
</report>

Looking in the XML, indeed the element ordering of those glyphs is wrong w.r.t. their textequiv ordering (and bounding boxes).
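The check behind these three errors can be sketched roughly as follows: concatenate the TextEquiv of each Glyph in element order and compare the result with the Word's own TextEquiv. This is only a minimal illustration, not the validator's actual code. The XML snippet is a hypothetical, namespace-free reduction of a PAGE Word; the glyph IDs are made up, and only the text values follow the report for w505 above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free reduction of a PAGE Word element.
# Glyph IDs are invented; the text values mirror the w505 report above.
WORD_XML = """
<Word id="w505">
  <Glyph id="g1"><TextEquiv><Unicode>P</Unicode></TextEquiv></Glyph>
  <Glyph id="g2"><TextEquiv><Unicode>R</Unicode></TextEquiv></Glyph>
  <Glyph id="g3"><TextEquiv><Unicode>I</Unicode></TextEquiv></Glyph>
  <Glyph id="g5"><TextEquiv><Unicode>A</Unicode></TextEquiv></Glyph>
  <Glyph id="g4"><TextEquiv><Unicode>m</Unicode></TextEquiv></Glyph>
  <TextEquiv><Unicode>PRImA</Unicode></TextEquiv>
</Word>
"""


def word_texts(word):
    """Return (text of the Word itself, concatenation of its Glyph texts)."""
    expected = word.findtext("TextEquiv/Unicode", default="")
    concatenated = "".join(
        glyph.findtext("TextEquiv/Unicode", default="")
        for glyph in word.findall("Glyph")  # element order, as stored in the file
    )
    return expected, concatenated


word = ET.fromstring(WORD_XML)
expected, concatenated = word_texts(word)
if expected != concatenated:
    print(f"INCONSISTENCY in Word ID '{word.get('id')}': "
          f"text results '{expected}' != concatenated '{concatenated}'")
```

Because the comparison depends only on element order, swapping the last two Glyph elements back would make the word pass.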

If I then look at --check-coords, I get tons of errors. Here's one example of each type:

coordinate self-intersection

error report...

<error>INVALIDITY in Word ID 'w598' of 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '238,4547 256,4547 256,4548 257,4548 257,4549 258,4549 258,4550 259,4550 259,4552 260,4552 260,4559 259,4559 259,4561 258,4561 258,4562 257,4562 257,4563 255,4563 255,4564 238,4564 238,4563 237,4563 237,4549 236,4549 236,4548 233,4548 233,4547 236,4547 236,4545 237,4545 237,4543 238,4543 238,4542' - Self-intersection[238 4542]</error>

This is too small to be visible (look for the word 'to' at the bottom-left of the page), but it's true: there's a self-intersection at x=238, because the polygon already has a segment at x=238 covering y=4542..4547 (the implicit closing segment of the ring, from the last point back to the first), so the second-to-last point at y=4543 lies on it.
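To make that argument concrete without any geometry library, here is a small pure-Python sketch that closes the ring implicitly and looks for vertices lying on an edge they do not belong to. It is deliberately narrower than a real validity check (it would miss two edges crossing between vertices), and the helper names are made up for illustration.

```python
# Coordinates of Word 'w598', copied from the error report above.
POINTS = (
    "238,4547 256,4547 256,4548 257,4548 257,4549 258,4549 258,4550 259,4550 "
    "259,4552 260,4552 260,4559 259,4559 259,4561 258,4561 258,4562 257,4562 "
    "257,4563 255,4563 255,4564 238,4564 238,4563 237,4563 237,4549 236,4549 "
    "236,4548 233,4548 233,4547 236,4547 236,4545 237,4545 237,4543 238,4543 "
    "238,4542"
)


def parse_points(points):
    """Turn a PAGE 'points' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def on_segment(p, a, b):
    """True if point p lies on the closed segment a-b (integer arithmetic)."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross != 0:
        return False
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def vertices_on_foreign_edges(ring):
    """List (vertex, edge) pairs where a vertex sits on an edge it is not part of."""
    hits = []
    n = len(ring)
    for i in range(n):
        a, b = ring[i], ring[(i + 1) % n]  # (i + 1) % n adds the closing edge
        for j, p in enumerate(ring):
            if j in (i, (i + 1) % n):
                continue  # skip the edge's own endpoints
            if on_segment(p, a, b):
                hits.append((p, (a, b)))
    return hits


hits = vertices_on_foreign_edges(parse_points(POINTS))
for vertex, edge in hits:
    print(f"vertex {vertex} lies on edge {edge}")
```

For w598 the only hit is the vertex 238,4543 on the closing edge from 238,4542 back to 238,4547.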

This is measured and reported by Shapely.

Notably, most of those self-intersections go away when you .simplify(1.0, preserve_topology=True), so they are tiny.
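For anyone who wants to reproduce this, a minimal sketch of the two steps (validity check, then simplification) using the w598 coordinates from the report above. It assumes Shapely is installed and is not the validator's actual code. Whether a given polygon becomes valid after simplifying varies, so the second result is printed rather than assumed.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Coordinates of Word 'w598', copied from the error report above.
POINTS = (
    "238,4547 256,4547 256,4548 257,4548 257,4549 258,4549 258,4550 259,4550 "
    "259,4552 260,4552 260,4559 259,4559 259,4561 258,4561 258,4562 257,4562 "
    "257,4563 255,4563 255,4564 238,4564 238,4563 237,4563 237,4549 236,4549 "
    "236,4548 233,4548 233,4547 236,4547 236,4545 237,4545 237,4543 238,4543 "
    "238,4542"
)

coords = [tuple(int(v) for v in pair.split(",")) for pair in POINTS.split()]
poly = Polygon(coords)  # Shapely closes the ring implicitly

# Step 1: strict check, as reported by the validator
print("original:  ", poly.is_valid, explain_validity(poly))

# Step 2: tolerate sub-pixel artifacts by simplifying with a 1px tolerance
simplified = poly.simplify(1.0, preserve_topology=True)
print("simplified:", simplified.is_valid, explain_validity(simplified))
```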

How precisely should we interpret PAGE's PointsType non-intersection requirement?

Paths must be planar (i.e. must not self-intersect).

Examples where self-intersection is still due to rounding only, but cannot be solved by just .simplify():

error report...

  <error>INVALIDITY in TextRegion ID 'r31' of 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '1578,3267 1651,3267 1651,3249 1679,3249 1679,3252 1777,3252 1777,3259 1940,3259 1940,3247 2008,3247 2008,3259 2171,3259 2171,3246 2186,3246 2186,3259 2372,3259 2372,3358 2073,3358 2073,3370 1957,3370 1957,3358 1678,3358 1678,3370 1653,3370 1653,3252 1651,3252 1651,3284 1578,3284' - Self-intersection[1651 3252]</error>
  <error>INVALIDITY in TextLine ID 'l158' of 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '2186,3246 2186,3259 2342,3259 2342,3260 2345,3260 2345,3261 2347,3261 2347,3262 2348,3262 2348,3263 2349,3263 2349,3266 2350,3266 2350,3272 2372,3272 2372,3275 2371,3275 2371,3276 2351,3276 2351,3289 2341,3289 2341,3290 2301,3290 2301,3301 2300,3301 2300,3302 2215,3302 2215,3301 2214,3301 2214,3290 1956,3290 1956,3301 1955,3301 1955,3302 1953,3302 1953,3290 1685,3290 1685,3289 1663,3289 1663,3284 1653,3284 1653,3252 1651,3252 1651,3284 1584,3284 1584,3283 1582,3283 1582,3282 1581,3282 1581,3281 1580,3281 1580,3280 1579,3280 1579,3277 1578,3277 1578,3274 1579,3274 1579,3271 1580,3271 1580,3270 1581,3270 1581,3269 1582,3269 1582,3268 1584,3268 1584,3267 1651,3267 1651,3249 1679,3249 1679,3252 1777,3252 1777,3259 1940,3259 1940,3247 2008,3247 2008,3259 2183,3259 2183,3246' - Self-intersection[1651 3252]</error>
  <error>INVALIDITY in TextRegion ID 'r33' of 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '107,3419 180,3419 180,3401 246,3401 246,3398 295,3398 295,3400 356,3400 356,3412 465,3412 465,3404 480,3404 480,3412 611,3412 611,3398 628,3398 628,3412 781,3412 781,3400 817,3400 817,3398 840,3398 840,3443 793,3443 793,3442 713,3442 713,3443 459,3443 459,3511 312,3511 312,3510 201,3510 201,3511 182,3511 182,3409 180,3409 180,3437 107,3437' - Self-intersection[180 3409]</error>
  <error>INVALIDITY in TextLine ID 'l110' of 'PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '294,3398 294,3399 295,3399 295,3400 356,3400 356,3412 471,3412 471,3405 473,3405 473,3404 474,3404 474,3412 625,3412 625,3399 626,3399 626,3398 628,3398 628,3412 781,3412 781,3400 817,3400 817,3399 818,3399 818,3398 820,3398 820,3412 837,3412 837,3413 838,3413 838,3441 839,3441 839,3442 840,3442 801,3442 801,3443 793,3443 793,3442 713,3442 713,3443 222,3443 222,3442 195,3442 195,3441 194,3441 194,3438 193,3438 193,3437 182,3437 182,3409 180,3409 180,3437 114,3437 114,3436 111,3436 111,3435 110,3435 110,3434 109,3434 109,3433 108,3433 108,3431 107,3431 107,3425 108,3425 108,3423 109,3423 109,3422 110,3422 110,3421 112,3421 112,3420 114,3420 114,3419 180,3419 180,3401 246,3401 246,3398' - Self-intersection[180 3409]</error>

child not within parent

error report...

  <error>INCONSISTENCY in TextLine ID 'w719' of '../PAGE-XML/pagecontent/examples/aletheiaexamplepage.xml': coords '2899,4822 2899,4833 2992,4833 2992,4834 3003,4834 3003,4860 3002,4860 3002,4864 3001,4864 3001,4865 3000,4865 3000,4867 2999,4867 2999,4868 2997,4868 2997,4869 2994,4869 2994,4870 2987,4870 2987,4869 2984,4869 2984,4860 2628,4860 2628,4870 2626,4870 2626,4869 2625,4869 2625,4859 2500,4859 2500,4858 2499,4858 2499,4855 2498,4855 2498,4852 2497,4852 2497,4849 2496,4849 2496,4846 2495,4846 2495,4843 2494,4843 2494,4840 2493,4840 2493,4836 2492,4836 2492,4834 2670,4834 2670,4823 2673,4823 2673,4833 2772,4833 2772,4834 2896,4834 2896,4833 2897,4833 2897,4822' not within parent coords '1628,4822 1628,4824 1685,4824 1685,4825 1689,4825 1689,4826 1690,4826 1690,4827 1839,4827 1839,4825 1959,4825 1959,4822 1961,4822 1961,4827 2117,4827 2117,4825 2178,4825 2178,4823 2248,4823 2248,4822 2250,4822 2250,4823 2387,4823 2387,4822 2390,4822 2390,4834 2670,4834 2670,4823 2673,4823 2673,4833 2772,4833 2772,4834 2896,4834 2896,4833 2897,4833 2897,4822 2899,4822 2899,4833 2992,4833 2992,4834 3003,4834 3003,4860 3002,4860 3002,4864 3001,4864 3001,4865 3000,4865 3000,4867 2999,4867 2999,4868 2997,4868 2997,4869 2994,4869 2994,4870 2987,4870 2987,4869 2984,4869 2984,4860 2724,4860 2724,4859 2641,4859 2641,4860 2628,4860 2628,4870 2626,4870 2626,4869 2625,4869 2625,4859 2471,4859 2471,4861 2470,4861 2470,4863 2469,4863 2469,4866 2466,4866 2466,4860 2362,4860 2362,4861 2361,4861 2361,4864 2360,4864 2360,4865 2359,4865 2359,4867 2358,4867 2358,4868 2356,4868 2356,4869 2353,4869 2353,4870 2347,4870 2347,4869 2343,4869 2343,4859 2241,4859 2241,4860 2092,4860 2092,4863 2091,4863 2091,4865 2090,4865 2090,4866 2088,4866 2088,4860 1885,4860 1885,4859 1800,4859 1800,4860 1639,4860 1639,4863 1638,4863 1638,4866 1635,4866 1635,4860 1400,4860 1400,4861 1399,4861 1399,4864 1398,4864 1398,4866 1397,4866 1397,4867 1396,4867 1396,4869 1394,4869 1394,4870 1390,4870 1390,4860 1307,4860 
1307,4859 1227,4859 1227,4860 1219,4860 1219,4859 1216,4859 1216,4858 1214,4858 1214,4857 1213,4857 1213,4856 1212,4856 1212,4855 1211,4855 1211,4854 1210,4854 1210,4851 1209,4851 1209,4825 1270,4825 1270,4823 1461,4823 1461,4822'</error>

Again, this is too small to be visible (last word of the page), but this is true as well: the word element is not properly contained in its parent textline.

In most cases though, we do get parent-child subordination when we keep a more slack rein via .buffer(1.5), so these errors are also tiny.
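The idea of the tolerance can be sketched as follows, again assuming Shapely. The shapes are toy rectangles invented for illustration, not taken from the example file: one child overshoots its parent by 1px, the other by 5px.

```python
from shapely.geometry import Polygon, box

# Toy shapes (hypothetical): a 100x20 parent line and two child words
parent = box(0, 0, 100, 20)
child_1px = Polygon([(90, 5), (101, 5), (101, 15), (90, 15)])  # sticks out by 1px
child_5px = Polygon([(90, 5), (105, 5), (105, 15), (90, 15)])  # sticks out by 5px

# Strict reading of "no points may lie outside the outline of its parent"
strict = child_1px.within(parent)

# Slack reading: grow the parent by 1.5px before testing containment
tolerant = child_1px.within(parent.buffer(1.5))
still_outside = not child_5px.within(parent.buffer(1.5))

print(strict, tolerant, still_outside)
```

With these toy shapes the 1px overshoot fails the strict test but passes the buffered one, while the 5px overshoot is still rejected.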

How precisely should we interpret PAGE's PointsType parent closure requirement?

No points may lie outside the outline of its parent,

Examples of where you cannot fix this with a small tolerance:

c761 (last letter in Document in title)
aletheiaexamplepage error c761

w672 (word 3.03 near the bottom of the page)
aletheiaexamplepage error w672

@tboenig @kba I added these 2 workarounds to my current validator-related PR here.

@chris1010010
Contributor

So it's a good real-world example then ;P

@chris1010010
Contributor

By the way, we interpret a rasterised polygon to be inclusive. That means the correct geometric polygon follows the outside of the pixels. That's why we don't filter out touching polygon lines, as long as they don't cross.
The text validation warning is related to ligatures. One of the glyphs is a ligature and has one character as text content. The parent word has the expanded version with two characters. Some users might want to use it that way.
image

@bertsky
Contributor Author

bertsky commented May 11, 2020

By the way, we interpret a rasterised polygon to be inclusive. That means the correct geometric polygon follows the outside of the pixels. That's why we don't filter out touching polygon lines, as long as they don't cross.

Understood. (Perhaps you want to include that explanation in the schema's documentation.)

But that still does not relieve you of avoiding self-intersection along the path (be it inside or center or above-left), does it?

The text validation warning is related to ligatures. One of the glyphs is a ligature and has one character as text content. The parent word has the expanded version with two characters. Some users might want to use it that way.

Ligatures are involved, yes, but they are not the cause of the warning. The warning is warranted because the Glyph elements are not correctly ordered. (But the ligature might be the reason why Aletheia produces this wrong order.) For example, within w410 'Typical', c843 'l' comes before c842 'a'. Concatenating the glyphs' TextEquiv sequence then does not yield the word's TextEquiv.

@chris1010010
Contributor

Self-intersecting is allowed to enable representing an object as above with a tightly fitting polygon. Otherwise you would have to make the polygon larger than the object. So if an object has a portion that is one pixel wide, we would like to be able to cover that with a polygon that is one pixel wide at that position. Of course other applications can be more strict.

@bertsky
Contributor Author

bertsky commented May 11, 2020

Self-intersecting is allowed to enable representing an object as above with a tightly fitting polygon.

But you said earlier yourself (as a prelude to my PR introducing the above quotes) that in PAGE-XML, paths must not self-intersect!?

Otherwise you would have to make the polygon larger than the object.

That's true only under the path-refers-to-inside-polygon convention you described above, but not under the more natural (and much easier to implement) pixel-below-right or pixel-center conventions (which are pervasive across polygon/imaging libraries BTW).

I would even argue that your explanation does not actually fit what Aletheia did here. I plotted the image along with the polygon contour (as skimage.draw.polygon_perimeter would have it) from the XML:

aletheiaexamplepage error w598

IMHO what we see here is that:

  1. The annotated polygon does fit the pixel-below-right convention. (You can tell because scikit-image uses that convention, and most of the polygon lies inside the black foreground, and equally so on all sides.)
  2. The error reported by our validator has nothing to do with the question of coordinate conventions and the border-case of touching the same point again: 238,4547 ... ... 237,4545 237,4543 238,4543 238,4542, that's a cross. (It's the tip of the t, going up 2px, then right 1px, then up 1px, and then down 5px by ring convention.)

@chris1010010
Contributor

Not sure about that example, but I'm sure about the interpretation Aletheia uses 🙂
What's natural is different to different people. The center-pixel approach is equivalent, isn't it? Below-right has its merit for sure

@bertsky
Contributor Author

bertsky commented May 11, 2020

The center-pixel approach is equivalent, isn't it?

It's not equivalent to either of the others. With pixel-below-right it has in common that it's easy to implement. With path-refers-to-inside-polygon it has in common that zero nominal area can still denote more than zero pixels.

Not sure about that example, but I'm sure about the interpretation Aletheia uses

The example is not important in itself, only for diagnostics. (And as I said, I can produce many more.) It is striking that whatever process/function produced polygons in aletheiaexamplepage.xml gave a pixel-accurate hull under the pixel-below-right but not the path-refers-to-inside-polygon convention (which would show as yellow gloss right-of/under the o).

But it does matter what interpretation PAGE-XML itself should have. If it really is to be path-refers-to-inside-polygon, then what about baselines and grid points? They have no inside or outside. And what about directionality (inner=left vs inner=right), would that not matter equally?

@chris1010010
Contributor

I just meant they are equivalent in terms of coordinates and which pixels are included.
In our tools, when a polygon is created from a pixel component, the polygon follows the outside pixel coordinates.
Unfortunately I don't have much time at the moment and can't look into this in more detail.
image

@bertsky
Contributor Author

bertsky commented May 12, 2020

I just meant they are equivalent in terms of coordinates and which pixels are included.

Sorry, yes, they are functionally equivalent. So you are saying (functionally), Aletheia uses pixel-center coordinates?

So for you (or Aletheia) the correct path for this polygon example would be 0,0 3,0 3,2 0,2 0,4 1,4 1,5 0,5, but for me (or skimage/scipy.ndimage/OpenCV/...) the correct path for this example would be 0,0 4,0 4,3 1,3 1,4 2,4 2,6 0,6 – right?
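To compare the two paths numerically, here is a small pure-Python sketch. The pixel set is an assumption: it is inferred from the two coordinate lists themselves (a 4x3 block, a 1px-wide neck, a 2x2 foot), since the image is not reproduced here. The sketch computes the nominal area of each path with the shoelace formula and compares it with the number of foreground pixels.

```python
# Foreground pixels (x, y) of the example shape.
# ASSUMPTION: inferred from the two paths above, not read from the image.
pixels = {(x, y) for y in range(0, 3) for x in range(0, 4)}   # 4x3 block
pixels |= {(0, 3)}                                             # 1px-wide neck
pixels |= {(x, y) for y in range(4, 6) for x in range(0, 2)}  # 2x2 foot


def shoelace_area(path):
    """Nominal area enclosed by a closed path (ring closed implicitly)."""
    total = 0
    for (x0, y0), (x1, y1) in zip(path, path[1:] + path[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


# Path through pixel centers (coordinates name the pixels themselves)
center_path = [(0, 0), (3, 0), (3, 2), (0, 2), (0, 4), (1, 4), (1, 5), (0, 5)]
# Path along the outer pixel edges (pixel-below-right convention)
outline_path = [(0, 0), (4, 0), (4, 3), (1, 3), (1, 4), (2, 4), (2, 6), (0, 6)]

print("foreground pixels:", len(pixels))
print("outline path area:", shoelace_area(outline_path))
print("center path area: ", shoelace_area(center_path))
```

Under the assumed shape, the outline path encloses exactly as much area as there are foreground pixels (17), whereas the center path has a nominal area of only 7 and collapses to zero width along the neck at x=0, where the segment 0,2 to 0,4 overlaps the closing segment. That overlap is the kind of touching path a strict validity check flags.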

@bertsky
Contributor Author

bertsky commented May 12, 2020

And to get back to our diagnostic example:

w598-aletheia-contour-shot

This is from Aletheia itself. My interpretation of this apparent error (notice how the red line misses the foreground by 1px to the bottom and right of the glyphs) is that maybe the annotated polygon has been calculated under the pixel-center convention, but the display uses pixel-below-right.
