@bertsky
Collaborator

@bertsky bertsky commented Oct 28, 2020

Fixes a bug when the input lines are already empty.

I'll shortly append more fixes in the follow-up:

For example, I found that resegment does not cope well with large overlaps. The current algorithm simply fetches the largest line (optimising locally), without checking the neighbouring choices (optimising globally). See:

  • input: resegment-in
  • output: resegment-out
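
To illustrate the difference between optimising locally and globally, here is a minimal sketch. It is not the resegment code: the overlap matrix and the names `greedy_choice` and `global_choice` are hypothetical, and the global variant is a brute-force search that only suits tiny inputs with at least as many candidates as input lines.

```python
from itertools import permutations

def greedy_choice(overlap):
    """Local: each input line (row) takes the candidate (column) it overlaps most."""
    return [max(range(len(row)), key=row.__getitem__) for row in overlap]

def global_choice(overlap):
    """Global: one distinct candidate per input line, maximising the total overlap."""
    n_candidates = len(overlap[0])
    return max(
        permutations(range(n_candidates), len(overlap)),
        key=lambda choice: sum(row[c] for row, c in zip(overlap, choice)),
    )

# Hypothetical overlap areas: rows are input lines, columns are new candidate lines.
overlap = [[5, 4],
           [6, 1]]
local_result = greedy_choice(overlap)    # both input lines grab candidate 0
global_result = global_choice(overlap)   # candidates are distributed without conflict
```

In the local result both input lines claim the same candidate, which is the failure mode with large overlaps. For realistic sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.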

@wrznr

wrznr commented Oct 30, 2020

Good catch. What does the output look like with your fix?

@bertsky
Collaborator Author

bertsky commented Oct 30, 2020

> Good catch. What does the output look like with your fix?

No fix yet. I just avoid the combination tesserocr-segment-line + cis-ocropy-resegment in favour of direct cis-ocropy-segment (which works reasonably well even on the above warped lines).

Robert Sachunsky added 5 commits November 5, 2020 01:15
(Do not ignore the foreground of neighbouring regions/segments
 during page or line segmentation if they actually cover all of
 the image. This can happen when bad previous segmentation made
 neighbours overlap heavily.)
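
A minimal sketch of the rule in this commit message, assuming boolean NumPy masks. The function name `suppress_neighbours` and its arguments are hypothetical, not the project's API.

```python
import numpy as np

def suppress_neighbours(foreground, neighbour_masks):
    """Blank out neighbours' pixels, except for neighbours covering the whole image."""
    result = foreground.copy()
    for mask in neighbour_masks:
        if mask.all():
            # Heavily overlapping neighbour: suppressing it would erase everything.
            continue
        result[mask] = False
    return result

foreground = np.array([[True, True, False],
                       [True, False, True]])
partial = np.array([[False, False, False],
                    [True, True, True]])    # covers only the second row
covering = np.ones_like(foreground)         # covers the entire image
cleaned = suppress_neighbours(foreground, [partial, covering])
```

Only the partial neighbour is masked out here, so the first row of the foreground survives.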
(Do not ignore the foreground of neighbouring regions
 during line segmentation when already using a clipped
 derived image for the region. Clipping applies to
 foreground components selectively, while suppression
 would mask out the whole segment's outline.)
@lgtm-com

lgtm-com bot commented Nov 5, 2020

This pull request introduces 1 alert when merging 0549577 into c3fad1a - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@bertsky
Collaborator Author

bertsky commented Nov 5, 2020

I am thinking about removing the hmerge_line_seeds functionality for region-level segmentation. As it stands, it is buggy, and scipy.ndimage.measurements.center_of_mass sometimes (depending on random seed!) fails with FloatingPointError (instead of silent inf or nan degradation). Also, it's not strictly needed to have text lines stretch across regions horizontally. IMO it would be ok to have multiple text lines right of each other when there are large gaps. This is sometimes over-segmentation of the benign kind ("allowable split", not affecting reading order), sometimes even correct.

The only place where that would matter is resegmentation – we need to decide which Ocropy text line to keep per input text line (and cannot keep multiple in the current scheme). But as stated in the opening example, I'll have to rewrite resegmentation anyway...
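
The difference between a FloatingPointError and silent nan degradation comes down to NumPy's floating-point error state. The sketch below uses a plain-NumPy stand-in for the centre-of-mass computation rather than the SciPy function, and it assumes the failure comes from a zero-mass input, which the comment above does not confirm. `center_of_mass` and `safe_center_of_mass` are hypothetical helpers.

```python
import numpy as np

def center_of_mass(weights):
    """Plain-NumPy centre of mass (row, column) of a 2D array."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    rows, cols = np.indices(weights.shape)
    return (rows * weights).sum() / total, (cols * weights).sum() / total

def safe_center_of_mass(weights):
    """Return None instead of dividing by a zero total mass."""
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0:
        return None
    return center_of_mass(weights)

empty = np.zeros((3, 4))

# With invalid operations ignored, 0/0 silently degrades to nan.
with np.errstate(invalid="ignore"):
    silent = center_of_mass(empty)

# With invalid operations raised, the same call fails loudly.
try:
    with np.errstate(invalid="raise"):
        center_of_mass(empty)
    raised = False
except FloatingPointError:
    raised = True
```

Guarding against the zero-mass case up front, as in `safe_center_of_mass`, avoids depending on whichever error state happens to be active.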

Robert Sachunsky added 3 commits November 6, 2020 07:44
(During line segmentation, when separators/neighbours of
 the text region are being suppressed, annotate a clipped
 derived image for that region, too. This is analogous to
 page segmentation, which already annotates a derived image
 with non-text suppressed.)
(When suppressing intruders from neighbours, make sure they
 are not suppressed in those neighbours as well.)
(When clipping intruders from neighbours, allow suppressing
 the foreground completely, without thresholding, but
 avoid clipping if the segment's outline is properly contained
 in the neighbour, because it then cannot have any independent
 foreground at all.)
@lgtm-com

lgtm-com bot commented Nov 6, 2020

This pull request introduces 1 alert when merging b871e8f into c3fad1a - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@lgtm-com

lgtm-com bot commented Nov 6, 2020

This pull request introduces 1 alert when merging 6714823 into c3fad1a - view on LGTM.com

new alerts:

  • 1 for Unused local variable

- when including existing segments in the recursive XY-cut
  (lines2regions) ordering/segmentation, avoid grouping
  them together with actual new text lines, and treat them
  like existing region assignments properly
- simplify decoding into PAGE regions
- when polygonalizing new regions from label masks, avoid
  creating hulls if these would create additional overlaps
  with existing regions
@bertsky
Collaborator Author

bertsky commented Nov 10, 2020

> I am thinking about removing the hmerge_line_seeds functionality for region-level segmentation. As it stands, it is buggy, and scipy.ndimage.measurements.center_of_mass sometimes (depending on random seed!) fails with FloatingPointError (instead of silent inf or nan degradation). Also, it's not strictly needed to have text lines stretch across regions horizontally.

Decided against that, because horizontal over-segmentation also introduces extra newlines. Improved the hmerge functionality instead.

@lgtm-com

lgtm-com bot commented Jul 9, 2021

This pull request fixes 1 alert when merging 1a33f94 into c3fad1a - view on LGTM.com

fixed alerts:

  • 1 for Nested loops with same variable

@lgtm-com

lgtm-com bot commented Jul 15, 2021

This pull request fixes 1 alert when merging 6f8a612 into c3fad1a - view on LGTM.com

fixed alerts:

  • 1 for Nested loops with same variable

Robert Sachunsky added 7 commits February 2, 2022 13:43
in `finalize`, if predefined region labels are present, when re-ordering
the slice's old and new zones and assigning textlines to them,
- calculate the order based on fg relationships, not bg
- make sure textlines are assigned to their majority zone
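
Assigning textlines to their majority zone can be sketched on label images as follows. This assumes integer label arrays where 0 means background. The name `majority_zone` is hypothetical and does not reflect the actual `finalize` code.

```python
import numpy as np

def majority_zone(line_labels, zone_labels):
    """Map each line label to the zone label that covers most of its pixels."""
    assignment = {}
    for line in np.unique(line_labels):
        if line == 0:
            continue  # skip background
        votes = np.bincount(zone_labels[line_labels == line])
        # Zone 0 may win if the line lies mostly outside every zone.
        assignment[int(line)] = int(votes.argmax())
    return assignment

line_labels = np.array([[1, 1, 1, 1],
                        [2, 2, 2, 2]])
zone_labels = np.array([[1, 1, 1, 2],
                        [1, 2, 2, 2]])
assignment = majority_zone(line_labels, zone_labels)
```

Each line straddles both zones in this toy input, and the vote settles it by pixel count.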
when merging textlines within text regions horizontally,
- do not only respect existing regions and fg separators
  (in blocking merges), but also bg separators
- enlarge the region mask to the newly merged line bg
when trying to partition slices by separators,
- also treat pre-existing regions like separators, and
- fix the condition on smallest allowed partitions
  (insignificant but complete lines)
when no cut or separator-split partition can be found for
the current slice, then attempt to find another separator-split
by grouping lines along their mutual horizontal neighbourship
with fg separators;
repeatedly allow both kinds of partitioning, if interspersed
@lgtm-com

lgtm-com bot commented Feb 2, 2022

This pull request introduces 2 alerts and fixes 1 when merging 529f7f5 into c3fad1a - view on LGTM.com

new alerts:

  • 2 for Unused local variable

fixed alerts:

  • 1 for Nested loops with same variable

Robert Sachunsky added 9 commits February 24, 2022 15:48
- calculate connected component analysis
- calculate distance transform of existing labels
- find new line seeds by flattening existing labels
  (via maximum distance)
- propagate line seeds across connected components
  (by majority in case of conflict)
- spread ccomps labels against each other into background
- for each line,
  * if enough background and foreground will be retained
  * find the hull polygon of the new line via alpha shape
  * annotate as new coordinates
- calculate connected component analysis
- find new line seeds based on the existing baselines
  (by applying dilation above)
- propagate line seeds across connected components
  (by majority in case of conflict)
- spread ccomps labels against each other into the background
- for each line,
  * if enough background and foreground will be retained
  * find the hull polygon of the new line via alpha shape
  * annotate as new coordinates
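
The steps shared by both commit messages above (connected components, majority propagation of seeds, spreading labels into the background) can be sketched with `scipy.ndimage`. This is an illustration under assumptions, not the committed implementation: the seeds are passed in ready-made rather than derived from a distance transform or from baselines, the retention check and the alpha-shape hull are omitted, and `resegment_sketch` is a hypothetical name.

```python
import numpy as np
from scipy import ndimage

def resegment_sketch(foreground, seeds):
    """foreground: boolean array; seeds: integer line labels (0 = no seed)."""
    # 1. connected component analysis of the foreground
    components, n_components = ndimage.label(foreground)
    # 2. propagate line seeds across components, by majority in case of conflict
    lines = np.zeros_like(seeds)
    for comp in range(1, n_components + 1):
        mask = components == comp
        votes = np.bincount(seeds[mask])
        votes[0] = 0  # unseeded pixels do not vote
        if votes.any():
            lines[mask] = votes.argmax()
    if not lines.any():
        return lines  # nothing to spread
    # 3. spread labels into the background: every unlabelled pixel
    #    takes the label of its nearest labelled pixel
    nearest = ndimage.distance_transform_edt(
        lines == 0, return_distances=False, return_indices=True)
    return lines[tuple(nearest)]

foreground = np.array([[1, 1, 0, 1, 1],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0],
                       [1, 1, 1, 0, 1]], dtype=bool)
seeds = np.array([[1, 1, 0, 0, 1],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [2, 2, 1, 0, 0]])
labels = resegment_sketch(foreground, seeds)
```

In this toy input the bottom-left component carries conflicting seeds and goes to the majority label, while the unseeded component at the bottom right is absorbed by its nearest labelled neighbour during spreading.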
@lgtm-com

lgtm-com bot commented Mar 4, 2022

This pull request introduces 4 alerts and fixes 1 when merging 14e27e1 into c3fad1a - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 2 for Unused import

fixed alerts:

  • 1 for Nested loops with same variable

@bertsky
Collaborator Author

bertsky commented Mar 4, 2022

@finkf this is ready for merging (and releasing) IMO.

If you want me to do it myself, just give me a cue.

@lgtm-com

lgtm-com bot commented Mar 4, 2022

This pull request introduces 2 alerts and fixes 1 when merging c2f9203 into c3fad1a - view on LGTM.com

new alerts:

  • 2 for Unused import

fixed alerts:

  • 1 for Nested loops with same variable

@finkf finkf merged commit a30ce3b into cisocrgroup:master Mar 4, 2022
@bertsky bertsky mentioned this pull request Mar 12, 2022