cisocrgroup · finkf · May 11, 2020 · May 9, 2020 · May 9, 2020 · May 9, 2020
diff --git a/README.md b/README.md
@@ -111,7 +111,7 @@ Arguments:
  * `--mets` path to METS file in the workspace
 
 ### ocrd-cis-ocropy-train
-The ocropy-train tool can be used to train LSTM models.
+The `ocropy-train` tool can be used to train LSTM models.
 It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.
 Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.
 ```sh
@@ -122,8 +122,9 @@ ocrd-cis-ocropy-train \
 ```
 
 ### ocrd-cis-ocropy-clip
-The ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace.
-It runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
+It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
+(Use this to suppress separators and neighbouring text.)
 ```sh
 ocrd-cis-ocropy-clip \
   --input-file-grp OCR-D-SEG-LINE \
@@ -133,8 +134,9 @@ ocrd-cis-ocropy-clip \
 ```
 
 ### ocrd-cis-ocropy-resegment
-The ocropy-resegment tool can be used to remove overlap between lines of a workspace.
-It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
+The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
+It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
+(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
 ```sh
 ocrd-cis-ocropy-resegment \
   --input-file-grp OCR-D-SEG-LINE \
@@ -144,8 +146,9 @@ ocrd-cis-ocropy-resegment \
 ```
 
 ### ocrd-cis-ocropy-segment
-The ocropy-segment tool can be used to segment regions into lines.
-It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.
+The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
+It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
+(Does not detect tables or images.)
 ```sh
 ocrd-cis-ocropy-segment \
   --input-file-grp OCR-D-SEG-BLOCK \
@@ -155,8 +158,9 @@ ocrd-cis-ocropy-segment \
 ```
 
 ### ocrd-cis-ocropy-deskew
-The ocropy-deskew tool can be used to deskew pages / regions of a workspace.
-It runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
+The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
+It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
+(Does not include orientation detection.)
 ```sh
 ocrd-cis-ocropy-deskew \
   --input-file-grp OCR-D-SEG-LINE \
@@ -166,8 +170,8 @@ ocrd-cis-ocropy-deskew \
 ```
 
 ### ocrd-cis-ocropy-denoise
-The ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace.
-It runs the Ocropy "nlbin" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
+It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
 ```sh
 ocrd-cis-ocropy-denoise \
   --input-file-grp OCR-D-SEG-LINE-DES \
@@ -177,8 +181,8 @@ ocrd-cis-ocropy-denoise \
 ```
 
 ### ocrd-cis-ocropy-binarize
-The ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace.
-It runs the Ocropy "nlbin" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
+The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
+It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
 ```sh
 ocrd-cis-ocropy-binarize \
   --input-file-grp OCR-D-SEG-LINE-DES \
@@ -188,8 +192,8 @@ ocrd-cis-ocropy-binarize \
 ```
 
 ### ocrd-cis-ocropy-dewarp
-The ocropy-dewarp tool can be used to dewarp text lines of a workspace.
-It runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
+The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
+It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
 ```sh
 ocrd-cis-ocropy-dewarp \
   --input-file-grp OCR-D-SEG-LINE-BIN \
@@ -199,8 +203,8 @@ ocrd-cis-ocropy-dewarp \
 ```
 
 ### ocrd-cis-ocropy-recognize
-The ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace.
-It runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
+The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
+It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
 ```sh
 ocrd-cis-ocropy-recognize \
   --input-file-grp OCR-D-SEG-LINE-DEW \
@@ -232,18 +236,19 @@ place them into: /usr/share/tesseract-ocr/4.00/tessdata
 
 A decent pipeline might look like this:
 
+0. page-level binarization
 1. page-level cropping
-2. page-level binarization
+2. (page-level binarization)
 3. page-level deskewing
-4. page-level dewarping
+4. (page-level dewarping)
 5. region segmentation
 6. region-level clipping
-7. region-level deskewing
+7. (region-level deskewing)
 8. line segmentation
-9. line-level clipping or resegmentation
+9. (line-level clipping or resegmentation)
 10. line-level dewarping
 11. line-level recognition
-12. line-level alignment
+12. (line-level alignment and post-correction)
 
 If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
 

diff --git a/ocrd_cis/ocrd-tool.json b/ocrd_cis/ocrd-tool.json
@@ -1,6 +1,6 @@
 {
 	"git_url": "https://github.com/cisocrgroup/ocrd_cis",
-	"version": "0.0.6",
+	"version": "0.0.7",
 	"tools": {
 		"ocrd-cis-ocropy-binarize": {
 			"executable": "ocrd-cis-ocropy-binarize",
@@ -27,21 +27,29 @@
 				"method": {
 					"type": "string",
 					"enum": ["none", "global", "otsu", "gauss-otsu", "ocropy"],
-					"description": "binarization method to use (only ocropy will include deskewing)",
+					"description": "binarization method to use (only 'ocropy' will include deskewing and denoising)",
 					"default": "ocropy"
 				},
+				"threshold": {
+					"type": "number",
+					"format": "float",
+					"description": "for the 'ocropy' and 'global' methods, black/white threshold to apply on the whitelevel normalized image (the larger the more/heavier foreground)",
+					"default": 0.5
+				},
 				"grayscale": {
 					"type": "boolean",
-					"description": "for the ocropy method, produce grayscale-normalized instead of thresholded image",
+					"description": "for the 'ocropy' method, produce grayscale-normalized instead of thresholded image",
 					"default": false
 				},
 				"maxskew": {
 					"type": "number",
-					"description": "modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)",
+					"format": "float",
+					"description": "modulus of maximum skewing angle (in degrees) to detect (larger will be slower, 0 will deactivate deskewing)",
 					"default": 0.0
 				},
 				"noise_maxsize": {
 					"type": "number",
+					"format": "integer",
 					"description": "maximum pixel number for connected components to regard as noise (0 will deactivate denoising)",
 					"default": 0
 				},
@@ -297,47 +305,82 @@
 			"output_file_grp": [
 				"OCR-D-SEG-LINE"
 			],
-			"description": "Segment pages into regions or regions into lines with ocropy",
+			"description": "Segment pages into regions and lines, tables into cells and lines, or regions into lines with ocropy",
 			"parameters": {
 				"dpi": {
 					"type": "number",
 					"format": "float",
-					"description": "pixel density in dots per inch (overrides any meta-data in the images); disabled when negative",
+					"description": "pixel density in dots per inch (overrides any meta-data in the images); disabled when negative; when disabled and no meta-data is found, 300 is assumed",
 					"default": -1
 				},
 				"level-of-operation": {
 					"type": "string",
-					"enum": ["page", "region"],
-					"description": "PAGE XML hierarchy level to read images from",
+					"enum": ["page", "table", "region"],
+					"description": "PAGE XML hierarchy level to read images from and add elements to",
 					"default": "region"
 				},
 				"maxcolseps": {
 					"type": "number",
 					"format": "integer",
-					"default": 2,
-					"description": "number of white/background column separators to try (when operating on the page level)"
+					"default": 20,
+					"description": "(when operating on the page/table level) maximum number of white/background column separators to detect, counted piece-wise"
 				},
 				"maxseps": {
 					"type": "number",
 					"format": "integer",
-					"default": 5,
-					"description": "number of black/foreground column separators to try, counted individually as lines (when operating on the page level)"
+					"default": 20,
+					"description": "(when operating on the page/table level) maximum number of black/foreground column separators to detect (and suppress), counted piece-wise"
+				},
+				"maximages": {
+					"type": "number",
+					"format": "integer",
+					"default": 10,
+					"description": "(when operating on the page level) maximum number of black/foreground very large components to detect (and suppress), counted piece-wise"
+				},
+				"csminheight": {
+					"type": "number",
+					"format": "integer",
+					"default": 4,
+					"description": "(when operating on the page/table level) minimum height of white/background or black/foreground column separators in multiples of scale/capheight, counted piece-wise"
+				},
+				"hlminwidth": {
+					"type": "number",
+					"format": "integer",
+					"default": 10,
+					"description": "(when operating on the page/table level) minimum width of black/foreground horizontal separators in multiples of scale/capheight, counted piece-wise"
+				},
+				"gap_height": {
+					"type": "number",
+					"format": "float",
+					"default": 0.01,
+					"description": "(when operating on the page/table level) largest minimum pixel average in the horizontal or vertical profiles (across the binarized image) to still be regarded as a gap during recursive X-Y cut from lines to regions; needs to be larger when more foreground noise is present, reduce to avoid mistaking text for noise"
+				},
+				"gap_width": {
+					"type": "number",
+					"format": "float",
+					"default": 1.5,
+					"description": "(when operating on the page/table level) smallest width in multiples of scale/capheight of a valley in the horizontal or vertical profiles (across the binarized image) to still be regarded as a gap during recursive X-Y cut from lines to regions; needs to be smaller when more foreground noise is present, increase to avoid mistaking inter-line as paragraph gaps and inter-word as inter-column gaps"
+				},
+				"overwrite_separators": {
+					"type": "boolean",
+					"default": true,
+					"description": "(when operating on the page/table level) remove any existing SeparatorRegion elements; otherwise append"
 				},
 				"overwrite_regions": {
 					"type": "boolean",
 					"default": true,
-					"description": "remove any existing TextRegion elements (when operating on the page level)"
+					"description": "(when operating on the page/table level) remove any existing TextRegion elements; otherwise append"
 				},
 				"overwrite_lines": {
 					"type": "boolean",
 					"default": true,
-					"description": "remove any existing TextLine elements (when operating on the region level)"
+					"description": "(when operating on the region level) remove any existing TextLine elements; otherwise append"
 				},
 				"spread": {
 					"type": "number",
 					"format": "float",
 					"default": 2.4,
-					"description": "distance in points (pt) from the foreground to project text line (or text region) labels into the background"
+					"description": "distance in points (pt) from the foreground to project text line (or text region) labels into the background for polygonal contours; if zero, project half a scale/capheight"
 				}
 			}
 		},
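
The `gap_height` and `gap_width` descriptions above can be read as a valley test on a projection profile. The following is a minimal sketch of that reading, not the code in `ocrd_cis`: the helper names `profile` and `find_gaps` and the explicit `scale` argument are assumptions made for illustration, and the binarized image is taken to be a list of rows with 1 for foreground.

```python
def profile(binary, axis):
    """Mean foreground fraction per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) / len(row) for row in binary]
    return [sum(col) / len(col) for col in zip(*binary)]


def find_gaps(prof, gap_height=0.01, gap_width=1.5, scale=1.0):
    """Return (start, end) index pairs of valleys that qualify as gaps.

    A valley is a run of profile values <= gap_height; it counts as a gap
    only if it is at least gap_width * scale entries long.
    `scale` stands in for the estimated scale/capheight (hypothetical here).
    """
    min_len = gap_width * scale
    gaps, start = [], None
    # the infinite sentinel closes a valley that reaches the end of the profile
    for i, value in enumerate(list(prof) + [float('inf')]):
        if value <= gap_height:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                gaps.append((start, i))
            start = None
    return gaps


# two text rows, three blank rows, two text rows
page = [[1, 1, 0, 1]] * 2 + [[0, 0, 0, 0]] * 3 + [[1, 0, 1, 1]] * 2
print(find_gaps(profile(page, 0), scale=2.0))
```

Under this reading, raising `gap_height` tolerates more noise inside a valley, while raising `gap_width` demands wider valleys before a cut is made, which matches the trade-offs named in the parameter descriptions.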

diff --git a/ocrd_cis/ocropy/binarize.py b/ocrd_cis/ocropy/binarize.py
@@ -33,28 +33,30 @@
 LOG = getLogger('processor.OcropyBinarize')
 FALLBACK_FILEGRP_IMG = 'OCR-D-IMG-BIN'
 
-def binarize(pil_image, method='ocropy', maxskew=2, nrm=False):
+def binarize(pil_image, method='ocropy', maxskew=2, threshold=0.5, nrm=False):
     LOG.debug('binarizing %dx%d image with method=%s', pil_image.width, pil_image.height, method)
     if method == 'none':
+        # useful if the images are already binary,
+        # but lack image attribute `binarized`
         return pil_image, 0
     elif method == 'ocropy':
         # parameter defaults from ocropy-nlbin:
         array = pil2array(pil_image)
-        bin, angle = common.binarize(array, maxskew=maxskew, nrm=nrm)
+        bin, angle = common.binarize(array, maxskew=maxskew, threshold=threshold, nrm=nrm)
         return array2pil(bin), angle
     # equivalent to ocropy, but without deskewing:
     # elif method == 'kraken':
     #     image = kraken.binarization.nlbin(pil_image)
     #     return image, 0
-    # FIXME: add 'sauvola'
+    # FIXME: add 'sauvola' from OLD/ocropus-sauvola
     else:
         # Convert RGB to OpenCV
         #img = cv2.cvtColor(np.asarray(pil_image), cv2.COLOR_RGB2GRAY)
         img = np.asarray(pil_image.convert('L'))
 
         if method == 'global':
             # global thresholding
-            _, th = cv2.threshold(img,127,255,cv2.THRESH_BINARY)
+            _, th = cv2.threshold(img,threshold*255,255,cv2.THRESH_BINARY)
         elif method == 'otsu':
             # Otsu's thresholding
             _, th = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
@@ -182,21 +184,20 @@ def process_page(self, page, page_image, page_xywh, page_id, file_id):
         if 'angle' in page_xywh and page_xywh['angle']:
             # orientation has already been annotated (by previous deskewing),
             # so skip deskewing here:
-            bin_image, _ = binarize(page_image,
+            maxskew = 0
+        else:
+            maxskew = self.parameter['maxskew']
+        bin_image, angle = binarize(page_image,
                                     method=self.parameter['method'],
-                                    maxskew=0,
+                                    maxskew=maxskew,
+                                    threshold=self.parameter['threshold'],
                                     nrm=self.parameter['grayscale'])
-        else:
-            bin_image, angle = binarize(page_image,
-                                        method=self.parameter['method'],
-                                        maxskew=self.parameter['maxskew'],
-                                        nrm=self.parameter['grayscale'])
-            if angle:
-                features += ',deskewed'
-            page_xywh['angle'] = angle
-        bin_image = remove_noise(bin_image,
-                                 maxsize=self.parameter['noise_maxsize'])
+        if angle:
+            features += ',deskewed'
+        page_xywh['angle'] = angle
         if self.parameter['noise_maxsize']:
+            bin_image = remove_noise(
+                bin_image, maxsize=self.parameter['noise_maxsize'])
             features += ',despeckled'
         # annotate angle in PAGE (to allow consumers of the AlternativeImage
         # to do consistent coordinate transforms, and non-consumers
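
For reviewers who want to check the direction of the new `threshold` parameter, here is a small sketch of the `global` branch in plain Python. It mimics the `cv2.THRESH_BINARY` rule (pixels strictly above the cutoff become white) on nested lists instead of arrays; `global_threshold` and `count_foreground` are hypothetical helpers, not part of the module.

```python
def global_threshold(gray, threshold=0.5):
    """Binarize 8-bit grayscale rows.

    Pixels strictly greater than threshold * 255 become white (255),
    all others become black (0), as with cv2.THRESH_BINARY.
    """
    cutoff = threshold * 255
    return [[255 if px > cutoff else 0 for px in row] for row in gray]


def count_foreground(binary):
    """Number of black (foreground) pixels in a binarized image."""
    return sum(px == 0 for row in binary for px in row)


gray = [[0, 100, 128, 200, 255]]
print(count_foreground(global_threshold(gray, 0.5)))  # 2
print(count_foreground(global_threshold(gray, 0.9)))  # 4
```

A larger threshold turns more pixels black, which agrees with the "the larger the more/heavier foreground" wording in `ocrd-tool.json`.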

diff --git a/ocrd_cis/ocropy/clip.py b/ocrd_cis/ocropy/clip.py
@@ -247,7 +247,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours,
         for neighbour, neighbour_mask in neighbours:
             # find connected components that (only) belong to the neighbour:
             intruders = segment_mask * morph.keep_marked(parent_bin, neighbour_mask > 0) # overlaps neighbour
-            intruders -= morph.keep_marked(intruders, segment_mask - neighbour_mask > 0) # but exclusively
+            intruders = morph.remove_marked(intruders, segment_mask > neighbour_mask) # but exclusively
             num_intruders = np.count_nonzero(intruders)
             num_foreground = np.count_nonzero(segment_mask * parent_bin)
             if not num_intruders:
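
To make the intent of the `remove_marked` change easier to follow, the sketch below re-implements the two morphology helpers on nested lists. It is an illustration under assumptions, not the `morph` module itself: 4-connectivity is assumed, and `components` and `find_intruders` are hypothetical names. `keep_marked` keeps the connected components that touch a marker, `remove_marked` drops them.

```python
from collections import deque


def components(image):
    """Yield each 4-connected foreground component as a set of (y, x)."""
    height, width = len(image), len(image[0])
    seen = set()
    for y in range(height):
        for x in range(width):
            if not image[y][x] or (y, x) in seen:
                continue
            comp, queue = set(), deque([(y, x)])
            seen.add((y, x))
            while queue:
                cy, cx = queue.popleft()
                comp.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            yield comp


def keep_marked(image, markers):
    """Keep only the components that overlap at least one marker pixel."""
    out = [[0] * len(row) for row in image]
    for comp in components(image):
        if any(markers[y][x] for y, x in comp):
            for y, x in comp:
                out[y][x] = 1
    return out


def remove_marked(image, markers):
    """Drop every component that overlaps at least one marker pixel."""
    kept = keep_marked(image, markers)
    return [[int(bool(px) and not k) for px, k in zip(row, krow)]
            for row, krow in zip(image, kept)]


def find_intruders(parent_bin, segment_mask, neighbour_mask):
    """Components inside the segment that reach the neighbour but not
    the part of the segment that lies outside the neighbour."""
    overlapping = keep_marked(parent_bin, neighbour_mask)
    inside = [[int(s and o) for s, o in zip(srow, orow)]
              for srow, orow in zip(segment_mask, overlapping)]
    exclusive = [[int(s > n) for s, n in zip(srow, nrow)]
                 for srow, nrow in zip(segment_mask, neighbour_mask)]
    return remove_marked(inside, exclusive)


segment = [[1, 1, 1, 1, 1, 0, 0]]
neighbour = [[0, 0, 0, 1, 1, 1, 1]]
print(find_intruders([[1, 1, 0, 1, 1, 0, 1]], segment, neighbour))
```

In the example the component at columns 3 and 4 lies entirely in the overlap, so it is reported as an intruder. A component that also extends into the segment's exclusive area (columns 0 to 2) would be removed from the intruder set and therefore kept in the segment.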