Merged
28 commits
1c6096b
deskew: boost performance by rotating via PIL instead of scipy
May 9, 2020
ca4d831
segment/resegment: require images to be binarized already...
May 9, 2020
74710db
segment/resegment: avoid (rare) invalid coordinates...
May 9, 2020
4deb46a
segment: line-segment in table regions, too...
May 9, 2020
8f38480
ocrolib: make basic scale estimation zoomable (DPI-relative)!
May 9, 2020
9a17bc8
ocrolib: minor improvements...
May 9, 2020
30665ac
common: major segmentation improvements...
May 9, 2020
954973c
segment: adapt to changes and...
May 9, 2020
24e3ead
resegment: adapt to changes and...
May 9, 2020
faad379
binarize: expose 'threshold' parameter
May 9, 2020
3bb1113
ocrolib: fix loading uncompressed model (Py2/3)
May 9, 2020
f242c51
segment: incremental annotation...
May 9, 2020
dfd0138
common/segment: avoid spreading into sepmask...
May 9, 2020
26cb3a1
segment: add reading order...
May 9, 2020
dc55bee
segment: separate level-of-operation=page...
May 9, 2020
bbf1c29
ocrolib: fallback heuristic for basic scale estimation
May 9, 2020
b928b71
ocrolib: fix math-morphology operations...
May 9, 2020
f756167
recognize: fix regression from 48a89e92
May 9, 2020
90257da
ocrolib.morph: replace SciPy with faster OpenCV morphology/component …
May 9, 2020
625a47a
ocrolib.morph: add utility for performance comparisons
May 9, 2020
0fe7374
ocrolib: add proper reading_order function...
May 9, 2020
1d85fca
common: further improve segmentation (follow-up on f49c20f)...
May 9, 2020
95eff26
common: rewrite of lines2regions via recursive X-Y cut...
May 9, 2020
9b82980
segment: adapt to changes in common and...
May 9, 2020
dddf5cc
segment: add an AlternativeImage clipping non-text to bg...
May 9, 2020
f242984
re/segment: fix polygons (keep detected polygon paths _open_)
May 9, 2020
32786a6
make LGTM checker happy
May 10, 2020
b505d65
segment: don't try to add if no reading order group exists
May 10, 2020
49 changes: 27 additions & 22 deletions README.md
@@ -111,7 +111,7 @@ Arguments:
* `--mets` path to METS file in the workspace

### ocrd-cis-ocropy-train
The ocropy-train tool can be used to train LSTM models.
The `ocropy-train` tool can be used to train LSTM models.
It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.
Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.
```sh
@@ -122,8 +122,9 @@ ocrd-cis-ocropy-train \
```

### ocrd-cis-ocropy-clip
The ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace.
It runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).
The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
(Use this to suppress separators and neighbouring text.)
```sh
ocrd-cis-ocropy-clip \
--input-file-grp OCR-D-SEG-LINE \
@@ -133,8 +134,9 @@ ocrd-cis-ocropy-clip \
```

### ocrd-cis-ocropy-resegment
The ocropy-resegment tool can be used to remove overlap between lines of a workspace.
It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
(Use this to polygonalise text lines poorly segmented, e.g. via bounding boxes.)
```sh
ocrd-cis-ocropy-resegment \
--input-file-grp OCR-D-SEG-LINE \
@@ -144,8 +146,9 @@ ocrd-cis-ocropy-resegment \
```

### ocrd-cis-ocropy-segment
The ocropy-segment tool can be used to segment regions into lines.
It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.
The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
(Does not detect tables or images.)
```sh
ocrd-cis-ocropy-segment \
--input-file-grp OCR-D-SEG-BLOCK \
@@ -155,8 +158,9 @@ ocrd-cis-ocropy-segment \
```

### ocrd-cis-ocropy-deskew
The ocropy-deskew tool can be used to deskew pages / regions of a workspace.
It runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
(Does not include orientation detection.)
```sh
ocrd-cis-ocropy-deskew \
--input-file-grp OCR-D-SEG-LINE \
@@ -166,8 +170,8 @@ ocrd-cis-ocropy-deskew \
```

### ocrd-cis-ocropy-denoise
The ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace.
It runs the Ocropy "nlbin" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
```sh
ocrd-cis-ocropy-denoise \
--input-file-grp OCR-D-SEG-LINE-DES \
@@ -177,8 +181,8 @@ ocrd-cis-ocropy-denoise \
```
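
The despeckling described for `ocropy-denoise` can be sketched in a few lines. This is only an illustration under stated assumptions, not the processor's code: the helper name `remove_small_components`, the list-of-rows image format and the use of 4-connectivity are all made up for the example, and only foreground specks are handled (the tool also removes small white components).

```python
from collections import deque

def remove_small_components(image, maxsize):
    """Clear foreground components (value 1) of at most `maxsize` pixels.

    `image` is a list of rows of 0/1 values; a cleaned copy is returned.
    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # breadth-first search collects one 4-connected component
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # components no larger than maxsize are treated as noise
            if len(component) <= maxsize:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result

page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_components(page, maxsize=1)
```

With `maxsize=1` the two isolated pixels are cleared and the 2x2 block survives; with `maxsize=0` nothing is removed, which mirrors the documented meaning of `noise_maxsize` (0 deactivates denoising).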

### ocrd-cis-ocropy-binarize
The ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace.
It runs the Ocropy "nlbin" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
```sh
ocrd-cis-ocropy-binarize \
--input-file-grp OCR-D-SEG-LINE-DES \
@@ -188,8 +192,8 @@ ocrd-cis-ocropy-binarize \
```

### ocrd-cis-ocropy-dewarp
The ocropy-dewarp tool can be used to dewarp text lines of a workspace.
It runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
```sh
ocrd-cis-ocropy-dewarp \
--input-file-grp OCR-D-SEG-LINE-BIN \
@@ -199,8 +203,8 @@ ocrd-cis-ocropy-dewarp \
```

### ocrd-cis-ocropy-recognize
The ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace.
It runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
```sh
ocrd-cis-ocropy-recognize \
--input-file-grp OCR-D-SEG-LINE-DEW \
@@ -232,18 +236,19 @@ place them into: /usr/share/tesseract-ocr/4.00/tessdata

A decent pipeline might look like this:

0. page-level binarization
1. page-level cropping
2. page-level binarization
2. (page-level binarization)
3. page-level deskewing
4. page-level dewarping
4. (page-level dewarping)
5. region segmentation
6. region-level clipping
7. region-level deskewing
7. (region-level deskewing)
8. line segmentation
9. line-level clipping or resegmentation
9. (line-level clipping or resegmentation)
10. line-level dewarping
11. line-level recognition
12. line-level alignment
12. (line-level alignment and post-correction)

If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
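
The omission rule above can be written down as a small lookup. This is a sketch for illustration, not part of the project: the names `PIPELINE` and `plan` and both keyword arguments are hypothetical, and where the list shows two wordings for a step, the parenthesised one is used.

```python
# Step numbers and names as in the list above (assumed reading of the diff).
PIPELINE = {
    0: "page-level binarization",
    1: "page-level cropping",
    2: "(page-level binarization)",
    3: "page-level deskewing",
    4: "(page-level dewarping)",
    5: "region segmentation",
    6: "region-level clipping",
    7: "(region-level deskewing)",
    8: "line segmentation",
    9: "(line-level clipping or resegmentation)",
    10: "line-level dewarping",
    11: "line-level recognition",
    12: "(line-level alignment and post-correction)",
}

def plan(use_gt=False, non_overlapping_segmentation=False):
    """Return the step numbers to run, in order (hypothetical helper)."""
    skipped = set()
    if use_gt:
        # ground truth already provides cropping, regions and lines
        skipped = {1, 5, 8}
    elif non_overlapping_segmentation:
        # no overlaps, so nothing to clip or resegment
        skipped = {6, 9}
    return [number for number in PIPELINE if number not in skipped]

for number in plan(use_gt=True):
    print(number, PIPELINE[number])
```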

73 changes: 58 additions & 15 deletions ocrd_cis/ocrd-tool.json
@@ -1,6 +1,6 @@
{
"git_url": "https://github.com/cisocrgroup/ocrd_cis",
"version": "0.0.6",
"version": "0.0.7",
"tools": {
"ocrd-cis-ocropy-binarize": {
"executable": "ocrd-cis-ocropy-binarize",
@@ -27,21 +27,29 @@
"method": {
"type": "string",
"enum": ["none", "global", "otsu", "gauss-otsu", "ocropy"],
"description": "binarization method to use (only ocropy will include deskewing)",
"description": "binarization method to use (only 'ocropy' will include deskewing and denoising)",
"default": "ocropy"
},
"threshold": {
"type": "number",
"format": "float",
"description": "for the 'ocropy' and 'global' methods, black/white threshold to apply on the whitelevel normalized image (the larger the more/heavier foreground)",
"default": 0.5
},
"grayscale": {
"type": "boolean",
"description": "for the ocropy method, produce grayscale-normalized instead of thresholded image",
"description": "for the 'ocropy' method, produce grayscale-normalized instead of thresholded image",
"default": false
},
"maxskew": {
"type": "number",
"description": "modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)",
"format": "float",
"description": "modulus of maximum skewing angle (in degrees) to detect (larger will be slower, 0 will deactivate deskewing)",
"default": 0.0
},
"noise_maxsize": {
"type": "number",
"format": "int",
"description": "maximum pixel number for connected components to regard as noise (0 will deactivate denoising)",
"default": 0
},
@@ -297,47 +305,82 @@
"output_file_grp": [
"OCR-D-SEG-LINE"
],
"description": "Segment pages into regions or regions into lines with ocropy",
"description": "Segment pages into regions and lines, tables into cells and lines, or regions into lines with ocropy",
"parameters": {
"dpi": {
"type": "number",
"format": "float",
"description": "pixel density in dots per inch (overrides any meta-data in the images); disabled when negative",
"description": "pixel density in dots per inch (overrides any meta-data in the images); disabled when negative; when disabled and no meta-data is found, 300 is assumed",
"default": -1
},
"level-of-operation": {
"type": "string",
"enum": ["page", "region"],
"description": "PAGE XML hierarchy level to read images from",
"enum": ["page", "table", "region"],
"description": "PAGE XML hierarchy level to read images from and add elements to",
"default": "region"
},
"maxcolseps": {
"type": "number",
"format": "integer",
"default": 2,
"description": "number of white/background column separators to try (when operating on the page level)"
"default": 20,
"description": "(when operating on the page/table level) maximum number of white/background column separators to detect, counted piece-wise"
},
"maxseps": {
"type": "number",
"format": "integer",
"default": 5,
"description": "number of black/foreground column separators to try, counted individually as lines (when operating on the page level)"
"default": 20,
"description": "(when operating on the page/table level) number of black/foreground column separators to detect (and suppress), counted piece-wise"
},
"maximages": {
"type": "number",
"format": "integer",
"default": 10,
"description": "(when operating on the page level) maximum number of black/foreground very large components to detect (and suppress), counted piece-wise"
},
"csminheight": {
"type": "number",
"format": "integer",
"default": 4,
"description": "(when operating on the page/table level) minimum height of white/background or black/foreground column separators in multiples of scale/capheight, counted piece-wise"
},
"hlminwidth": {
"type": "number",
"format": "integer",
"default": 10,
"description": "(when operating on the page/table level) minimum width of black/foreground horizontal separators in multiples of scale/capheight, counted piece-wise"
},
"gap_height": {
"type": "number",
"format": "float",
"default": 0.01,
"description": "(when operating on the page/table level) largest minimum pixel average in the horizontal or vertical profiles (across the binarized image) to still be regarded as a gap during recursive X-Y cut from lines to regions; needs to be larger when more foreground noise is present, reduce to avoid mistaking text for noise"
},
"gap_width": {
"type": "number",
"format": "float",
"default": 1.5,
"description": "(when operating on the page/table level) smallest width in multiples of scale/capheight of a valley in the horizontal or vertical profiles (across the binarized image) to still be regarded as a gap during recursive X-Y cut from lines to regions; needs to be smaller when more foreground noise is present, increase to avoid mistaking inter-line as paragraph gaps and inter-word as inter-column gaps"
},
"overwrite_separators": {
"type": "boolean",
"default": true,
"description": "(when operating on the page/table level) remove any existing SeparatorRegion elements; otherwise append"
},
"overwrite_regions": {
"type": "boolean",
"default": true,
"description": "remove any existing TextRegion elements (when operating on the page level)"
"description": "(when operating on the page/table level) remove any existing TextRegion elements; otherwise append"
},
"overwrite_lines": {
"type": "boolean",
"default": true,
"description": "remove any existing TextLine elements (when operating on the region level)"
"description": "(when operating on the region level) remove any existing TextLine elements; otherwise append"
},
"spread": {
"type": "number",
"format": "float",
"default": 2.4,
"description": "distance in points (pt) from the foreground to project text line (or text region) labels into the background"
"description": "distance in points (pt) from the foreground to project text line (or text region) labels into the background for polygonal contours; if zero, project half a scale/capheight"
}
}
},
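
To make the roles of `gap_height` and `gap_width` more concrete, here is a minimal, generic recursive X-Y cut on a tiny binary image. It is a sketch under simplifying assumptions and not the implementation in this PR: the function names are invented, the image is a list of 0/1 rows, `gap_width` is counted in pixels rather than in multiples of scale/capheight, and the cut works on pixels directly instead of on detected text lines.

```python
def profile(image, axis):
    """Mean foreground per row (axis=0) or per column (axis=1)."""
    lines = image if axis == 0 else list(zip(*image))
    return [sum(line) / len(line) for line in lines]

def find_gaps(prof, gap_height, gap_width):
    """Inner runs of values <= gap_height that span >= gap_width entries."""
    gaps, start = [], None
    for i, value in enumerate(prof + [float("inf")]):  # sentinel closes a trailing run
        if value <= gap_height:
            if start is None:
                start = i
        elif start is not None:
            # runs touching the border are ignored: cutting there splits nothing
            if start > 0 and i < len(prof) and i - start >= gap_width:
                gaps.append((start, i))
            start = None
    return gaps

def xy_cut(image, gap_height=0.01, gap_width=1, top=0, left=0):
    """Split `image` recursively at its widest gap.

    Returns leaf boxes as (top, left, bottom, right) in the coordinates
    of the outermost image.
    """
    for axis in (0, 1):  # try horizontal cuts first, then vertical ones
        gaps = find_gaps(profile(image, axis), gap_height, gap_width)
        if not gaps:
            continue
        start, end = max(gaps, key=lambda gap: gap[1] - gap[0])
        if axis == 0:
            return (xy_cut(image[:start], gap_height, gap_width, top, left) +
                    xy_cut(image[end:], gap_height, gap_width, top + end, left))
        first = [row[:start] for row in image]
        second = [row[end:] for row in image]
        return (xy_cut(first, gap_height, gap_width, top, left) +
                xy_cut(second, gap_height, gap_width, top, left + end))
    return [(top, left, top + len(image), left + len(image[0]))]

page = [
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
boxes = xy_cut(page, gap_width=1)
coarse = xy_cut(page, gap_width=2)
```

With `gap_width=1` the one-pixel column gap splits the upper block into two columns; with `gap_width=2` it is too narrow to count, so only the two-row horizontal gap is cut. This matches the advice in the parameter description to increase `gap_width` when inter-word gaps are mistaken for inter-column gaps.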
33 changes: 17 additions & 16 deletions ocrd_cis/ocropy/binarize.py
@@ -33,28 +33,30 @@
LOG = getLogger('processor.OcropyBinarize')
FALLBACK_FILEGRP_IMG = 'OCR-D-IMG-BIN'

def binarize(pil_image, method='ocropy', maxskew=2, nrm=False):
def binarize(pil_image, method='ocropy', maxskew=2, threshold=0.5, nrm=False):
LOG.debug('binarizing %dx%d image with method=%s', pil_image.width, pil_image.height, method)
if method == 'none':
# useful if the images are already binary,
# but lack image attribute `binarized`
return pil_image, 0
elif method == 'ocropy':
# parameter defaults from ocropy-nlbin:
array = pil2array(pil_image)
bin, angle = common.binarize(array, maxskew=maxskew, nrm=nrm)
bin, angle = common.binarize(array, maxskew=maxskew, threshold=threshold, nrm=nrm)
return array2pil(bin), angle
# equivalent to ocropy, but without deskewing:
# elif method == 'kraken':
# image = kraken.binarization.nlbin(pil_image)
# return image, 0
# FIXME: add 'sauvola'
# FIXME: add 'sauvola' from OLD/ocropus-sauvola
else:
# Convert RGB to OpenCV
#img = cv2.cvtColor(np.asarray(pil_image), cv2.COLOR_RGB2GRAY)
img = np.asarray(pil_image.convert('L'))

if method == 'global':
# global thresholding
_, th = cv2.threshold(img,127,255,cv2.THRESH_BINARY)
_, th = cv2.threshold(img,threshold*255,255,cv2.THRESH_BINARY)
elif method == 'otsu':
# Otsu's thresholding
_, th = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
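
The effect of passing `threshold*255` to the 'global' method can be checked without OpenCV. The sketch below mimics what `cv2.threshold(..., cv2.THRESH_BINARY)` does per pixel (values strictly above the cutoff become the maximum, all others become 0); the function name `global_threshold` and the nested-list image are assumptions for the example.

```python
def global_threshold(gray_rows, threshold=0.5):
    """Per-pixel equivalent of THRESH_BINARY with cutoff threshold*255.

    `gray_rows` is a list of rows of 8-bit gray values (hypothetical format).
    """
    cutoff = threshold * 255
    return [[255 if pixel > cutoff else 0 for pixel in row] for row in gray_rows]

gray = [[10, 120, 130, 250]]
default = global_threshold(gray)       # cutoff 127.5
heavier = global_threshold(gray, 0.6)  # cutoff near 153
```

For 8-bit integer pixels the default cutoff of 127.5 selects the same pixels as the previously hard-coded 127. Raising the threshold turns more pixels black, that is, it yields more foreground, as the parameter description states.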
@@ -182,21 +184,20 @@ def process_page(self, page, page_image, page_xywh, page_id, file_id):
if 'angle' in page_xywh and page_xywh['angle']:
# orientation has already been annotated (by previous deskewing),
# so skip deskewing here:
bin_image, _ = binarize(page_image,
maxskew = 0
else:
maxskew = self.parameter['maxskew']
bin_image, angle = binarize(page_image,
method=self.parameter['method'],
maxskew=0,
maxskew=maxskew,
threshold=self.parameter['threshold'],
nrm=self.parameter['grayscale'])
else:
bin_image, angle = binarize(page_image,
method=self.parameter['method'],
maxskew=self.parameter['maxskew'],
nrm=self.parameter['grayscale'])
if angle:
features += ',deskewed'
page_xywh['angle'] = angle
bin_image = remove_noise(bin_image,
maxsize=self.parameter['noise_maxsize'])
if angle:
features += ',deskewed'
page_xywh['angle'] = angle
if self.parameter['noise_maxsize']:
bin_image = remove_noise(
bin_image, maxsize=self.parameter['noise_maxsize'])
features += ',despeckled'
# annotate angle in PAGE (to allow consumers of the AlternativeImage
# to do consistent coordinate transforms, and non-consumers
2 changes: 1 addition & 1 deletion ocrd_cis/ocropy/clip.py
@@ -247,7 +247,7 @@ def process_segment(self, segment, segment_mask, segment_polygon, neighbours,
for neighbour, neighbour_mask in neighbours:
# find connected components that (only) belong to the neighbour:
intruders = segment_mask * morph.keep_marked(parent_bin, neighbour_mask > 0) # overlaps neighbour
intruders -= morph.keep_marked(intruders, segment_mask - neighbour_mask > 0) # but exclusively
intruders = morph.remove_marked(intruders, segment_mask > neighbour_mask) # but exclusively
num_intruders = np.count_nonzero(intruders)
num_foreground = np.count_nonzero(segment_mask * parent_bin)
if not num_intruders: