ADD script to create a simplified version of hocr-files #152

Open
wants to merge 15 commits into
base: master
from
74 changes: 74 additions & 0 deletions hocr-simplify
@@ -0,0 +1,74 @@
#!/usr/bin/env python

# change level of typesetting and/or remove properties
# to create a simplified hocr-version

from __future__ import print_function
import argparse
import re
import sys
import os

from lxml import etree, html

parser = argparse.ArgumentParser(
    description=('change level of typesetting and/or '
                 'remove properties to create '
                 'a simplified hocr-version'))
properties = ['baseline', 'bbox', 'cflow', 'cuts', 'hardbreak', 'image',
              'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly',
              'scan_res', 'textangle', 'x_bboxes', 'x_font', 'x_fsize',
              'x_confs', 'x_scanner', 'x_source', 'x_wconf']

Collaborator

It would also be nice to have an option to delete the id and/or dir attributes, but those are attributes of their own.

Author

Removing attributes is now implemented.

parser.add_argument('file', nargs='?', default=sys.stdin)
parser.add_argument('-t', '--typesetting', type=str,
                    choices=['glyph', 'word', 'line', 'par', 'carea', 'page'],
Collaborator

Does the glyph choice do anything for simplification? I haven't seen an hOCR example with an element inside an ocr-glyph.

Author

I thought I would need it to remove char choices, but I've implemented that in another place, so I removed the "glyph" typesetting option.

                    help='Maximum level of typesetting')
parser.add_argument('-r', '--remove-properties', nargs='+',
                    help='List of properties: {}'.format(','.join(properties)))
parser.add_argument('fileout', nargs='?',
                    help="Outputpath, default: print to terminal")
Collaborator

s/Outputpath/Output path/

Collaborator

(Also in the comment below.)

Author

Solved.

parser.add_argument('-v', '--verbose',
                    action='store_true', help='Verbose, default: %(default)s')

args = parser.parse_args()

doc = html.parse(args.file)
# change level of typesetting
if args.typesetting:
    # set maximum level of typesetting
    if args.typesetting in ["word"]:
        args.typesetting = "ocrx_" + args.typesetting
    else:
        args.typesetting = "ocr_" + args.typesetting

    # apply new level of typesetting
    for node in doc.xpath("//*[@class='{}']".format(args.typesetting)):
        if args.verbose:
            print(re.sub(r'\s+', '\x20', node.text_content()).strip())
        node.text = node.text_content().strip()
        for child in list(node):
            node.remove(child)

# remove properties
if args.remove_properties:
    for node in doc.xpath("//*[@title]"):
        title = node.get("title")
        for prop in title.split(";"):
            (key, value) = prop.strip().split(None, 1)
Collaborator

Why do you use None here and not the white-space character to split key and value?

Author

To be fair, I took this part from hocr-cut.
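
For context on the question above: `str.split(None, 1)` splits on any run of whitespace, while `str.split(" ", 1)` splits on exactly one space character. A small illustration, using made-up property strings rather than real hOCR output:

```python
# Hypothetical hOCR property strings, for illustration only.
prop_spaces = "bbox  10 20 30 40"   # two spaces after the key
prop_tab = "x_wconf\t93"            # tab between key and value

# sep=None: split once on any run of whitespace.
key_a, value_a = prop_spaces.split(None, 1)
key_b, value_b = prop_tab.split(None, 1)

# sep=" ": split once on exactly one space character.
parts_space = prop_spaces.split(" ", 1)   # value keeps a leading space
parts_tab = prop_tab.split(" ", 1)        # no space found, so no split
```

With `None`, both strings yield a clean key/value pair; with `" "`, the first value keeps a leading space and the tab-separated string is not split at all.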

            if key in args.remove_properties:
                if args.verbose:
                    print("Replaced :{}".format(title))
                title = title.replace(prop + ";", "").strip()
Collaborator

This does not work when the property is the last one (no semi-colon then).
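
To make the failure concrete, here is a small sketch that mimics the replace-based removal from the diff. The helper name `remove_by_replace` and the sample title are made up for illustration:

```python
def remove_by_replace(title, unwanted):
    """Mimic the replace-based removal from the diff (illustrative helper)."""
    for prop in title.split(";"):
        key = prop.strip().split(None, 1)[0]
        if key in unwanted:
            # Only matches when the property is followed by a semicolon.
            title = title.replace(prop + ";", "").strip()
    return title


title = "bbox 0 0 100 50; x_wconf 93"
first_removed = remove_by_replace(title, ["bbox"])
last_removed = remove_by_replace(title, ["x_wconf"])
```

Removing `bbox` works, but removing `x_wconf` leaves the title unchanged because the last property has no trailing semicolon.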

Collaborator

Alternatively, you can also try something like this, which looks much shorter (code not yet tested):

title = node.get("title")
title = re.sub(r"\s?(%s)\s+[^;]*;?" % "|".join(args.remove_properties), "", title)

BTW don't you have to save it back in the doc somehow?

Contributor

Collaborator

Yeah, but we don't need to parse it in detail; we just have to delete the parameters that are no longer needed, together with their values.

Author

Thanks for the suggestions. I reworked this part, but without regexp. Also, I had to replace the double quotation marks with single ones.

node.set('title', ';'.join([prop.replace("\"","'") for prop in title.split(";") if prop.strip().split(None, 1)[0] not in args.remove_properties]))
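
The same filtering can be spelled out as a small function, which also makes it easier to handle an empty segment (for example after a trailing semicolon), where `split(None, 1)[0]` would raise an `IndexError`. This is only a sketch: the name `filter_title` is hypothetical, and unlike the one-liner it normalizes the separator to `"; "`.

```python
def filter_title(title, unwanted):
    """Return `title` without the properties whose key is in `unwanted`."""
    kept = []
    for prop in title.split(";"):
        prop = prop.strip()
        if not prop:
            # Skip empty segments, e.g. after a trailing semicolon.
            continue
        key = prop.split(None, 1)[0]
        if key not in unwanted:
            # Same quote replacement as in the one-liner above.
            kept.append(prop.replace('"', "'"))
    return "; ".join(kept)
```

It would be used as `node.set('title', filter_title(title, args.remove_properties))`.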


Collaborator

We also have to update the ocr-capabilities meta tag.

Author

Solved.
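
The updated code is not shown in this thread. As a rough sketch of the idea only, assuming the capabilities are a space-separated list of class names and assuming the nesting order below (`LEVELS` and `simplify_capabilities` are hypothetical names, not taken from the PR), one could drop the entries that are finer than the chosen level:

```python
# Assumed nesting order from coarsest to finest; adjust it to the classes
# the hOCR producer actually emits.
LEVELS = ["ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word"]


def simplify_capabilities(content, max_level):
    """Drop capability entries that are finer than `max_level`."""
    dropped = set(LEVELS[LEVELS.index(max_level) + 1:])
    return " ".join(c for c in content.split() if c not in dropped)
```

The result would then be written back to the `content` attribute of the `ocr-capabilities` meta element. Removed properties would need similar handling if they are listed there.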

# if no outputpath is given, print to terminal
if args.fileout is None:
    print(etree.tostring(doc, pretty_print=True).decode('UTF-8'))
else:
    # create output path if needed
    if not os.path.isdir(os.path.dirname(args.fileout)):
        os.makedirs(os.path.dirname(args.fileout))

    # write new hocr-files
    with open(args.fileout, "w") as f:
        f.writelines(etree.tostring(doc, pretty_print=True).decode('UTF-8'))
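
One thing that may be worth checking in the output handling above: `os.path.dirname` returns an empty string for a bare file name, and `os.makedirs('')` raises an error. A minimal sketch of a guarded variant follows; the helper name `write_output` is hypothetical, and `io.open` is used only to get an explicit encoding on both Python 2 and 3:

```python
import io
import os


def write_output(path, text):
    """Write `text` to `path`, creating the parent directory if needed."""
    outdir = os.path.dirname(path)
    # dirname is '' for a bare file name; skip directory creation then.
    if outdir and not os.path.isdir(outdir):
        os.makedirs(outdir)
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

In the script this would be called with `args.fileout` and the decoded output of `etree.tostring`.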
11 changes: 11 additions & 0 deletions test/hocr-simplify/hocr-simplify.tsht
@@ -0,0 +1,11 @@
#!/usr/bin/env tsht
TESTDATA="../testdata"
SIMPLEFILE="./tess.simple.hocr"

plan 5
Collaborator (@zuphilip, Jul 27, 2019)

That is the number of test cases, i.e., it should be 2 here.

Author

Changed plan 5 to plan 3. I added two more test cases with the new char choice options.


after () {
    rm -f "$SIMPLEFILE"
}
hocr-simplify "$TESTDATA/tess.hocr" -t page > "$SIMPLEFILE" || fail 'hocr-simplify'
equals 3870 $(ls -l "$SIMPLEFILE" | cut -d " " -f5 ) 'filesize == 3870'
2 changes: 1 addition & 1 deletion test/smoke.tsht
@@ -1,6 +1,6 @@
#!/usr/bin/env tsht

-for f in check combine eval eval-geom eval-lines extract-g1000 extract-images lines merge-dc pdf split;do
+for f in check combine eval eval-geom eval-lines extract-g1000 extract-images lines merge-dc pdf split simplify;do
     exec_ok "hocr-$f" "--help"
     exec_ok "hocr-$f" "-h"
 done