ADD script to create a simplified version of hocr-files #152
base: master
Changes from 5 commits
ba74b3e
7385e5a
4f0a271
9160877
50f4855
e264c2f
be4bb77
4fb6a4c
95fa53f
009d746
a762023
4c44dea
cbc78fa
a050fbc
6a4e4ff
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
#!/usr/bin/env python | ||
|
||
# change level of typesetting and/or remove properties | ||
# to create a simplified hocr-version | ||
|
||
from __future__ import print_function | ||
import argparse | ||
import re | ||
import sys | ||
import os | ||
|
||
from lxml import etree, html | ||
|
||
parser = argparse.ArgumentParser( | ||
    description=('change level of typesetting and/or ' | ||
                 'remove properties to create ' | ||
                 'a simplified hocr-version')) | ||
properties = ['baseline', 'bbox', 'cflow', 'cuts', 'hardbreak', 'image', | ||
              'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly', | ||
              'scan_res', 'textangle', 'x_bboxes', 'x_font', 'x_fsize', | ||
              'x_confs', 'x_scanner', 'x_source', 'x_wconf'] | ||
|
||
parser.add_argument('file', nargs='?', default=sys.stdin) | ||
parser.add_argument('-t', '--typesetting', type=str, | ||
choices=['glyph', 'word', 'line', 'par', 'carea', 'page'], | ||
Review comment: Is the choice
Review comment: I thought I would need them to remove char choices, but I've implemented that in another place, so I removed the "glyph" typesetting option. |
||
help='Maximum level of typesetting') | ||
parser.add_argument('-r', '--remove-properties', nargs='+', | ||
help='List of properties: {}'.format(','.join(properties))) | ||
parser.add_argument('fileout', nargs='?', | ||
help="Outputpath, default: print to terminal") | ||
Review comment: s/Outputpath/Output path/
Review comment: (Also in the comment below.)
Review comment: Solved. |
||
parser.add_argument('-v', '--verbose', | ||
action='store_true', help='Verbose, default: %(default)s') | ||
|
||
args = parser.parse_args() | ||
|
||
doc = html.parse(args.file) | ||
# change level of typesetting | ||
if args.typesetting: | ||
    # set maximum level of typesetting | ||
    if args.typesetting in ["word"]: | ||
        args.typesetting = "ocrx_" + args.typesetting | ||
    else: | ||
        args.typesetting = "ocr_" + args.typesetting | ||
|
||
    # apply new level of typesetting | ||
    for node in doc.xpath("//*[@class='{}']".format(args.typesetting)): | ||
        if args.verbose: | ||
            print(re.sub(r'\s+', '\x20', node.text_content()).strip()) | ||
        node.text = node.text_content().strip() | ||
        for child in list(node): | ||
            node.remove(child) | ||
|
||
# remove properties | ||
if args.remove_properties: | ||
    for node in doc.xpath("//*[@title]"): | ||
        title = node.get("title") | ||
        for prop in title.split(";"): | ||
            # 'value' instead of 'args', so the parsed arguments are not shadowed | ||
            (key, value) = prop.strip().split(None, 1) | ||
Review comment: Why do you use
Review comment: To be fair, I took this part from hocr-cut. |
||
            if key in args.remove_properties: | ||
                if args.verbose: | ||
                    print("Replaced: {}".format(title)) | ||
                title = title.replace(prop + ";", "").strip() | ||
Review comment: This does not work when the property is the last one (no semicolon then).
Review comment: Alternatively, you can also try something like this, which looks much shorter (code not yet tested):
    title = node.get("title")
    title = re.sub(r"\s?(%s)\s+[^;$];?*" % args.remove_properties.join("|"), "")
BTW, don't you have to save it back in the doc somehow?
Review comment: You could use https://github.com/kba/hocr-spec-python/blob/master/hocr_spec/spec.py#L530 to parse the properties.
Review comment: Yeah, but we don't need to parse it in detail; we just have to delete the parameters that are no longer needed, together with their values.
Review comment: Thanks for the suggestions. I reworked this part, but without regexp. I also had to replace the double quotation marks with single ones.
|
||
|
||
Review comment: We also have to update the
Review comment: Solved. |
||
# if no outputpath is given, print to terminal | ||
if args.fileout is None: | ||
    print(etree.tostring(doc, pretty_print=True).decode('UTF-8')) | ||
else: | ||
    # create output path if needed (a bare filename has no directory part) | ||
    if not os.path.isdir(os.path.dirname(args.fileout) or '.'): | ||
        os.makedirs(os.path.dirname(args.fileout)) | ||
|
||
    # write new hocr-files | ||
    with open(args.fileout, "w") as f: | ||
        f.write(etree.tostring(doc, pretty_print=True).decode('UTF-8')) |
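The review thread on the property-removal loop raises two points: the last property in a `title` has no trailing semicolon, and the modified title has to be written back to the node. The following is only a sketch of one way to handle both without a regular expression. The helper name `strip_properties` is hypothetical and not part of this PR, and the sample title string is made up for illustration.

```python
def strip_properties(title, remove):
    """Return an hOCR title string without the properties named in `remove`.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    kept = []
    for prop in title.split(";"):
        prop = prop.strip()
        if not prop:
            continue  # tolerate a trailing semicolon or empty segment
        key = prop.split(None, 1)[0]  # property name is the first token
        if key not in remove:
            kept.append(prop)
    return "; ".join(kept)


# Made-up sample title; the last property has no trailing semicolon.
sample = "bbox 36 92 580 122; baseline 0 -5; x_wconf 95"
print(strip_properties(sample, {"baseline", "x_wconf"}))

# In the script, the result would still have to be stored on the element,
# e.g. node.set("title", strip_properties(node.get("title"), to_remove)).
```

Rebuilding the string from the properties that are kept avoids depending on where the semicolons sit, which is the case the `title.replace(prop + ";", "")` approach misses.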
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
#!/usr/bin/env tsht | ||
TESTDATA="../testdata" | ||
SIMPLEFILE="./tess.simple.hocr" | ||
|
||
plan 5 | ||
Review comment: That is the number of test cases, i.e. should be 2 here.
Review comment: Changed plan 5 to plan 3. I added two more test cases with the new char choice options. |
||
|
||
after () { | ||
rm -f "$SIMPLEFILE" | ||
} | ||
hocr-simplify "$TESTDATA/tess.hocr" -t page > "$SIMPLEFILE" || fail 'hocr-simplify' | ||
equals 3870 $(ls -l "$SIMPLEFILE" | cut -d " " -f5 ) 'filesize == 3870' |
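The test above compares a byte count, which depends on serialization details. As a possible complement, a structural check could assert that the flattened elements no longer have children. This sketch uses the standard-library `xml.etree.ElementTree` rather than `lxml`, assumes well-formed XHTML input, and runs on a made-up hOCR fragment. The `flatten` function is a hypothetical stand-in for the script's typesetting step and, unlike the script, also normalizes internal whitespace.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed hOCR fragment for illustration only.
HOCR = """<div class="ocr_page" title="bbox 0 0 100 50">
  <span class="ocr_line" title="bbox 0 0 100 20">
    <span class="ocrx_word" title="bbox 0 0 40 20">Hello</span>
    <span class="ocrx_word" title="bbox 50 0 100 20">world</span>
  </span>
</div>"""


def flatten(root, css_class):
    """Collapse every element with the given class to its plain text."""
    # Materialize the iterator first, because the tree is modified below.
    for node in list(root.iter()):
        if node.get("class") == css_class:
            text = " ".join("".join(node.itertext()).split())
            for child in list(node):
                node.remove(child)
            node.text = text
    return root


root = flatten(ET.fromstring(HOCR), "ocr_line")
line = root.find(".//span[@class='ocr_line']")
print(len(line), repr(line.text))
```

A check of this kind states the intent of the test (no markup below the chosen level) instead of relying on the exact output size.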
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
#!/usr/bin/env tsht | ||
|
||
- for f in check combine eval eval-geom eval-lines extract-g1000 extract-images lines merge-dc pdf split;do | ||
+ for f in check combine eval eval-geom eval-lines extract-g1000 extract-images lines merge-dc pdf split simplify;do | ||
exec_ok "hocr-$f" "--help" | ||
exec_ok "hocr-$f" "-h" | ||
done |
Review comment: It would also be nice to have an option to delete the id and/or dir parameters, but they are on their own.
Review comment: Removing attributes is now implemented.
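The page does not show how attribute removal was implemented. As a rough illustration of the idea only, the sketch below drops `id` and `dir` from every element using the standard-library `xml.etree.ElementTree`. The function name `remove_attributes` and the sample markup are assumptions, not taken from the PR.

```python
import xml.etree.ElementTree as ET


def remove_attributes(root, names):
    """Delete the named attributes from every element in the tree."""
    for node in root.iter():
        for name in names:
            node.attrib.pop(name, None)  # ignore elements lacking the attribute
    return root


# Made-up fragment for illustration only.
root = ET.fromstring(
    '<div class="ocr_page" id="page_1" dir="ltr" title="bbox 0 0 10 10">'
    '<span class="ocr_line" id="line_1_1">text</span></div>'
)
remove_attributes(root, ["id", "dir"])
print(ET.tostring(root, encoding="unicode"))
```

Unlike hOCR properties, which live inside the `title` string, `id` and `dir` are ordinary attributes, so they can be removed directly without parsing `title`.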