update recovery #7259
Thanks for your contribution!
ppocr/utils/utility.py
Outdated
import fitz
from PIL import Image
imgs = []
pdf = fitz.open(img_path)
Consider writing it as shown below, so that pdf.close() does not have to be called manually:
with fitz.open(img_path) as pdf:
    for pg in range(0, .....
        ....
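A minimal sketch of why the `with` form is preferable: the document is closed on exit even when the body raises. `FakeDocument`, `read_pages`, and `read_pages_failing` are hypothetical stand-ins written for illustration; they are not PyMuPDF's API and not the code from this PR.

```python
class FakeDocument:
    """Hypothetical stand-in for a PDF document object (not the PyMuPDF API)."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.closed = False

    def __len__(self):
        return len(self.pages)

    def __getitem__(self, index):
        return self.pages[index]

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always release the document; returning False re-raises any exception.
        self.close()
        return False


def read_pages(doc):
    """Collect every page; the caller never has to call close() itself."""
    with doc as pdf:
        return [pdf[pg] for pg in range(len(pdf))]


def read_pages_failing(doc):
    """Simulate an error in the loop body to show the document is still closed."""
    with doc as pdf:
        raise ValueError("simulated rendering failure on a %d-page file" % len(pdf))
```

With a manual `pdf.close()` at the end of the function, the failing case would skip the call and leave the file open.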
ppocr/utils/utility.py
Outdated
mat = fitz.Matrix(2, 2)
pm = page.getPixmap(matrix=mat, alpha=False)

if pm.width>2000 or pm.height>2000:
Please run pre-commit before submitting the code. Also, add a comment explaining the 2000 here; otherwise it is unclear why this value was chosen.
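One way to address the magic number is a named, commented constant. In the sketch below the value 2000 comes from the diff, but the name `MAX_RENDERED_SIDE`, the helper `choose_zoom`, and the stated rationale (fall back to no enlargement once the rendered page is already large) are assumptions; the diff excerpt does not show what the branch actually does.

```python
# Hypothetical name. The value 2000 is taken from the diff; the rationale in
# this comment is an assumption: pages that already render larger than this
# many pixels on a side are not enlarged further.
MAX_RENDERED_SIDE = 2000


def choose_zoom(width, height, default_zoom=2, max_side=MAX_RENDERED_SIDE):
    """Pick a zoom factor from the size of the page rendered at default_zoom.

    Returns 1 (no enlargement) when either rendered side exceeds max_side,
    otherwise default_zoom.
    """
    if width > max_side or height > max_side:
        return 1
    return default_zoom
```

Whatever the real reason for the threshold is, stating it next to the constant answers the question raised in the review.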
ppstructure/predict_system.py
Outdated
from ppstructure.recovery.recovery_to_doc import convert_info_docx
convert_info_docx(img, res, save_folder, img_name, args.save_pdf)
except:
    logger.error("error in layout recovery image:{}".format(image_file))
Print the specific error message here as well, to make the problem easier to locate:
except Exception as ex:
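A minimal sketch of the suggested change, using only the standard `logging` module. The wrapper `convert_safely`, its `convert` parameter, and the logger name are hypothetical; the message format extends the one in the diff with the exception text.

```python
import logging

logger = logging.getLogger("recovery_example")  # hypothetical logger name


def convert_safely(convert, image_file):
    """Run a conversion callable and log the concrete error if it fails.

    `convert` stands in for the call to convert_info_docx(...).
    Returns True on success, False on failure.
    """
    try:
        convert()
        return True
    except Exception as ex:
        # Include the exception itself so the failure can be located.
        logger.error("error in layout recovery image:{}, err msg: {}".format(
            image_file, ex))
        return False
```

Catching `Exception` rather than using a bare `except:` also lets `KeyboardInterrupt` and `SystemExit` propagate.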
ppstructure/predict_system.py
Outdated
if args.recovery and all_res != []:
    try:
        convert_info_docx(img, all_res, save_folder, img_name, args.save_pdf)
    except:
Same as above.
@@ -46,7 +47,7 @@ def convert_info_docx(img, res, save_folder, img_name):
section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '2')
flag = 2

if region['type'] == 'Figure':
if region['type'] == 'figure':
How about adding a .lower() when handling this?
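A small sketch of the case-insensitive comparison. The helper name `is_figure` is hypothetical; the `region['type']` key is taken from the diff.

```python
def is_figure(region):
    """Return True for a figure region regardless of label capitalization."""
    return region['type'].lower() == 'figure'
```

This way both the old label `'Figure'` and the new label `'figure'` are accepted, so the code does not depend on which casing the layout model emits.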
Tell HTMLParser to ignore any tags until the corresponding closing table tag
"""
doc = Document()
table_soup = BeautifulSoup(html, 'html.parser')
Does this file name have to be written this way, or is it a setting that can be passed in?
user can pass existing document object as arg
(if they want to manage rest of document themselves)
How to deal with block level style applied over table elements? e.g. text align
"""
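Please add the license header. Also, if this is based on another implementation, add a reference to it at the top of the file; see how references are done in ppocr/modeling/backbone/*.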
ppstructure/predict_system.py
Outdated
for index, img in enumerate(imgs):
    res, time_dict = structure_sys(img, str(index))
    if structure_sys.mode == 'structure' and res != []:
        save_structure_res(res, save_folder, img_name, str(index))
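Is there any need to convert this index to a string separately? Using the int directly should be fine; please also change the default value of the corresponding function.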
ppstructure/predict_system.py
Outdated
@@ -215,27 +218,74 @@ def main(args):
for i, image_file in enumerate(image_file_list):
    logger.info("[{}/{}] {}".format(i, img_num, image_file))
    img, flag = check_and_read_gif(image_file)
    imgs, flag_pdf = check_and_read_pdf(image_file)
Create a new function, check_and_read, that checks for both GIF and PDF internally, to avoid making two separate calls out here.
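A rough sketch of what such a combined helper could look like. Only the name `check_and_read` comes from the review; dispatching on the file extension, the injected `read_gif` and `read_pdf` callables, and the `(data, flag_gif, flag_pdf)` return shape are assumptions made so the example runs without OpenCV or PyMuPDF.

```python
import os


def check_and_read(img_path, read_gif, read_pdf):
    """Decide once whether img_path is a GIF or a PDF and read it.

    read_gif and read_pdf are hypothetical reader callables taking a path.
    Returns (data, flag_gif, flag_pdf); data is None when the file is
    neither a GIF nor a PDF, so the caller can fall back to a normal read.
    """
    ext = os.path.splitext(img_path)[1].lower()
    if ext == '.gif':
        return read_gif(img_path), True, False
    if ext == '.pdf':
        return read_pdf(img_path), False, True
    return None, False, False
```

The caller then makes a single call per file, for example `img, flag_gif, flag_pdf = check_and_read(image_file, read_gif, read_pdf)`, and branches on the two flags.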
img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
imgs.append(img)
return imgs, False, True
Would returning imgs, True be enough here? It currently returns two flags, so the conditional below will always be false.
Update layout recovery code