update recovery #7259

an1018 · 2022-08-19T01:47:36Z

Update the layout recovery code.

paddle-bot · 2022-08-19T01:47:39Z

Thanks for your contribution!

littletomatodonkey · 2022-08-19T01:57:55Z

ppocr/utils/utility.py

+        import fitz
+        from PIL import Image
+        imgs = []
+        pdf = fitz.open(img_path)


Try writing it as shown below, which avoids having to call pdf.close() manually:

with fitz.open(img_path) as pdf:
    for pg in range(0, ..... ....
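A minimal sketch of why the `with` form removes the need for an explicit `close()` call. It uses a hypothetical stand-in class, `FakeDoc`, rather than a real PyMuPDF document so that it runs without extra dependencies; the only assumption is that the opened object supports the `with` protocol, as the suggestion above implies for `fitz.open`. `collect_pages` is an illustrative name, not a function from this PR.

```python
class FakeDoc:
    """Hypothetical stand-in for a document object that supports `with`."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and also when an exception propagates.
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        self.closed = True


def collect_pages(doc):
    """Read every page inside a `with` block; no explicit close() is needed."""
    with doc as pdf:
        return [page.upper() for page in pdf.pages]


doc = FakeDoc(["page one", "page two"])
pages = collect_pages(doc)
```

The document is closed when the block exits, including when page processing raises, which is the case a manual `pdf.close()` at the end of the function would miss.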

littletomatodonkey · 2022-08-19T01:58:38Z

ppocr/utils/utility.py

+            mat = fitz.Matrix(2, 2)
+            pm = page.getPixmap(matrix=mat, alpha=False)
+
+            if pm.width>2000 or pm.height>2000:


Please run pre-commit before submitting. Also, add a comment explaining the 2000 here; otherwise it is unclear why this value was chosen.

littletomatodonkey · 2022-08-19T02:05:06Z

ppstructure/predict_system.py

+                    from ppstructure.recovery.recovery_to_doc import convert_info_docx
+                    convert_info_docx(img, res, save_folder, img_name, args.save_pdf) 
+                except:
+                    logger.error("error in layout recovery image:{}".format(image_file))


Print the specific error message here as well, so the problem is easier to locate:

except Exception as ex:
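A short sketch of the requested pattern: catch `Exception` by name and include it in the log line. `run_recovery` and its `convert` argument are hypothetical stand-ins for the recovery call in the diff, and the logger name and message wording are assumptions.

```python
import logging

logger = logging.getLogger("recovery_example")


def run_recovery(convert, image_file):
    """Call `convert` and log the concrete error if it fails.

    `convert` is a hypothetical stand-in for the recovery call in the diff.
    """
    try:
        convert(image_file)
    except Exception as ex:
        # Include the exception itself so the cause is visible in the log.
        logger.error("error in layout recovery image:{}, err msg: {}".format(
            image_file, ex))
        return False
    return True
```

Unlike a bare `except:`, this form does not intercept `KeyboardInterrupt` or `SystemExit`, and the log entry names the failure instead of only the file.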

littletomatodonkey · 2022-08-19T02:05:18Z

ppstructure/predict_system.py

+            if args.recovery and  all_res != []:
+                try:
+                    convert_info_docx(img, all_res, save_folder, img_name, args.save_pdf) 
+                except:


littletomatodonkey · 2022-08-19T02:07:05Z

ppstructure/recovery/recovery_to_doc.py

@@ -46,7 +47,7 @@ def convert_info_docx(img, res, save_folder, img_name):
            section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '2')
            flag = 2

-        if region['type'] == 'Figure':
+        if region['type'] == 'figure':


Add a .lower() when handling this.
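A sketch of the case-insensitive comparison being asked for, so that both the old 'Figure' and the new 'figure' labels match. `is_figure` is a hypothetical helper name; the `region['type']` key is taken from the diff.

```python
def is_figure(region):
    """Return True when the region type is 'figure', ignoring letter case."""
    return region["type"].lower() == "figure"


matches = [is_figure({"type": t}) for t in ("Figure", "figure", "Table")]
```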

littletomatodonkey · 2022-08-19T02:08:59Z

ppstructure/recovery/table_process.py

+        Tell HTMLParser to ignore any tags until the corresponding closing table tag
+        """
+        doc = Document()
+        table_soup = BeautifulSoup(html, 'html.parser')


Does this file name have to be written this way, or could it be a configuration value that is passed in?

littletomatodonkey · 2022-08-19T02:10:14Z

ppstructure/recovery/table_process.py

+user can pass existing document object as arg 
+(if they want to manage rest of document themselves)
+How to deal with block level style applied over table elements? e.g. text align
+"""


Add a license header. Also, if this is based on another implementation, add a reference at the top of the file; see how references are cited in ppocr/modeling/backbone/*.

littletomatodonkey · 2022-08-19T02:11:59Z

ppstructure/predict_system.py

+            for index, img in enumerate(imgs):
+                res, time_dict = structure_sys(img, str(index))
+                if structure_sys.mode == 'structure' and res != []:
+                    save_structure_res(res, save_folder, img_name, str(index))


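Is there any need to convert this index to a string? Using the int directly should be fine; please also update the default value of the corresponding function.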
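A small sketch of passing the index as an int with an int default. `result_file_name` and the file-name pattern are hypothetical; the sketch assumes the index is only used to build an output name, which the diff does not show.

```python
def result_file_name(img_name, img_idx=0):
    """Build an output name from an integer page index (hypothetical helper)."""
    # str.format converts the int, so callers never need str(index).
    return "{}_{}.txt".format(img_name, img_idx)


names = [result_file_name("doc", index) for index, _ in enumerate(["p0", "p1"])]
```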

WenmuZhou · 2022-08-19T09:01:41Z

ppstructure/predict_system.py

@@ -215,27 +218,74 @@ def main(args):
    for i, image_file in enumerate(image_file_list):
        logger.info("[{}/{}] {}".format(i, img_num, image_file))
        img, flag = check_and_read_gif(image_file)
+        imgs, flag_pdf = check_and_read_pdf(image_file)


Create a new function, check_and_read, that checks for both GIF and PDF internally, so the caller does not have to make two calls.
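One possible shape for such a function, as a sketch only. Only the name `check_and_read` comes from the comment above. The injected `read_gif` and `read_pdf` callables and the `(data, kind)` return value are assumptions made so the example runs without OpenCV or PyMuPDF.

```python
import os


def check_and_read(img_path, read_gif, read_pdf):
    """Dispatch on the file extension and read GIF or PDF input in one call.

    `read_gif` and `read_pdf` are hypothetical callables supplied by the
    caller; the return shape (data, kind) is an assumption of this sketch.
    """
    ext = os.path.splitext(img_path)[1].lower()
    if ext == ".gif":
        return read_gif(img_path), "gif"
    if ext == ".pdf":
        return read_pdf(img_path), "pdf"
    # Neither format: the caller falls back to its normal image reader.
    return None, None
```

Returning a single kind tag rather than one boolean per format keeps the caller's branching to one check per file.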

WenmuZhou · 2022-08-19T10:01:56Z

ppocr/utils/utility.py

+                img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
+                img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
+                imgs.append(img)
+            return imgs, False, True


Would returning imgs, True be enough here? It currently returns two flags, so the check below will always be False.

update recovery

f335e6c

littletomatodonkey reviewed Aug 19, 2022

an1018 added 3 commits August 19, 2022 10:59

update recovery

7c3a2e8

update recovery

f11f7c6

update recovery

cf01657

littletomatodonkey previously approved these changes Aug 19, 2022

WenmuZhou reviewed Aug 19, 2022

update recovery

74f6fa6

an1018 dismissed littletomatodonkey’s stale review via 74f6fa6 August 19, 2022 09:49

WenmuZhou reviewed Aug 19, 2022

littletomatodonkey approved these changes Aug 19, 2022

littletomatodonkey merged commit b7d99ac into PaddlePaddle:dygraph Aug 19, 2022
