adding validation

opendatalab · Jul 15, 2024 · 097374b · Lmillan123 · Jul 18, 2024
1 parent c1d7455
commit 097374b
Showing 11 changed files with 634 additions and 0 deletions.
diff --git a/README-zh_CN.md b/README-zh_CN.md
@@ -79,6 +79,8 @@ The PDF content extraction framework is shown in the figure below
 
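 Existing open-source models are mostly trained on Arxiv paper data, and their extraction quality falls well short of practical needs on diverse PDF documents. In contrast, our models are trained on diverse data and can adapt to extraction from many document types.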
 
+For the evaluation code and details, see [here](./assets/validation/README-zh_CN.md).
+
 <span id="layout-anchor"></span>
 ### Layout Detection
 

diff --git a/README.md b/README.md
@@ -78,6 +78,8 @@ By annotating a variety of PDF documents, we have trained robust models for `lay
 
 Existing open-source models are often trained on data from Arxiv papers and fall short when facing diverse PDF documents. In contrast, our models, trained on diverse data, are capable of adapting to various document types for extraction.
 
+For an introduction to the validation process, see [here](./assets/validation/README.md).
+
 <span id="layout-anchor"></span>
 ### Layout Detection
 

diff --git a/assets/validation/README-zh_CN.md b/assets/validation/README-zh_CN.md
@@ -0,0 +1,52 @@
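+# Validation
+
+During model iteration, we produce validation results with the validation code that each model provides in its GitHub repository. Where no suitable validation code exists, we developed our own on top of the model's code. For details, see:
+
+- Layout detection: uses [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3);
+- Formula detection: uses [YOLOv8](https://github.com/ultralytics/ultralytics);
+
+For formula recognition and optical character recognition, we use the official weights provided by [UniMERNet](https://github.com/opendatalab/UniMERNet) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) without further training or validation, so no validation code is involved.
+
+In addition, if you want to validate the output of this pipeline directly, we also provide a script for reference.
+
+The validation data cannot be made public for copyright reasons.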
+
+## Layout Detection
+
+Layout detection uses the validation code officially provided by [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3):
+
+```
+python train_net.py --config-file config.yaml --eval-only --num-gpus 8 \
+        MODEL.WEIGHTS /path/to/your/model_final.pth \
+        OUTPUT_DIR /path/to/save/dir
+```
+
+## Formula Detection
+
+For formula detection, we added validation code on top of [YOLOv8](https://github.com/ultralytics/ultralytics).
+
+First, place `./modules/yolov8/mfd_val.py` under `~/ultralytics/models/yolo/detect`; this adds the MFDValidator class.
+
+Then place the required YAML file under `~/ultralytics/cfg/mfd_dataset`; an example is provided: `./modules/yolov8/opendata.yaml`.
+
+Finally, place the validation code, `./modules/yolov8/eval_mfd.py`, directly under `~/ultralytics/`.
+
+For a run script, see `./modules/yolov8/eval_mfd_1888.sh`. The command is:
+
+```
+bash eval_mfd_1888.sh /path/to/your/trained/yolov8/weights
+```
+
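+Note that the script defaults to an image size of 1888. To change it, pass the size as the script's second argument, which is forwarded to `eval_mfd.py` as `--imsize`.
+
+## Pipeline Output Validation
+
+The format of the pipeline output is shown in the [README](../../README-zh_CN.md). Prepare your validation data in this format.
+
+We provide code for validating the pipeline output directly, along with sample data (not real data; it does not represent the actual validation results of this pipeline). Run the following command in this directory: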
+
+```
+python pdf_validation.py
+```
+
+
diff --git a/assets/validation/README.md b/assets/validation/README.md
@@ -0,0 +1,50 @@
+# Validation
+
+During model training and iteration, we validate each model with the validation code provided in its GitHub repository. Where no validation code is provided, we developed our own based on the model's code. For details, please refer to:
+
+- **Layout Detection**: Using [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3);
+- **Formula Detection**: Using [YOLOv8](https://github.com/ultralytics/ultralytics);
+
+**Formula Recognition** and **Optical Character Recognition** use the existing weights provided by [UniMERNet](https://github.com/opendatalab/UniMERNet) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) without further training, so no validation process is involved.
+
+In addition, if you wish to directly verify the results output by this pipeline, we have also provided a script for reference.
+
+Due to copyright reasons, the validation datasets cannot be made public.
+
+## Layout Detection
+
+For Layout Detection, we use the validation process officially provided in [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3):
+
+```
+python train_net.py --config-file config.yaml --eval-only --num-gpus 8 \
+        MODEL.WEIGHTS /path/to/your/model_final.pth \
+        OUTPUT_DIR /path/to/save/dir
+```
+
+## Formula Detection
+
+For Formula Detection, we have developed a validation process based on [YOLOv8](https://github.com/ultralytics/ultralytics).
+
+First, copy the Python file we provide at `./modules/yolov8/mfd_val.py` into `~/ultralytics/models/yolo/detect`. This adds a new class named MFDValidator.
+
+Second, place the required YAML file in the directory `~/ultralytics/cfg/mfd_dataset`. An example is provided at `./modules/yolov8/opendata.yaml`.
+
+Lastly, place the validation code directly in the `~/ultralytics/` directory. The validation code is located at `./modules/yolov8/eval_mfd.py`.
+
+For a run script, see `./modules/yolov8/eval_mfd_1888.sh`. The command to run is as follows:
+
+```
+bash eval_mfd_1888.sh /path/to/your/trained/yolov8/weights
+```
+
+Note that the script defaults to an image size of 1888. To change it, pass the size as the script's second argument, which is forwarded to `eval_mfd.py` as `--imsize`.
+
+## Pipeline Output Verification
+
+The format of the pipeline output is shown in the [README](../../README.md). Please prepare the validation dataset according to this format.
+
+We provide code for directly verifying the pipeline output, along with demo data (not real data; it does not represent the actual accuracy of this pipeline). Run the following command in this directory:
+
+```
+python pdf_validation.py
+```
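Reviewer note: the three file-placement steps in the Formula Detection section of this README could be scripted. The following is a minimal sketch, assuming the source and destination paths are exactly the ones the README names; `install_mfd_validation` is a hypothetical helper, not part of this commit.

```python
import shutil
from pathlib import Path


def install_mfd_validation(validation_dir, ultralytics_dir):
    """Copy the three validation files to the locations the README describes.

    Hypothetical helper: both arguments are supplied by the caller, and the
    relative paths below are taken from the README text.
    """
    src = Path(validation_dir) / "modules" / "yolov8"
    dst = Path(ultralytics_dir)
    placements = {
        "mfd_val.py": dst / "models" / "yolo" / "detect",  # adds MFDValidator
        "opendata.yaml": dst / "cfg" / "mfd_dataset",      # example dataset config
        "eval_mfd.py": dst,                                # evaluation entry point
    }
    copied = []
    for name, target_dir in placements.items():
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy(src / name, target_dir / name)))
    return copied
```

A call might look like `install_mfd_validation("assets/validation", Path.home() / "ultralytics")`. Since `eval_mfd.py` imports `MFDValidator` from `ultralytics.models.yolo.detect`, the class presumably also has to be exported from that package's `__init__.py`; the sketch does not do that.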
diff --git a/assets/validation/gt/0a03daa809992a0946a1b4c52f8bbca1e90689daa78ade949e878b3bb3c14bbf.json b/assets/validation/gt/0a03daa809992a0946a1b4c52f8bbca1e90689daa78ade949e878b3bb3c14bbf.json
@@ -0,0 +1 @@
+[{"layout_dets": [{"category_id": 1, "poly": [189.79238891601562, 1164.9248046875, 1078.8006591796875, 1164.9248046875, 1078.8006591796875, 1509.005615234375, 189.79238891601562, 1509.005615234375]}, {"category_id": 1, "poly": [194.54269409179688, 793.9232788085938, 1078.212158203125, 793.9232788085938, 1078.212158203125, 920.1719360351562, 194.54269409179688, 920.1719360351562]}, {"category_id": 1, "poly": [189.8138427734375, 1545.0101318359375, 1079.3756103515625, 1545.0101318359375, 1079.3756103515625, 1931.9613037109375, 189.8138427734375, 1931.9613037109375]}, {"category_id": 0, "poly": [136.71432495117188, 296.93731689453125, 985.5293579101562, 296.93731689453125, 985.5293579101562, 368.8143005371094, 136.71432495117188, 368.8143005371094]}, {"category_id": 1, "poly": [194.91012573242188, 956.7483520507812, 1074.348388671875, 956.7483520507812, 1074.348388671875, 1128.3656005859375, 194.91012573242188, 1128.3656005859375]}, {"category_id": 0, "poly": [156.7171630859375, 620.3768920898438, 378.1712341308594, 620.3768920898438, 378.1712341308594, 669.6041259765625, 156.7171630859375, 669.6041259765625]}, {"category_id": 1, "poly": [197.78273010253906, 718.4710693359375, 524.1154174804688, 718.4710693359375, 524.1154174804688, 754.3604736328125, 197.78273010253906, 754.3604736328125]}, {"category_id": 1, "poly": [193.27609252929688, 1968.3173828125, 1078.6533203125, 1968.3173828125, 1078.6533203125, 2095.548095703125, 193.27609252929688, 2095.548095703125]}, {"category_id": 2, "poly": [1118.4627685546875, 195.16065979003906, 1520.6173095703125, 195.16065979003906, 1520.6173095703125, 292.206298828125, 1118.4627685546875, 292.206298828125]}, {"category_id": 1, "poly": [138.43724060058594, 469.3943786621094, 386.7919616699219, 469.3943786621094, 386.7919616699219, 506.1737060546875, 138.43724060058594, 506.1737060546875]}, {"category_id": 2, "poly": [0, 0, 96.55510826807529, 0, 96.55510826807529, 2339, 0, 2339]}, {"category_id": 0, "poly": [1161.410400390625, 
1587.11328125, 1281.769775390625, 1587.11328125, 1281.769775390625, 1622.741943359375, 1161.410400390625, 1622.741943359375]}, {"category_id": 1, "poly": [1162.7374267578125, 1981.654541015625, 1563.3895263671875, 1981.654541015625, 1563.3895263671875, 2056.316162109375, 1162.7374267578125, 2056.316162109375]}, {"category_id": 5, "poly": [1151.3074931826534, 1226.826760642516, 1577.2250007348794, 1226.826760642516, 1577.2250007348794, 1576.904854081628, 1151.3074931826534, 1576.904854081628]}, {"category_id": 1, "poly": [1164.3848876953125, 1849.25146484375, 1564.921142578125, 1849.25146484375, 1564.921142578125, 1966.5108642578125, 1164.3848876953125, 1966.5108642578125]}, {"category_id": 5, "poly": [1147.6124267578125, 571.0896606445312, 1576.5280312167042, 571.0896606445312, 1576.5280312167042, 870.3014106223914, 1147.6124267578125, 870.3014106223914]}, {"category_id": 1, "poly": [1233.2554931640625, 324.8569641113281, 1455.9296875, 324.8569641113281, 1455.9296875, 360.7992858886719, 1233.2554931640625, 360.7992858886719]}, {"category_id": 1, "poly": [1161.2802734375, 1717.918701171875, 1564.3883056640625, 1717.918701171875, 1564.3883056640625, 1803.820556640625, 1161.2802734375, 1803.820556640625]}, {"category_id": 2, "poly": [137.0236358642578, 193.99758911132812, 634.6810913085938, 193.99758911132812, 634.6810913085938, 234.8846435546875, 137.0236358642578, 234.8846435546875]}, {"category_id": 1, "poly": [1166.559326171875, 1630.218505859375, 1567.7684326171875, 1630.218505859375, 1567.7684326171875, 1707.9720458984375, 1166.559326171875, 1707.9720458984375]}, {"category_id": 4, "poly": [1161.3724365234375, 879.9352416992188, 1336.1776123046875, 879.9352416992188, 1336.1776123046875, 917.6112060546875, 1161.3724365234375, 917.6112060546875]}, {"category_id": 1, "poly": [1161.092529296875, 409.3352966308594, 1336.5216064453125, 409.3352966308594, 1336.5216064453125, 459.2389221191406, 1161.092529296875, 459.2389221191406]}, {"category_id": 3, "poly": 
[1153.4383171276972, 919.8226041441173, 1566.1940917968748, 919.8226041441173, 1566.1940917968748, 1206.801025390625, 1153.4383171276972, 1206.801025390625]}, {"category_id": 1, "poly": [1435.08154296875, 2069.21826171875, 1566.265625, 2069.21826171875, 1566.265625, 2095.0615234375, 1435.08154296875, 2095.0615234375]}], "page_info": {"page_no": 0, "height": 2339, "width": 1654}}, {"layout_dets": [{"category_id": 1, "poly": [189.79238891601562, 1164.9248046875, 1078.8006591796875, 1164.9248046875, 1078.8006591796875, 1509.005615234375, 189.79238891601562, 1509.005615234375]}, {"category_id": 1, "poly": [194.54269409179688, 793.9232788085938, 1078.212158203125, 793.9232788085938, 1078.212158203125, 920.1719360351562, 194.54269409179688, 920.1719360351562]}, {"category_id": 1, "poly": [189.8138427734375, 1545.0101318359375, 1079.3756103515625, 1545.0101318359375, 1079.3756103515625, 1931.9613037109375, 189.8138427734375, 1931.9613037109375]}, {"category_id": 0, "poly": [136.71432495117188, 296.93731689453125, 985.5293579101562, 296.93731689453125, 985.5293579101562, 368.8143005371094, 136.71432495117188, 368.8143005371094]}, {"category_id": 1, "poly": [194.91012573242188, 956.7483520507812, 1074.348388671875, 956.7483520507812, 1074.348388671875, 1128.3656005859375, 194.91012573242188, 1128.3656005859375]}, {"category_id": 0, "poly": [156.7171630859375, 620.3768920898438, 378.1712341308594, 620.3768920898438, 378.1712341308594, 669.6041259765625, 156.7171630859375, 669.6041259765625]}, {"category_id": 1, "poly": [197.78273010253906, 718.4710693359375, 524.1154174804688, 718.4710693359375, 524.1154174804688, 754.3604736328125, 197.78273010253906, 754.3604736328125]}, {"category_id": 1, "poly": [193.27609252929688, 1968.3173828125, 1078.6533203125, 1968.3173828125, 1078.6533203125, 2095.548095703125, 193.27609252929688, 2095.548095703125]}, {"category_id": 2, "poly": [1118.4627685546875, 195.16065979003906, 1520.6173095703125, 195.16065979003906, 1520.6173095703125, 
292.206298828125, 1118.4627685546875, 292.206298828125]}, {"category_id": 1, "poly": [138.43724060058594, 469.3943786621094, 386.7919616699219, 469.3943786621094, 386.7919616699219, 506.1737060546875, 138.43724060058594, 506.1737060546875]}, {"category_id": 2, "poly": [0, 0, 96.55510826807529, 0, 96.55510826807529, 2339, 0, 2339]}, {"category_id": 0, "poly": [1161.410400390625, 1587.11328125, 1281.769775390625, 1587.11328125, 1281.769775390625, 1622.741943359375, 1161.410400390625, 1622.741943359375]}, {"category_id": 1, "poly": [1162.7374267578125, 1981.654541015625, 1563.3895263671875, 1981.654541015625, 1563.3895263671875, 2056.316162109375, 1162.7374267578125, 2056.316162109375]}, {"category_id": 5, "poly": [1151.3074931826534, 1226.826760642516, 1577.2250007348794, 1226.826760642516, 1577.2250007348794, 1576.904854081628, 1151.3074931826534, 1576.904854081628]}, {"category_id": 1, "poly": [1164.3848876953125, 1849.25146484375, 1564.921142578125, 1849.25146484375, 1564.921142578125, 1966.5108642578125, 1164.3848876953125, 1966.5108642578125]}, {"category_id": 5, "poly": [1147.6124267578125, 571.0896606445312, 1576.5280312167042, 571.0896606445312, 1576.5280312167042, 870.3014106223914, 1147.6124267578125, 870.3014106223914]}, {"category_id": 1, "poly": [1233.2554931640625, 324.8569641113281, 1455.9296875, 324.8569641113281, 1455.9296875, 360.7992858886719, 1233.2554931640625, 360.7992858886719]}, {"category_id": 1, "poly": [1161.2802734375, 1717.918701171875, 1564.3883056640625, 1717.918701171875, 1564.3883056640625, 1803.820556640625, 1161.2802734375, 1803.820556640625]}, {"category_id": 2, "poly": [137.0236358642578, 193.99758911132812, 634.6810913085938, 193.99758911132812, 634.6810913085938, 234.8846435546875, 137.0236358642578, 234.8846435546875]}, {"category_id": 1, "poly": [1166.559326171875, 1630.218505859375, 1567.7684326171875, 1630.218505859375, 1567.7684326171875, 1707.9720458984375, 1166.559326171875, 1707.9720458984375]}, {"category_id": 4, 
"poly": [1161.3724365234375, 879.9352416992188, 1336.1776123046875, 879.9352416992188, 1336.1776123046875, 917.6112060546875, 1161.3724365234375, 917.6112060546875]}, {"category_id": 1, "poly": [1161.092529296875, 409.3352966308594, 1336.5216064453125, 409.3352966308594, 1336.5216064453125, 459.2389221191406, 1161.092529296875, 459.2389221191406]}, {"category_id": 3, "poly": [1153.4383171276972, 919.8226041441173, 1566.1940917968748, 919.8226041441173, 1566.1940917968748, 1206.801025390625, 1153.4383171276972, 1206.801025390625]}, {"category_id": 1, "poly": [1435.08154296875, 2069.21826171875, 1566.265625, 2069.21826171875, 1566.265625, 2095.0615234375, 1435.08154296875, 2095.0615234375]}], "page_info": {"page_no": 1, "height": 2339, "width": 1654}}]
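Reviewer note: the ground-truth file above is a list of pages, each with `layout_dets` (a `category_id` plus an 8-value `poly`) and `page_info`. The following is a minimal sketch of reading that shape, assuming each `poly` lists four corners as x1, y1, ..., x4, y4 as the sample suggests; `poly_to_bbox` and `boxes_by_category` are hypothetical helpers and are not taken from `pdf_validation.py`.

```python
import json
from collections import defaultdict


def poly_to_bbox(poly):
    """Convert an 8-value polygon [x1, y1, ..., x4, y4] to [xmin, ymin, xmax, ymax]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def boxes_by_category(page):
    """Group one page's boxes by category_id."""
    grouped = defaultdict(list)
    for det in page["layout_dets"]:
        grouped[det["category_id"]].append(poly_to_bbox(det["poly"]))
    return dict(grouped)


# A one-page sample in the same shape as the ground-truth file (values shortened).
sample = json.loads("""
[{"layout_dets": [
    {"category_id": 2, "poly": [0, 0, 96.5, 0, 96.5, 2339, 0, 2339]},
    {"category_id": 1, "poly": [138.4, 469.4, 386.8, 469.4, 386.8, 506.2, 138.4, 506.2]}],
  "page_info": {"page_no": 0, "height": 2339, "width": 1654}}]
""")

for page in sample:
    print(page["page_info"]["page_no"], boxes_by_category(page))
```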
diff --git a/assets/validation/modules/yolov8/eval_mfd.py b/assets/validation/modules/yolov8/eval_mfd.py
@@ -0,0 +1,120 @@
+import os
+import json
+import argparse
+from natsort import natsorted
+from ultralytics.models.yolo.detect import MFDValidator
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--weight', type=str, default='yolov8l.yaml', help='path to a .pt weight file, or to a directory of epoch*.pt checkpoints')
+    parser.add_argument('--imsize', type=int, default=1280, help='inference image size')
+    parser.add_argument('--conf', type=float, default=0.25, help='object confidence threshold')
+    parser.add_argument('--iou', type=float, default=0.45, help='IoU threshold for NMS')
+    parser.add_argument('--cfg1', type=str, default='', help='dataset YAML file for validation')
+    parser.add_argument('--cfg2', type=str, default='', help='optional second dataset YAML file for validation')
+    args = parser.parse_args()
+    if args.weight.endswith("/"):
+        args.weight = args.weight[0:-1]
+    print(args)
+
+    if args.weight.endswith(".pt"):  # Evaluate a single model and visualize the boxes
+        model_name = os.path.basename(os.path.dirname(os.path.dirname(args.weight)))
+
+        input_args1 = dict(
+            model=args.weight,
+            data=args.cfg1,
+            imgsz=args.imsize,
+            conf=args.conf,
+            iou=args.iou)
+        eval_name1 = args.cfg1.split('.')[0]
+        vis_dir = f"/mnt/hwfile/opendatalab/ouyanglinke/PDF_Formula/vis_v8/{eval_name1}--{model_name}"
+        validator1 = MFDValidator(args=input_args1, save_dir=vis_dir)
+        res1 = validator1()
+
+        if args.cfg2:
+            input_args2 = dict(
+                model=args.weight,
+                data=args.cfg2,
+                imgsz=args.imsize,
+                conf=args.conf,
+                iou=args.iou)
+            eval_name2 = args.cfg2.split('.')[0]
+            vis_dir = f"/mnt/hwfile/opendatalab/ouyanglinke/PDF_Formula/vis_v8/{eval_name2}--{model_name}"
+            validator2 = MFDValidator(args=input_args2, save_dir=vis_dir)
+            res2 = validator2()
+        else:
+            res2 = False
+        if res1 and res2:
+            print("metrics:", [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']])
+        elif res1:
+            print("metrics:", [res1['AP50'], res1['AR50']])
+        else:
+            print("metrics:", [0, 0, 0, 0])
+
+    else:  # Evaluate multiple checkpoints without visualization and find the best one
+        best_score = -1
+        best_metrics = None
+        best_model = None
+        epoch_eval_results = {"epoch_res":{}}
+        for model_name in natsorted(os.listdir(args.weight)):
+            model_path = os.path.join(args.weight, model_name)
+            if "epoch" not in model_name:
+                continue
+            epoch = int(model_name[5:-3])
+            print("==> eval at epoch", epoch)
+
+            input_args1 = dict(
+                model=model_path,
+                data=args.cfg1,
+                imgsz=args.imsize,
+                conf=args.conf,
+                iou=args.iou)
+            validator1 = MFDValidator(args=input_args1, save_dir="runs/vis")
+            res1 = validator1(vis_box=False)
+
+            if args.cfg2:
+                input_args2 = dict(
+                    model=model_path,
+                    data=args.cfg2,
+                    imgsz=args.imsize,
+                    conf=args.conf,
+                    iou=args.iou)
+                validator2 = MFDValidator(args=input_args2, save_dir="runs/vis")
+                res2 = validator2(vis_box=False)
+            else:
+                res2 = False
+
+            if res1 and res2:
+                model_score = 0.2*res1['AP50'] + 0.3*res1['AR50'] + 0.2*res2['AP50'] + 0.3*res2['AR50']
+                epoch_eval_results["epoch_res"][epoch] = {
+                    "score": model_score, 
+                    "metrics": [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']]
+                }
+            elif res1:
+                model_score = 0.4*res1['AP50'] + 0.6*res1['AR50']
+                epoch_eval_results["epoch_res"][epoch] = {
+                    "score": model_score, 
+                    "metrics": [res1['AP50'], res1['AR50']]
+                }
+            else:
+                model_score = 0
+                epoch_eval_results["epoch_res"][epoch] = {
+                    "score": model_score, 
+                    "metrics": [0, 0, 0, 0]
+                }
+
+            if model_score > best_score:
+                best_score = model_score
+                best_model = model_name
+                if res1 and res2:
+                    best_metrics = [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']]
+                elif res1:
+                    best_metrics = [res1['AP50'], res1['AR50']]
+
+        print("best epoch:", best_model, "metrics:", best_metrics)
+
+        epoch_eval_results["best_score"] = best_score
+        epoch_eval_results["best_epoch"] = int(best_model[5:-3])
+        epoch_eval_results["best_metrics"] = best_metrics
+        with open(os.path.join(os.path.dirname(args.weight), "epoch_eval_results.json"), "w") as f:
+            f.write(json.dumps(epoch_eval_results, indent=2))
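Reviewer note: the checkpoint selection in the multi-model branch reduces to a weighted sum of AP50 and AR50 that favors recall (0.4/0.6 for one dataset, 0.2/0.3 per dataset for two). The following is a minimal sketch of that scoring in isolation, with weights and the `epochN.pt` filename slicing copied from the script; `selection_score` and `parse_epoch` are hypothetical names, and the metric values are made up for illustration.

```python
def selection_score(res1, res2=None):
    """Weighted AP50/AR50 score mirroring the checkpoint selection in eval_mfd.py."""
    if res1 and res2:
        return (0.2 * res1["AP50"] + 0.3 * res1["AR50"]
                + 0.2 * res2["AP50"] + 0.3 * res2["AR50"])
    if res1:
        return 0.4 * res1["AP50"] + 0.6 * res1["AR50"]
    return 0.0


def parse_epoch(filename):
    """Extract N from a checkpoint named 'epochN.pt' (same slicing as the script)."""
    return int(filename[5:-3])


# Made-up metric values, for illustration only.
checkpoints = {
    "epoch10.pt": {"AP50": 0.5, "AR50": 1.0},
    "epoch20.pt": {"AP50": 1.0, "AR50": 0.5},
}
best = max(checkpoints, key=lambda name: selection_score(checkpoints[name]))
print(best, parse_epoch(best))
```

Because recall carries the larger weight, the checkpoint with the higher AR50 wins here even though its AP50 is lower. Note also that the script indexes `best_model[5:-3]` unconditionally, so it appears to assume the weight directory contains at least one `epoch*.pt` file.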
diff --git a/assets/validation/modules/yolov8/eval_mfd_1888.sh b/assets/validation/modules/yolov8/eval_mfd_1888.sh
@@ -0,0 +1,8 @@
+weight=$1
+imsize=${2:-1888}
+
+srun -p s2_bigdata --gres=gpu:1 --async \
+~/anaconda3/envs/yolov8/bin/python eval_mfd.py \
+--weight "${weight}" --imsize "${imsize}" --cfg1 opendata.yaml
+
+rm -f batchscript*