adding validation
ouyanglinke committed Jul 15, 2024
1 parent c1d7455 commit 097374b
Showing 11 changed files with 634 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README-zh_CN.md
@@ -79,6 +79,8 @@ The PDF content extraction framework is shown in the figure below

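Existing open-source models are mostly trained on Arxiv-paper-style data, and their extraction quality falls far short of practical needs when faced with diverse PDF documents. In contrast, our models are trained on diverse data and can adapt to extraction from many document types.

For the evaluation code and further details, see [here](./assets/validation/README-zh_CN.md).

<span id="layout-anchor"></span>
### Layout Detection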

2 changes: 2 additions & 0 deletions README.md
@@ -78,6 +78,8 @@ By annotating a variety of PDF documents, we have trained robust models for `lay

Existing open-source models are often trained on data from Arxiv papers and fall short when facing diverse PDF documents. In contrast, our models, trained on diverse data, are capable of adapting to various document types for extraction.

An introduction to the validation process can be found [here](./assets/validation/README.md).

<span id="layout-anchor"></span>
### Layout Detection

52 changes: 52 additions & 0 deletions assets/validation/README-zh_CN.md
@@ -0,0 +1,52 @@
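# Validation

During model iteration, we produce validation results with the validation code provided in each model's own GitHub repository. Where no suitable validation code exists, we developed our own on top of that repository's code. For details, see:

- Layout detection: uses [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
- Formula detection: uses [YOLOv8](https://github.com/ultralytics/ultralytics)

For formula recognition and optical character recognition, we use the official weights provided by [UniMERNet](https://github.com/opendatalab/UniMERNet) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) without further training or validation, so no validation code is involved.

In addition, if you want to validate the output of this pipeline directly, we also provide a script for reference.

The validation data cannot be made public for copyright reasons.

## Layout Detection

Layout detection uses the official validation code provided by [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3):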

```
python train_net.py --config-file config.yaml --eval-only --num-gpus 8 \
MODEL.WEIGHTS /path/to/your/model_final.pth \
OUTPUT_DIR /path/to/save/dir
```

## Formula Detection

For formula detection, we added validation code on top of [YOLOv8](https://github.com/ultralytics/ultralytics).

First, place `./modules/yolov8/mfd_val.py` under `~/ultralytics/models/yolo/detect`; this adds the MFDValidator class.

Next, place the required YAML file under `~/ultralytics/cfg/mfd_dataset`. An example is provided: `./modules/yolov8/opendata.yaml`.

Finally, place the validation code, `./modules/yolov8/eval_mfd.py`, directly under `~/ultralytics/`.

For a run script, see `./modules/yolov8/eval_mfd_1888.sh`. The command is:

```
bash eval_mfd_1888.sh /path/to/your/trained/yolov8/weights
```

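Note that the image size defaults to 1888 here. `eval_mfd_1888.sh` accepts it as an optional second argument and passes it to `eval_mfd.py` as `--imsize`.

## Pipeline Output Validation

The Pipeline output format is shown in the [README](../../README-zh_CN.md); please prepare the validation data in this format.

We provide code for validating the Pipeline output directly, along with sample data (not real data; it does not represent the actual validation results of this pipeline). Run the following command in this directory: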

```
python pdf_validation.py
```


50 changes: 50 additions & 0 deletions assets/validation/README.md
@@ -0,0 +1,50 @@
# Validation

During model training and iteration, we evaluate each trained model with the validation process provided in that model's GitHub repository. Where no validation code is provided, we developed our own based on the repository's code. For details, see:

- **Layout Detection**: uses [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
- **Formula Detection**: uses [YOLOv8](https://github.com/ultralytics/ultralytics)

For **Formula Recognition** and **Optical Character Recognition**, we use the existing weights provided by [UniMERNet](https://github.com/opendatalab/UniMERNet) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) without further training, so no validation process is involved.

In addition, if you wish to validate the output of this pipeline directly, we also provide a script for reference.

Due to copyright reasons, the validation datasets cannot be made public.

## Layout Detection

For Layout Detection, we use the validation process officially provided by [LayoutLMv3](https://github.com/microsoft/unilm/tree/master/layoutlmv3):

```
python train_net.py --config-file config.yaml --eval-only --num-gpus 8 \
MODEL.WEIGHTS /path/to/your/model_final.pth \
OUTPUT_DIR /path/to/save/dir
```

## Formula Detection

For Formula Detection, we developed a validation process on top of [YOLOv8](https://github.com/ultralytics/ultralytics).

First, copy the Python file `./modules/yolov8/mfd_val.py` into `~/ultralytics/models/yolo/detect`. This adds a new class named MFDValidator.

Second, place the required YAML file in the directory `~/ultralytics/cfg/mfd_dataset`. An example is provided: `./modules/yolov8/opendata.yaml`.

Lastly, place the validation code, located at `./modules/yolov8/eval_mfd.py`, directly in the `~/ultralytics/` directory.
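
The three placement steps above could be scripted roughly as follows. This is only a sketch: the helper name `install_mfd_validation` is hypothetical, and the destination layout simply mirrors the paths quoted in this README, so adjust it if your ultralytics checkout is organized differently.

```python
import shutil
from pathlib import Path


def install_mfd_validation(src_dir, ultralytics_dir):
    """Copy the three validation files into an ultralytics checkout.

    Hypothetical helper; destinations follow the paths named in this README.
    """
    src = Path(src_dir).expanduser()
    dst = Path(ultralytics_dir).expanduser()
    plan = {
        # step 1: the validator class
        src / "mfd_val.py": dst / "models" / "yolo" / "detect" / "mfd_val.py",
        # step 2: the dataset YAML
        src / "opendata.yaml": dst / "cfg" / "mfd_dataset" / "opendata.yaml",
        # step 3: the evaluation entry point
        src / "eval_mfd.py": dst / "eval_mfd.py",
    }
    copied = []
    for source, target in plan.items():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

A call such as `install_mfd_validation("./modules/yolov8", "~/ultralytics")` would perform all three copies. Since `eval_mfd.py` imports `MFDValidator` from `ultralytics.models.yolo.detect`, that package presumably also has to export the class; that step is not covered here and is an assumption.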

For a run script, see `./modules/yolov8/eval_mfd_1888.sh`. The command is:

```
bash eval_mfd_1888.sh /path/to/your/trained/yolov8/weights
```

Note that the image size defaults to 1888 here. `eval_mfd_1888.sh` accepts it as an optional second argument and passes it to `eval_mfd.py` as `--imsize`.

## Pipeline Output Verification

The format of the Pipeline output is shown in the [README](../../README.md); please prepare the validation dataset according to this format.

We provide code for directly verifying the Pipeline output, along with demo data (not real data; it does not represent the actual accuracy of this pipeline). Run the following command in this directory:

```
python pdf_validation.py
```
@@ -0,0 +1 @@
[{"layout_dets": [{"category_id": 1, "poly": [189.79238891601562, 1164.9248046875, 1078.8006591796875, 1164.9248046875, 1078.8006591796875, 1509.005615234375, 189.79238891601562, 1509.005615234375]}, {"category_id": 1, "poly": [194.54269409179688, 793.9232788085938, 1078.212158203125, 793.9232788085938, 1078.212158203125, 920.1719360351562, 194.54269409179688, 920.1719360351562]}, {"category_id": 1, "poly": [189.8138427734375, 1545.0101318359375, 1079.3756103515625, 1545.0101318359375, 1079.3756103515625, 1931.9613037109375, 189.8138427734375, 1931.9613037109375]}, {"category_id": 0, "poly": [136.71432495117188, 296.93731689453125, 985.5293579101562, 296.93731689453125, 985.5293579101562, 368.8143005371094, 136.71432495117188, 368.8143005371094]}, {"category_id": 1, "poly": [194.91012573242188, 956.7483520507812, 1074.348388671875, 956.7483520507812, 1074.348388671875, 1128.3656005859375, 194.91012573242188, 1128.3656005859375]}, {"category_id": 0, "poly": [156.7171630859375, 620.3768920898438, 378.1712341308594, 620.3768920898438, 378.1712341308594, 669.6041259765625, 156.7171630859375, 669.6041259765625]}, {"category_id": 1, "poly": [197.78273010253906, 718.4710693359375, 524.1154174804688, 718.4710693359375, 524.1154174804688, 754.3604736328125, 197.78273010253906, 754.3604736328125]}, {"category_id": 1, "poly": [193.27609252929688, 1968.3173828125, 1078.6533203125, 1968.3173828125, 1078.6533203125, 2095.548095703125, 193.27609252929688, 2095.548095703125]}, {"category_id": 2, "poly": [1118.4627685546875, 195.16065979003906, 1520.6173095703125, 195.16065979003906, 1520.6173095703125, 292.206298828125, 1118.4627685546875, 292.206298828125]}, {"category_id": 1, "poly": [138.43724060058594, 469.3943786621094, 386.7919616699219, 469.3943786621094, 386.7919616699219, 506.1737060546875, 138.43724060058594, 506.1737060546875]}, {"category_id": 2, "poly": [0, 0, 96.55510826807529, 0, 96.55510826807529, 2339, 0, 2339]}, {"category_id": 0, "poly": [1161.410400390625, 
1587.11328125, 1281.769775390625, 1587.11328125, 1281.769775390625, 1622.741943359375, 1161.410400390625, 1622.741943359375]}, {"category_id": 1, "poly": [1162.7374267578125, 1981.654541015625, 1563.3895263671875, 1981.654541015625, 1563.3895263671875, 2056.316162109375, 1162.7374267578125, 2056.316162109375]}, {"category_id": 5, "poly": [1151.3074931826534, 1226.826760642516, 1577.2250007348794, 1226.826760642516, 1577.2250007348794, 1576.904854081628, 1151.3074931826534, 1576.904854081628]}, {"category_id": 1, "poly": [1164.3848876953125, 1849.25146484375, 1564.921142578125, 1849.25146484375, 1564.921142578125, 1966.5108642578125, 1164.3848876953125, 1966.5108642578125]}, {"category_id": 5, "poly": [1147.6124267578125, 571.0896606445312, 1576.5280312167042, 571.0896606445312, 1576.5280312167042, 870.3014106223914, 1147.6124267578125, 870.3014106223914]}, {"category_id": 1, "poly": [1233.2554931640625, 324.8569641113281, 1455.9296875, 324.8569641113281, 1455.9296875, 360.7992858886719, 1233.2554931640625, 360.7992858886719]}, {"category_id": 1, "poly": [1161.2802734375, 1717.918701171875, 1564.3883056640625, 1717.918701171875, 1564.3883056640625, 1803.820556640625, 1161.2802734375, 1803.820556640625]}, {"category_id": 2, "poly": [137.0236358642578, 193.99758911132812, 634.6810913085938, 193.99758911132812, 634.6810913085938, 234.8846435546875, 137.0236358642578, 234.8846435546875]}, {"category_id": 1, "poly": [1166.559326171875, 1630.218505859375, 1567.7684326171875, 1630.218505859375, 1567.7684326171875, 1707.9720458984375, 1166.559326171875, 1707.9720458984375]}, {"category_id": 4, "poly": [1161.3724365234375, 879.9352416992188, 1336.1776123046875, 879.9352416992188, 1336.1776123046875, 917.6112060546875, 1161.3724365234375, 917.6112060546875]}, {"category_id": 1, "poly": [1161.092529296875, 409.3352966308594, 1336.5216064453125, 409.3352966308594, 1336.5216064453125, 459.2389221191406, 1161.092529296875, 459.2389221191406]}, {"category_id": 3, "poly": 
[1153.4383171276972, 919.8226041441173, 1566.1940917968748, 919.8226041441173, 1566.1940917968748, 1206.801025390625, 1153.4383171276972, 1206.801025390625]}, {"category_id": 1, "poly": [1435.08154296875, 2069.21826171875, 1566.265625, 2069.21826171875, 1566.265625, 2095.0615234375, 1435.08154296875, 2095.0615234375]}], "page_info": {"page_no": 0, "height": 2339, "width": 1654}}, {"layout_dets": [{"category_id": 1, "poly": [189.79238891601562, 1164.9248046875, 1078.8006591796875, 1164.9248046875, 1078.8006591796875, 1509.005615234375, 189.79238891601562, 1509.005615234375]}, {"category_id": 1, "poly": [194.54269409179688, 793.9232788085938, 1078.212158203125, 793.9232788085938, 1078.212158203125, 920.1719360351562, 194.54269409179688, 920.1719360351562]}, {"category_id": 1, "poly": [189.8138427734375, 1545.0101318359375, 1079.3756103515625, 1545.0101318359375, 1079.3756103515625, 1931.9613037109375, 189.8138427734375, 1931.9613037109375]}, {"category_id": 0, "poly": [136.71432495117188, 296.93731689453125, 985.5293579101562, 296.93731689453125, 985.5293579101562, 368.8143005371094, 136.71432495117188, 368.8143005371094]}, {"category_id": 1, "poly": [194.91012573242188, 956.7483520507812, 1074.348388671875, 956.7483520507812, 1074.348388671875, 1128.3656005859375, 194.91012573242188, 1128.3656005859375]}, {"category_id": 0, "poly": [156.7171630859375, 620.3768920898438, 378.1712341308594, 620.3768920898438, 378.1712341308594, 669.6041259765625, 156.7171630859375, 669.6041259765625]}, {"category_id": 1, "poly": [197.78273010253906, 718.4710693359375, 524.1154174804688, 718.4710693359375, 524.1154174804688, 754.3604736328125, 197.78273010253906, 754.3604736328125]}, {"category_id": 1, "poly": [193.27609252929688, 1968.3173828125, 1078.6533203125, 1968.3173828125, 1078.6533203125, 2095.548095703125, 193.27609252929688, 2095.548095703125]}, {"category_id": 2, "poly": [1118.4627685546875, 195.16065979003906, 1520.6173095703125, 195.16065979003906, 1520.6173095703125, 
292.206298828125, 1118.4627685546875, 292.206298828125]}, {"category_id": 1, "poly": [138.43724060058594, 469.3943786621094, 386.7919616699219, 469.3943786621094, 386.7919616699219, 506.1737060546875, 138.43724060058594, 506.1737060546875]}, {"category_id": 2, "poly": [0, 0, 96.55510826807529, 0, 96.55510826807529, 2339, 0, 2339]}, {"category_id": 0, "poly": [1161.410400390625, 1587.11328125, 1281.769775390625, 1587.11328125, 1281.769775390625, 1622.741943359375, 1161.410400390625, 1622.741943359375]}, {"category_id": 1, "poly": [1162.7374267578125, 1981.654541015625, 1563.3895263671875, 1981.654541015625, 1563.3895263671875, 2056.316162109375, 1162.7374267578125, 2056.316162109375]}, {"category_id": 5, "poly": [1151.3074931826534, 1226.826760642516, 1577.2250007348794, 1226.826760642516, 1577.2250007348794, 1576.904854081628, 1151.3074931826534, 1576.904854081628]}, {"category_id": 1, "poly": [1164.3848876953125, 1849.25146484375, 1564.921142578125, 1849.25146484375, 1564.921142578125, 1966.5108642578125, 1164.3848876953125, 1966.5108642578125]}, {"category_id": 5, "poly": [1147.6124267578125, 571.0896606445312, 1576.5280312167042, 571.0896606445312, 1576.5280312167042, 870.3014106223914, 1147.6124267578125, 870.3014106223914]}, {"category_id": 1, "poly": [1233.2554931640625, 324.8569641113281, 1455.9296875, 324.8569641113281, 1455.9296875, 360.7992858886719, 1233.2554931640625, 360.7992858886719]}, {"category_id": 1, "poly": [1161.2802734375, 1717.918701171875, 1564.3883056640625, 1717.918701171875, 1564.3883056640625, 1803.820556640625, 1161.2802734375, 1803.820556640625]}, {"category_id": 2, "poly": [137.0236358642578, 193.99758911132812, 634.6810913085938, 193.99758911132812, 634.6810913085938, 234.8846435546875, 137.0236358642578, 234.8846435546875]}, {"category_id": 1, "poly": [1166.559326171875, 1630.218505859375, 1567.7684326171875, 1630.218505859375, 1567.7684326171875, 1707.9720458984375, 1166.559326171875, 1707.9720458984375]}, {"category_id": 4, 
"poly": [1161.3724365234375, 879.9352416992188, 1336.1776123046875, 879.9352416992188, 1336.1776123046875, 917.6112060546875, 1161.3724365234375, 917.6112060546875]}, {"category_id": 1, "poly": [1161.092529296875, 409.3352966308594, 1336.5216064453125, 409.3352966308594, 1336.5216064453125, 459.2389221191406, 1161.092529296875, 459.2389221191406]}, {"category_id": 3, "poly": [1153.4383171276972, 919.8226041441173, 1566.1940917968748, 919.8226041441173, 1566.1940917968748, 1206.801025390625, 1153.4383171276972, 1206.801025390625]}, {"category_id": 1, "poly": [1435.08154296875, 2069.21826171875, 1566.265625, 2069.21826171875, 1566.265625, 2095.0615234375, 1435.08154296875, 2095.0615234375]}], "page_info": {"page_no": 1, "height": 2339, "width": 1654}}]
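
The demo data above is a list of pages, each with a `layout_dets` list (`category_id` plus an 8-value `poly`) and a `page_info` record. A minimal sketch of reading that shape might look like this; the helper names are hypothetical, and the inline sample uses made-up coordinates, not values from the demo file.

```python
import json
from collections import Counter


def poly_to_bbox(poly):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] to (xmin, ymin, xmax, ymax)."""
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def summarize_pages(pages):
    """Map each page_no to a Counter of category_id occurrences."""
    summary = {}
    for page in pages:
        page_no = page["page_info"]["page_no"]
        summary[page_no] = Counter(det["category_id"] for det in page["layout_dets"])
    return summary


# Made-up sample in the same shape as the demo data.
sample = json.loads("""
[{"layout_dets": [
    {"category_id": 1, "poly": [10, 20, 110, 20, 110, 70, 10, 70]},
    {"category_id": 0, "poly": [10, 5, 60, 5, 60, 15, 10, 15]}],
  "page_info": {"page_no": 0, "height": 2339, "width": 1654}}]
""")

print(poly_to_bbox(sample[0]["layout_dets"][0]["poly"]))  # (10, 20, 110, 70)
print(summarize_pages(sample))
```

Reducing the polygon to an axis-aligned box is enough for the rectangles in the demo data, where each `poly` lists the four corners of a box.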
120 changes: 120 additions & 0 deletions assets/validation/modules/yolov8/eval_mfd.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
import os
import json
import argparse
from natsort import natsorted
from ultralytics.models.yolo.detect import MFDValidator

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weight', type=str, default='yolov8l.yaml', help='weights: a single .pt file, or a directory of epoch<N>.pt checkpoints')
    parser.add_argument('--imsize', type=int, default=1280, help='image sizes')
    parser.add_argument('--conf', type=float, default=0.25, help='object confidence threshold')
    parser.add_argument('--iou', type=float, default=0.45, help='IOU threshold for NMS')
    parser.add_argument('--cfg1', type=str, default='', help='Yaml file of validation')
    parser.add_argument('--cfg2', type=str, default='', help='Yaml file of validation')
    args = parser.parse_args()
    if args.weight.endswith("/"):
        args.weight = args.weight[0:-1]
    print(args)

    if args.weight.endswith(".pt"):  # evaluate a single model and visualize its boxes
        model_name = os.path.basename(os.path.dirname(os.path.dirname(args.weight)))

        input_args1 = dict(
            model=args.weight,
            data=args.cfg1,
            imgsz=args.imsize,
            conf=args.conf,
            iou=args.iou)
        eval_name1 = args.cfg1.split('.')[0]
        # NOTE: cluster-specific output path; change it for your environment.
        vis_dir = f"/mnt/hwfile/opendatalab/ouyanglinke/PDF_Formula/vis_v8/{eval_name1}--{model_name}"
        validator1 = MFDValidator(args=input_args1, save_dir=vis_dir)
        res1 = validator1()

        if args.cfg2:
            input_args2 = dict(
                model=args.weight,
                data=args.cfg2,
                imgsz=args.imsize,
                conf=args.conf,
                iou=args.iou)
            eval_name2 = args.cfg2.split('.')[0]
            vis_dir = f"/mnt/hwfile/opendatalab/ouyanglinke/PDF_Formula/vis_v8/{eval_name2}--{model_name}"
            validator2 = MFDValidator(args=input_args2, save_dir=vis_dir)
            res2 = validator2()
        else:
            res2 = False
        if res1 and res2:
            print("metrics:", [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']])
        elif res1:
            print("metrics:", [res1['AP50'], res1['AR50']])
        else:
            print("metrics:", [0, 0, 0, 0])

    else:  # evaluate every epoch checkpoint without visualization and find the best one
        best_score = -1
        best_metrics = None
        best_model = None
        epoch_eval_results = {"epoch_res": {}}
        for model_name in natsorted(os.listdir(args.weight)):
            model_path = os.path.join(args.weight, model_name)
            if "epoch" not in model_name:
                continue
            # checkpoint names look like "epoch<N>.pt"
            epoch = int(model_name[5:-3])
            print("==> eval at epoch", epoch)

            input_args1 = dict(
                model=model_path,
                data=args.cfg1,
                imgsz=args.imsize,
                conf=args.conf,
                iou=args.iou)
            validator1 = MFDValidator(args=input_args1, save_dir="runs/vis")
            res1 = validator1(vis_box=False)

            if args.cfg2:
                input_args2 = dict(
                    model=model_path,
                    data=args.cfg2,
                    imgsz=args.imsize,
                    conf=args.conf,
                    iou=args.iou)
                validator2 = MFDValidator(args=input_args2, save_dir="runs/vis")
                res2 = validator2(vis_box=False)
            else:
                res2 = False

            # weighted score: recall (AR50) counts more than precision (AP50)
            if res1 and res2:
                model_score = 0.2*res1['AP50'] + 0.3*res1['AR50'] + 0.2*res2['AP50'] + 0.3*res2['AR50']
                epoch_eval_results["epoch_res"][epoch] = {
                    "score": model_score,
                    "metrics": [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']]
                }
            elif res1:
                model_score = 0.4*res1['AP50'] + 0.6*res1['AR50']
                epoch_eval_results["epoch_res"][epoch] = {
                    "score": model_score,
                    "metrics": [res1['AP50'], res1['AR50']]
                }
            else:
                model_score = 0
                epoch_eval_results["epoch_res"][epoch] = {
                    "score": model_score,
                    "metrics": [0, 0, 0, 0]
                }

            if model_score > best_score:
                best_score = model_score
                best_model = model_name
                if res1 and res2:
                    best_metrics = [res1['AP50'], res1['AR50'], res2['AP50'], res2['AR50']]
                elif res1:
                    best_metrics = [res1['AP50'], res1['AR50']]

        print("best epoch:", best_model, "metrics:", best_metrics)

        epoch_eval_results["best_score"] = best_score
        # best_model stays None when the directory holds no epoch checkpoints
        epoch_eval_results["best_epoch"] = int(best_model[5:-3]) if best_model else None
        epoch_eval_results["best_metrics"] = best_metrics
        with open(os.path.join(os.path.dirname(args.weight), "epoch_eval_results.json"), "w") as f:
            f.write(json.dumps(epoch_eval_results, indent=2))
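
The checkpoint-selection rule in the script above can be isolated into a few small functions, which makes the weighting easier to inspect. This is a sketch rather than part of the commit: the function names are hypothetical and the metric values in the example are made up.

```python
def weighted_score(res1, res2=None):
    """Same weighting as eval_mfd.py: AR50 counts more than AP50."""
    if res1 and res2:
        return (0.2 * res1['AP50'] + 0.3 * res1['AR50']
                + 0.2 * res2['AP50'] + 0.3 * res2['AR50'])
    if res1:
        return 0.4 * res1['AP50'] + 0.6 * res1['AR50']
    return 0.0


def epoch_from_name(name):
    """Parse 'epoch12.pt' -> 12, using the same slice as eval_mfd.py."""
    return int(name[5:-3])


def pick_best(results):
    """results maps checkpoint name -> (res1, res2); returns (name, score)."""
    best_name, best_score = None, -1
    for name, (res1, res2) in results.items():
        score = weighted_score(res1, res2)
        if score > best_score:  # strict '>' keeps the earliest of tied epochs
            best_name, best_score = name, score
    return best_name, best_score


# Made-up metrics, for illustration only.
demo = {
    "epoch1.pt": ({'AP50': 0.5, 'AR50': 0.5}, None),
    "epoch2.pt": ({'AP50': 0.9, 'AR50': 0.8}, None),
}
best_name, best_score = pick_best(demo)
print(best_name, epoch_from_name(best_name))  # epoch2.pt 2
```

With one validation set the weights are 0.4 and 0.6; with two sets each weight is halved, so the score stays on the same 0 to 1 scale.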
8 changes: 8 additions & 0 deletions assets/validation/modules/yolov8/eval_mfd_1888.sh
@@ -0,0 +1,8 @@
weight=$1
imsize=${2:-1888}

srun -p s2_bigdata --gres=gpu:1 --async \
~/anaconda3/envs/yolov8/bin/python eval_mfd.py \
--weight ${weight} --imsize ${imsize} --cfg1 opendata.yaml

rm batchscript*
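
The wrapper above submits the job through `srun` on a specific cluster partition. On a machine without Slurm, the same `eval_mfd.py` invocation could be assembled as sketched below. The helper names are hypothetical, and using the current interpreter (`sys.executable`) in place of the conda environment's Python is an assumption.

```python
import subprocess
import sys


def build_eval_command(weight, imsize=1888, cfg1="opendata.yaml"):
    """Assemble the eval_mfd.py call made by eval_mfd_1888.sh, minus srun."""
    return [
        sys.executable, "eval_mfd.py",
        "--weight", str(weight),
        "--imsize", str(imsize),
        "--cfg1", cfg1,
    ]


def run_eval(weight, imsize=1888, cwd="."):
    """Run the evaluation in `cwd`, which is expected to contain eval_mfd.py."""
    return subprocess.run(build_eval_command(weight, imsize), cwd=cwd, check=True)
```

Passing the command as a list avoids shell quoting problems when the weight path contains spaces.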