Add README.md for Document Intelligence (PaddlePaddle#3498)

* Update README.md
Macvh · Oct 18, 2022 · 0a402ca
1 parent a12481f
commit 0a402ca

Showing 33 changed files with 319 additions and 62 deletions.
diff --git a/applications/doc_vqa/Extraction/run_test.sh b/applications/doc_vqa/Extraction/run_test.sh
diff --git a/applications/doc_vqa/Extraction/run_train.sh b/applications/doc_vqa/Extraction/run_train.sh
diff --git a/applications/doc_vqa/run_test.sh b/applications/doc_vqa/run_test.sh
diff --git a/applications/document_intelligence/README.md b/applications/document_intelligence/README.md
@@ -0,0 +1,186 @@
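+# Document Intelligence Applications
+
+**Contents**
+- [1. Introduction to Document Intelligence Applications](#文档智能应用简介)
+- [2. Technical Highlights](#技术特色介绍)
+  - [2.1 Multilingual Cross-modal Training Foundation](#多语言跨模态训练基座)
+  - [2.2 Multi-scenario Coverage](#多场景覆盖)
+- [3. Quick Start](#快速开始)
+  - [3.1 Out of the Box](#开箱即用)
+  - [3.2 Industrial-grade Pipeline Solutions](#产业级流程方案)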
+
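+## 1. Introduction to Document Intelligence Applications
+
+Document Intelligence (DI) mainly refers to the process of **using artificial intelligence to understand, classify, extract, and summarize the text and rich layout information contained in web pages, digital documents, or scanned documents**. DI technology is widely used in finance, insurance, energy, logistics, healthcare, and other industries. Common application scenarios include key information extraction, document parsing, and document comparison for multimodal documents such as expense reimbursement forms, résumés, corporate financial reports, contracts, movable property registration certificates, legal judgments, and logistics documents.
+
+Real-world applications must cope with several difficulties at once: complex document formats, diverse layouts, multiple information modalities, open-ended requirements, and scarce business data. To address these pain points, PaddleNLP will continue to open-source a series of industrial practice examples that solve developers' real application problems.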
+
+<div align="center">
+    <img width="1000" height="270" alt="General workflow of document intelligence technology" src="https://user-images.githubusercontent.com/40840292/196361583-6b1c66d1-6a9b-4193-949a-71e2d420a82a.png">
+</div>
+
+<a name="技术特色介绍"></a>
+
+## 2. Technical Highlights
+
+<a name="多语言跨模态训练基座"></a>
+
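+### 2.1 Multilingual Cross-modal Training Foundation
+
+Baidu Wenxin Document Intelligence recently set new best results on 11 document intelligence tasks across five categories with [ERNIE-Layout](http://arxiv.org/abs/2210.06155), a multilingual, cross-modal, layout-enhanced large model for document intelligence. Built on the Wenxin ERNIE large model and layout-knowledge enhancement, it jointly models text, image, and layout information. It can deeply understand and analyze multimodal documents (such as document images, PDF files, and scans) and provides a SOTA model base for upper-layer applications.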
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/196373896-597f6178-4c78-41a1-bb12-796546644b32.png width="600"/>
+</div>
+
+<a name="多场景覆盖"></a>
+
+### 2.2 Multi-scenario Coverage
+
+The following are some application scenarios of document intelligence technology:
+
+- Invoice extraction Q&A
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/196118171-fd3e49a0-b9f1-4536-a904-c48f709a2dec.png height=350 width=1000 hspace='10'/>
+</div>
+
+- Poster extraction Q&A
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195610368-04230855-62de-439e-b708-2c195b70461f.png height=600 width=1000 hspace='15'/>
+</div>
+
+- Web page extraction Q&A
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195611613-bdbe692e-d7f2-4a2b-b548-1a933463b0b9.png height=350 width=1000 hspace='10'/>
+</div>
+
+
+- Table extraction Q&A
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195610692-8367f1c8-32c2-4b5d-9514-a149795cf609.png height=350 width=1000 hspace='10'/>
+</div>
+
+
+- Exam paper extraction Q&A
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195823294-d891d95a-2ef8-4519-be59-0fedb96c00de.png height=700 width=1000 hspace='10'/>
+</div>
+
+
+- Multilingual extraction Q&A on English receipts (Chinese, English, Japanese, Thai, Spanish, Russian)
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195610820-7fb88608-b317-45fc-a6ab-97bf3b20a4ac.png height=400 width=1000 hspace='15'/>
+</div>
+
+- Multilingual extraction Q&A on Chinese receipts (Simplified Chinese, Traditional Chinese, English, Japanese, French)
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png height=350 width=1000 hspace='15'/>
+</div>
+
+- The demo images can be [downloaded here](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip)
+
+<a name="快速开始"></a>
+
+## 3. Quick Start
+
+<a name="开箱即用"></a>
+
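+### 3.1 Out of the Box
+
+DocPrompt is an open-source model for open document extraction Q&A, built on ERNIE-Layout. It can precisely understand text and image information, reason over and learn additional knowledge, and accurately capture every detail in multimodal documents such as images and PDFs.
+
+🧾 Try DocPrompt on the [Hugging Face page](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout):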
+
+<div align="center">
+    <img src=https://user-images.githubusercontent.com/40840292/195749427-864d7744-1fd1-455e-99c6-53a260776483.jpg height=700 width=1100 hspace='10'/>
+</div>
+
+#### Taskflow
+
+``paddlenlp.Taskflow`` exposes DocPrompt in three lines of code and supports multilingual document extraction Q&A. Some application scenarios are shown below:
+
+- Input format
+
+```
+[
+  {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]},
+  {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
+]
+```
+
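+PaddleOCR is used for OCR by default. Users can also pass in their own OCR results via ``word_boxes``, in the format ``List[str, List[float, float, float, float]]``, that is, each entry pairs a recognized text string with a list of four floats.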
+
+```
+[
+  {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
+]
+```
+
+- Supports single and batch prediction
+
+  - Supports local image path input
+
+  <div align="center">
+      <img src=https://user-images.githubusercontent.com/40840292/194748579-f9e8aa86-7f65-4827-bfae-824c037228b3.png height=800 hspace='20'/>
+  </div>
+
+  ```python
+  >>> from pprint import pprint
+  >>> from paddlenlp import Taskflow
+
+  >>> docprompt = Taskflow("document_intelligence")
+  >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
+  [{'prompt': '五百丁本次想要担任的是什么职位?',
+    'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
+  {'prompt': '五百丁是在哪里上的大学?',
+    'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
+  {'prompt': '大学学的是什么专业?',
+    'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科）'}]}]
+  ```
+
+  - Supports HTTP image URL input
+
+  <div align="center">
+      <img src=https://user-images.githubusercontent.com/40840292/194748592-e20b2a5f-d36b-46fb-8057-86755d188af0.jpg height=400 hspace='10'/>
+  </div>
+
+  ```python
+  >>> from pprint import pprint
+  >>> from paddlenlp import Taskflow
+
+  >>> docprompt = Taskflow("document_intelligence")
+  >>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}]))
+  [{'prompt': '发票号码是多少?',
+    'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]},
+  {'prompt': '校验码是多少?',
+    'result': [{'end': 233,
+                'prob': 1.0,
+                'start': 231,
+                'value': '01107 555427109891646'}]}]
+  ```
+
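+- Configurable parameters
+  * `batch_size`: batch size; adjust it according to your machine. Defaults to 1.
+  * `lang`: the language used by PaddleOCR. `ch` works for images that mix Chinese and English, while `en` performs better on English images. Defaults to `ch`.
+  * `topn`: if the model finds multiple results, the n results with the highest probability are returned. Defaults to 1.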
+
+<a name="产业级流程方案"></a>
+
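+### 3.2 Industrial-grade Pipeline Solutions
+
+To address the pain points and difficulties of document intelligence, PaddleNLP will continue to open-source a series of industrial practice examples for document intelligence that solve developers' real application problems.
+
+- 👉 [Cross-modal intelligent Q&A for car manuals](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/document_intelligence/doc_vqa#readme)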
+
+## References
+
+- [Document Intelligence: Datasets, Models, and Applications (文档智能：数据集、模型和应用)](http://jcip.cipsc.org.cn/CN/abstract/abstract3331.shtml)
+
+- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155)
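The input format and configurable parameters described in the README above can be sketched in Python as follows. This is an illustrative sketch, not part of the commit: the helper names `make_doc_input`, `top_answers`, and `run_docprompt` are hypothetical, and only `Taskflow("document_intelligence")` together with the `batch_size`, `lang`, and `topn` parameters comes from the README. The README does not say what the four floats in a `word_boxes` entry mean, so the sketch only checks their count.

```python
from typing import Any, Dict, List, Optional, Sequence


def make_doc_input(doc: str,
                   prompts: Sequence[str],
                   word_boxes: Optional[Sequence[Sequence[Any]]] = None) -> Dict[str, Any]:
    """Build one record of the form {"doc": ..., "prompt": [...], "word_boxes": [...]}."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    record: Dict[str, Any] = {"doc": doc, "prompt": list(prompts)}
    if word_boxes is not None:
        checked = []
        for entry in word_boxes:
            # Each entry is expected to look like [text, [f1, f2, f3, f4]].
            if len(entry) != 2:
                raise ValueError(f"bad word_boxes entry: {entry!r}")
            text, box = entry
            if not isinstance(text, str) or len(box) != 4:
                raise ValueError(f"bad word_boxes entry: {entry!r}")
            checked.append([text, [float(v) for v in box]])
        record["word_boxes"] = checked
    return record


def top_answers(results: List[Dict[str, Any]], min_prob: float = 0.0) -> Dict[str, Optional[str]]:
    """Map each prompt to its highest-probability answer value, or None."""
    answers: Dict[str, Optional[str]] = {}
    for item in results:
        candidates = [r for r in item["result"] if r["prob"] >= min_prob]
        best = max(candidates, key=lambda r: r["prob"], default=None)
        answers[item["prompt"]] = best["value"] if best is not None else None
    return answers


def run_docprompt(inputs: List[Dict[str, Any]],
                  batch_size: int = 1, lang: str = "ch", topn: int = 1) -> Any:
    """Call DocPrompt. Needs paddlenlp installed, so the import is deferred."""
    from paddlenlp import Taskflow

    docprompt = Taskflow("document_intelligence",
                         batch_size=batch_size, lang=lang, topn=topn)
    return docprompt(inputs)
```

A call such as `top_answers(run_docprompt([make_doc_input("./resume.png", ["大学学的是什么专业?"])]), min_prob=0.5)` would then keep only answers whose `prob` is at least 0.5. The threshold is an illustrative choice, not a recommendation from the README.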
diff --git a/applications/doc_vqa/.gitignore → .../document_intelligence/doc_vqa/.gitignore b/applications/doc_vqa/.gitignore → .../document_intelligence/doc_vqa/.gitignore
diff --git a/...tions/doc_vqa/Extraction/change_to_mrc.py → ...gence/doc_vqa/Extraction/change_to_mrc.py b/...tions/doc_vqa/Extraction/change_to_mrc.py → ...gence/doc_vqa/Extraction/change_to_mrc.py
diff --git a/applications/doc_vqa/Extraction/docvqa.py → ...intelligence/doc_vqa/Extraction/docvqa.py b/applications/doc_vqa/Extraction/docvqa.py → ...intelligence/doc_vqa/Extraction/docvqa.py
diff --git a/applications/doc_vqa/Extraction/model.py → ..._intelligence/doc_vqa/Extraction/model.py b/applications/doc_vqa/Extraction/model.py → ..._intelligence/doc_vqa/Extraction/model.py
diff --git a/...ications/doc_vqa/Extraction/run_docvqa.py → ...lligence/doc_vqa/Extraction/run_docvqa.py b/...ications/doc_vqa/Extraction/run_docvqa.py → ...lligence/doc_vqa/Extraction/run_docvqa.py
diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_test.sh b/applications/document_intelligence/doc_vqa/Extraction/run_test.sh
@@ -0,0 +1,38 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+export CUDA_VISIBLE_DEVICES=0
+
+QUESTION=$1
+
+python3 change_to_mrc.py "${QUESTION}"
+
+python3 ./run_docvqa.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --max_seq_len 512 \
+    --do_test true \
+    --test_file "data/demo_test.json" \
+    --num_train_epochs 100 \
+    --eval_steps 6000 \
+    --save_steps 6000 \
+    --output_dir "output/" \
+    --save_path "data/decode_res.json" \
+    --init_checkpoint "./checkpoints/layoutxlm/" \
+    --learning_rate 3e-5 \
+    --warmup_steps 12000 \
+    --per_gpu_train_batch_size 4 \
+    --per_gpu_eval_batch_size 1 \
+    --seed 2048
+
+python3 view.py
diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_train.sh b/applications/document_intelligence/doc_vqa/Extraction/run_train.sh
@@ -0,0 +1,32 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+export CUDA_VISIBLE_DEVICES=0
+
+python3 ./run_docvqa.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --max_seq_len 512 \
+    --train_file "data/train.json" \
+    --init_checkpoint "checkpoints/base_model" \
+    --do_train true \
+    --num_train_epochs 50 \
+    --eval_steps 24000 \
+    --save_steps 40 \
+    --output_dir "output" \
+    --save_path "data/decode_res.json" \
+    --learning_rate 3e-5 \
+    --warmup_steps 40 \
+    --per_gpu_train_batch_size 4 \
+    --per_gpu_eval_batch_size 4 \
+    --seed 2048
diff --git a/applications/doc_vqa/Extraction/view.py → ...t_intelligence/doc_vqa/Extraction/view.py b/applications/doc_vqa/Extraction/view.py → ...t_intelligence/doc_vqa/Extraction/view.py
diff --git a/...ations/doc_vqa/OCR_process/ocr_process.py → ...igence/doc_vqa/OCR_process/ocr_process.py b/...ations/doc_vqa/OCR_process/ocr_process.py → ...igence/doc_vqa/OCR_process/ocr_process.py
diff --git a/applications/doc_vqa/README.md → ...s/document_intelligence/doc_vqa/README.md b/applications/doc_vqa/README.md → ...s/document_intelligence/doc_vqa/README.md
diff --git a/...ations/doc_vqa/Rerank/change_to_rerank.py → ...igence/doc_vqa/Rerank/change_to_rerank.py b/...ations/doc_vqa/Rerank/change_to_rerank.py → ...igence/doc_vqa/Rerank/change_to_rerank.py
diff --git a/...onfig/ernie_base_1.0_CN/ernie_config.json → ...onfig/ernie_base_1.0_CN/ernie_config.json b/...onfig/ernie_base_1.0_CN/ernie_config.json → ...onfig/ernie_base_1.0_CN/ernie_config.json
diff --git a/...Rerank/config/ernie_base_1.0_CN/vocab.txt → ...Rerank/config/ernie_base_1.0_CN/vocab.txt b/...Rerank/config/ernie_base_1.0_CN/vocab.txt → ...Rerank/config/ernie_base_1.0_CN/vocab.txt
diff --git a/applications/doc_vqa/Rerank/run_test.sh → ...t_intelligence/doc_vqa/Rerank/run_test.sh b/applications/doc_vqa/Rerank/run_test.sh → ...t_intelligence/doc_vqa/Rerank/run_test.sh
@@ -1,5 +1,19 @@
 #!/bin/bash
 
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 export CUDA_VISIBLE_DEVICES=0
 
 QUESTION=$1

diff --git a/applications/doc_vqa/Rerank/run_train.sh → ..._intelligence/doc_vqa/Rerank/run_train.sh b/applications/doc_vqa/Rerank/run_train.sh → ..._intelligence/doc_vqa/Rerank/run_train.sh
@@ -1,4 +1,19 @@
 #!/bin/bash
+
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 export CUDA_VISIBLE_DEVICES=0
 
 if [ $# != 4 ];then

diff --git a/applications/doc_vqa/Rerank/src/batching.py → ...telligence/doc_vqa/Rerank/src/batching.py b/applications/doc_vqa/Rerank/src/batching.py → ...telligence/doc_vqa/Rerank/src/batching.py
diff --git a/...tions/doc_vqa/Rerank/src/cross_encoder.py → ...gence/doc_vqa/Rerank/src/cross_encoder.py b/...tions/doc_vqa/Rerank/src/cross_encoder.py → ...gence/doc_vqa/Rerank/src/cross_encoder.py
diff --git a/...tions/doc_vqa/Rerank/src/finetune_args.py → ...gence/doc_vqa/Rerank/src/finetune_args.py b/...tions/doc_vqa/Rerank/src/finetune_args.py → ...gence/doc_vqa/Rerank/src/finetune_args.py
diff --git a/...ations/doc_vqa/Rerank/src/index_search.py → ...igence/doc_vqa/Rerank/src/index_search.py b/...ations/doc_vqa/Rerank/src/index_search.py → ...igence/doc_vqa/Rerank/src/index_search.py
diff --git a/applications/doc_vqa/Rerank/src/merge.py → ..._intelligence/doc_vqa/Rerank/src/merge.py b/applications/doc_vqa/Rerank/src/merge.py → ..._intelligence/doc_vqa/Rerank/src/merge.py
diff --git a/...cations/doc_vqa/Rerank/src/model/ernie.py → ...ligence/doc_vqa/Rerank/src/model/ernie.py b/...cations/doc_vqa/Rerank/src/model/ernie.py → ...ligence/doc_vqa/Rerank/src/model/ernie.py
diff --git a/...a/Rerank/src/model/transformer_encoder.py → ...a/Rerank/src/model/transformer_encoder.py b/...a/Rerank/src/model/transformer_encoder.py → ...a/Rerank/src/model/transformer_encoder.py
diff --git a/...ations/doc_vqa/Rerank/src/optimization.py → ...igence/doc_vqa/Rerank/src/optimization.py b/...ations/doc_vqa/Rerank/src/optimization.py → ...igence/doc_vqa/Rerank/src/optimization.py
diff --git a/applications/doc_vqa/Rerank/src/reader_ce.py → ...elligence/doc_vqa/Rerank/src/reader_ce.py b/applications/doc_vqa/Rerank/src/reader_ce.py → ...elligence/doc_vqa/Rerank/src/reader_ce.py
diff --git a/...ations/doc_vqa/Rerank/src/tokenization.py → ...igence/doc_vqa/Rerank/src/tokenization.py b/...ations/doc_vqa/Rerank/src/tokenization.py → ...igence/doc_vqa/Rerank/src/tokenization.py
diff --git a/applications/doc_vqa/Rerank/src/train_ce.py → ...telligence/doc_vqa/Rerank/src/train_ce.py b/applications/doc_vqa/Rerank/src/train_ce.py → ...telligence/doc_vqa/Rerank/src/train_ce.py
diff --git a/...ications/doc_vqa/Rerank/src/utils/args.py → ...lligence/doc_vqa/Rerank/src/utils/args.py b/...ications/doc_vqa/Rerank/src/utils/args.py → ...lligence/doc_vqa/Rerank/src/utils/args.py
diff --git a/...ications/doc_vqa/Rerank/src/utils/init.py → ...lligence/doc_vqa/Rerank/src/utils/init.py b/...ications/doc_vqa/Rerank/src/utils/init.py → ...lligence/doc_vqa/Rerank/src/utils/init.py
diff --git a/applications/document_intelligence/doc_vqa/run_test.sh b/applications/document_intelligence/doc_vqa/run_test.sh
@@ -0,0 +1,34 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+export CUDA_VISIBLE_DEVICES=0
+
+QUESTION=$1
+
+# Example question: NFC咋开门 ("How do I open the door with NFC?")
+
+if [ $# != 1 ];then
+    echo "USAGE: bash run_test.sh \$QUESTION"
+    exit 1
+fi
+
+# Score QUESTION against the OCR parsing results with the Rerank module
+cd Rerank
+bash run_test.sh "${QUESTION}"
+cd ..
+
+# Extract the answer to QUESTION from the top-1 ranked result
+cd Extraction
+bash run_test.sh "${QUESTION}"
+cd ..
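The two-stage flow in the script above (Rerank first, then Extraction) can also be driven from Python. The sketch below is a rough equivalent under stated assumptions, not part of the commit: `plan_pipeline` and `run_pipeline` are hypothetical names, and it assumes the `Rerank` and `Extraction` subdirectories sit under `root`, each with its own `run_test.sh`, as the diff shows.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

# Stage directories, in the order the shell script visits them.
STAGES = ("Rerank", "Extraction")


def plan_pipeline(question: str, root: str = ".") -> List[Tuple[Path, List[str]]]:
    """Return (working directory, argv) pairs for each stage without running them."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return [(Path(root) / stage, ["bash", "run_test.sh", question])
            for stage in STAGES]


def run_pipeline(question: str, root: str = ".") -> None:
    """Run Rerank, then Extraction, stopping at the first failing stage."""
    for cwd, argv in plan_pipeline(question, root):
        # The question is one argv element, so the shell never splits it on spaces.
        subprocess.run(argv, cwd=cwd, check=True)
```

Unlike the shell script, this sketch uses `check=True`, so a failure in the Rerank stage raises `subprocess.CalledProcessError` instead of continuing to Extraction. That is a design choice of the sketch, not behavior of the original script.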