XDoc is a unified pre-trained model that handles different document formats (plain text, document, and web text) in a single model. With only 36.7% of the combined parameters of the individual format-specific pre-trained models, XDoc achieves comparable or better performance on downstream tasks, which makes it cost-effective for real-world deployment.
XDoc: Unified Pre-training for Cross-Format Document Understanding. Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. EMNLP 2022.
[Figure: overview of the XDoc framework]
Pre-trained model:

Model | Download |
---|---|
xdoc-pretrain-roberta-1M | xdoc-base |

Fine-tuned models:

Model | Download |
---|---|
xdoc-squad1.1 | xdoc-squad1.1 |
xdoc-squad2.0 | xdoc-squad2.0 |
xdoc-funsd | xdoc-funsd |
xdoc-websrc | xdoc-websrc |
The dataset will be automatically downloaded. Please refer to ./fine_tuning/squad/.
pip install -r requirements.txt
To train XDoc on SQuADv1.1
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v1_result \
--overwrite_output_dir
To train XDoc on SQuADv2.0
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad_v2 \
--do_train \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v2_result \
--overwrite_output_dir
To test XDoc on SQuADv1.1
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad1.1 \
--dataset_name squad \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv1.1_result \
--overwrite_output_dir
To test XDoc on SQuADv2.0
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad2.0 \
--dataset_name squad_v2 \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv2.0_result \
--overwrite_output_dir
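The SQuAD commands above all pass `--max_seq_length 384` and `--doc_stride 128`, which control how a context longer than the model input is split into overlapping windows. The following is a minimal sketch of that splitting, not the code in `run_squad.py`. It assumes that `doc_stride` is the number of tokens shared by consecutive windows (the meaning of the `stride` argument in Hugging Face tokenizers); whether `run_squad.py` uses it exactly this way is an assumption. The function name `window_spans` and the example token counts are hypothetical.

```python
def window_spans(num_context_tokens, max_context_tokens, overlap):
    """Return (start, end) token spans of overlapping windows covering a context."""
    if overlap >= max_context_tokens:
        raise ValueError("overlap must be smaller than the window size")
    spans = []
    start = 0
    while True:
        end = min(start + max_context_tokens, num_context_tokens)
        spans.append((start, end))
        if end == num_context_tokens:
            break
        # Assumption: the next window re-reads the last `overlap` tokens.
        start = end - overlap
    return spans


# Hypothetical numbers: a 1000-token context, and 350 context tokens per
# window (384 minus an assumed 34 tokens for the question and special tokens).
spans = window_spans(num_context_tokens=1000, max_context_tokens=350, overlap=128)
for start, end in spans:
    print(f"tokens {start}..{end}")
```

Under these assumptions, an answer that straddles a window boundary still appears whole in a neighbouring window, as long as it is shorter than the overlap.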
The dataset will be automatically downloaded. Please refer to ./fine_tuning/funsd/.
pip install -r requirements.txt
You also need to install detectron2. For example, if you use PyTorch 1.8 with CUDA 10.1, you can use the following command:
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base \
--output_dir camera_ready_funsd_1M \
--do_train \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base-funsd \
--output_dir camera_ready_funsd_1M \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
The dataset must be downloaded manually. After downloading, please modify the arguments --web_train_file, --web_eval_file, web_root_dir, and root_dir in args.py.
pip install -r requirements.txt
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train True --do_eval True --model_name_or_path microsoft/xdoc-base
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train False --do_eval True --model_name_or_path microsoft/xdoc-base-websrc
- To verify the model's accuracy, we use the GLUE benchmark and SQuAD to evaluate plain text understanding, FUNSD and DocVQA to evaluate document understanding, and WebSRC to evaluate web text understanding. Experimental results demonstrate that XDoc achieves comparable or even better performance on these tasks.
Model | MNLI-m | QNLI | SST2 | MRPC | SQuAD1.1/2.0 | FUNSD | DocVQA | WebSRC |
---|---|---|---|---|---|---|---|---|
RoBERTa | 87.6 | 92.8 | 94.8 | 90.2 | 92.2/83.4 | - | - | - |
LayoutLM | - | - | - | - | - | 79.3 | 69.2 | - |
MarkupLM | - | - | - | - | - | - | - | 74.5 |
XDoc(Ours) | 86.8 | 92.3 | 95.3 | 91.1 | 92.0/83.5 | 89.4 | 72.7 | 74.8 |
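To make "comparable or even better" concrete, the table above can be read as per-task differences between XDoc and the format-specific baseline that reports each task. The snippet below is only arithmetic on the numbers in the table; the variable names are illustrative.

```python
# Scores copied from the table above: (baseline model, baseline score, XDoc score).
results = {
    "MNLI-m":   ("RoBERTa", 87.6, 86.8),
    "QNLI":     ("RoBERTa", 92.8, 92.3),
    "SST2":     ("RoBERTa", 94.8, 95.3),
    "MRPC":     ("RoBERTa", 90.2, 91.1),
    "SQuAD1.1": ("RoBERTa", 92.2, 92.0),
    "SQuAD2.0": ("RoBERTa", 83.4, 83.5),
    "FUNSD":    ("LayoutLM", 79.3, 89.4),
    "DocVQA":   ("LayoutLM", 69.2, 72.7),
    "WebSRC":   ("MarkupLM", 74.5, 74.8),
}

# Positive delta means XDoc scores higher than the baseline on that task.
deltas = {
    task: round(xdoc - base, 1) for task, (_, base, xdoc) in results.items()
}
for task, delta in deltas.items():
    print(f"{task:9s} vs {results[task][0]:8s}: {delta:+.1f}")

num_better = sum(delta > 0 for delta in deltas.values())
print(f"XDoc is ahead on {num_better} of {len(deltas)} tasks")
```

On the remaining tasks XDoc trails RoBERTa by less than one point.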
- With only 36.7% of the combined parameters of the individual pre-trained models (146M versus 398M in total), XDoc achieves comparable or even better performance on a variety of downstream tasks, which makes it cost-effective for real-world deployment.
Model | Word | 1D Position | Transformer | 2D Position | XPath | Adaptive | Total |
---|---|---|---|---|---|---|---|
RoBERTa | √ | √ | √ | - | - | - | 128M |
LayoutLM | √ | √ | √ | √ | - | - | 131M |
MarkupLM | √ | √ | √ | - | √ | - | 139M |
XDoc(Ours) | √ | √ | √ | √ | √ | √ | 146M |
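The 36.7% figure can be reproduced from the Total column, reading it as XDoc's parameter count divided by the sum of the three format-specific models. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# Parameter totals in millions, copied from the table above.
individual_models_m = {"RoBERTa": 128, "LayoutLM": 131, "MarkupLM": 139}
xdoc_m = 146

# Deploying one model per format would need all three sets of parameters.
combined_m = sum(individual_models_m.values())
ratio = xdoc_m / combined_m

print(f"combined: {combined_m}M, XDoc: {xdoc_m}M, ratio: {ratio:.1%}")
```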
If you find XDoc helpful, please cite us:
@article{chen2022xdoc,
title={XDoc: Unified Pre-training for Cross-Format Document Understanding},
author={Chen, Jingye and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
journal={arXiv preprint arXiv:2210.02849},
year={2022}
}
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.
Microsoft Open Source Code of Conduct
For help or issues using XDoc, please submit a GitHub issue.
For other communications, please contact Lei Cui (lecu@microsoft.com) or Furu Wei (fuwei@microsoft.com).