feat: add parameter to choose which pages to export to DocTags #290

Merged · 2 commits · Jun 16, 2025
3 changes: 3 additions & 0 deletions docling_core/types/doc/document.py
@@ -3885,6 +3885,7 @@
        add_table_cell_location: bool = False,
        add_table_cell_text: bool = True,
        minified: bool = False,
        pages: Optional[set[int]] = None,

[Check warning (Codecov / codecov/patch): added line docling_core/types/doc/document.py#L3888 was not covered by tests]
    ) -> str:
        r"""Exports the document content to a DocumentToken format.

@@ -3903,6 +3904,7 @@
        :param add_table_cell_location: bool: (table-specific flag; Default value = False)
        :param add_table_cell_text: bool: (Default value = True)
        :param minified: bool: (Default value = False)
        :param pages: Optional[set[int]]: (Default value = None) Pages to include in the export; None exports all pages.
        :returns: The content of the document formatted as a DocTags string.
        :rtype: str
        """
@@ -3927,6 +3929,7 @@
            add_page_break=add_page_index,
            add_table_cell_location=add_table_cell_location,
            add_table_cell_text=add_table_cell_text,
            pages=pages,
            mode=(
                DocTagsParams.Mode.MINIFIED
                if minified
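For context, a minimal usage sketch of the new parameter (the input file name is illustrative; pages=None, the default, keeps the previous behavior and exports every page):

from docling_core.types.doc import DoclingDocument

# Load a document and export DocTags for a subset of its pages only.
doc = DoclingDocument.load_from_json("2206.01062.json")  # illustrative file name
doctags = doc.export_to_doctags(pages={1, 2})
print(doctags)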
52 changes: 52 additions & 0 deletions test/data/doc/2206.01062.yaml.pages.dt
@@ -0,0 +1,52 @@
<doctag><page_header><loc_15><loc_104><loc_30><loc_350>arXiv:2206.01062v1 [cs.CV] 2 Jun 2022</page_header>
<section_header_level_1><loc_88><loc_53><loc_413><loc_75>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</section_header_level_1>
<text><loc_74><loc_85><loc_158><loc_114>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</text>
<text><loc_208><loc_85><loc_292><loc_114>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</text>
<text><loc_343><loc_85><loc_426><loc_114>Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com</text>
<text><loc_141><loc_122><loc_225><loc_151>Ahmed S. Nassar IBM Research Rueschlikon, Switzerland ahn@zurich.ibm.com</text>
<text><loc_275><loc_122><loc_359><loc_151>Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com</text>
<section_header_level_1><loc_44><loc_157><loc_91><loc_163>ABSTRACT</section_header_level_1>
<text><loc_44><loc_166><loc_241><loc_337>Accurate document layout analysis is a key requirement for highquality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet , a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNettrained models are more robust and thus the preferred choice for general-purpose document-layout analysis.</text>
<section_header_level_1><loc_44><loc_348><loc_110><loc_354>CCS CONCEPTS</section_header_level_1>
<text><loc_44><loc_356><loc_243><loc_377>· Informationsystems → Documentstructure ; · Appliedcomputing → Document analysis ; · Computing methodologies → Machine learning Computer vision ; ; Object detection ;</text>
<text><loc_44><loc_401><loc_241><loc_425>Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).</text>
<text><loc_44><loc_426><loc_162><loc_430>KDD '22, August 14-18, 2022, Washington, DC, USA</text>
<text><loc_44><loc_432><loc_153><loc_436>© 2022 Copyright held by the owner/author(s).</text>
<text><loc_44><loc_437><loc_128><loc_441>ACM ISBN 978-1-4503-9385-0/22/08.</text>
<text><loc_44><loc_442><loc_136><loc_446>https://doi.org/10.1145/3534678.3539043</text>
<picture><loc_264><loc_158><loc_452><loc_332><caption><loc_260><loc_341><loc_457><loc_353>Figure 1: Four examples of complex page layouts across different document categories</caption></picture>
<section_header_level_1><loc_260><loc_375><loc_310><loc_380>KEYWORDS</section_header_level_1>
<text><loc_260><loc_384><loc_457><loc_396>PDF document conversion, layout segmentation, object-detection, data set, Machine Learning</text>
<section_header_level_1><loc_260><loc_405><loc_331><loc_409>ACMReference Format:</section_header_level_1>
<text><loc_260><loc_411><loc_457><loc_446>Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 2022. DocLayNet: A Large Human-Annotated Dataset for DocumentLayout Analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/ 3534678.3539043</text>
<page_break>
<page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
<section_header_level_1><loc_44><loc_55><loc_128><loc_61>1 INTRODUCTION</section_header_level_1>
<text><loc_44><loc_70><loc_248><loc_144>Despite the substantial improvements achieved with machine-learning (ML) approaches and deep neural networks in recent years, document conversion remains a challenging problem, as demonstrated by the numerous public competitions held on this topic [1-4]. The challenge originates from the huge variability in PDF documents regarding layout, language and formats (scanned, programmatic or a combination of both). Engineering a single ML model that can be applied on all types of documents and provides high-quality layout segmentation remains to this day extremely challenging [5]. To highlight the variability in document layouts, we show a few example documents from the DocLayNet dataset in Figure 1.</text>
<text><loc_44><loc_146><loc_241><loc_317>Akeyproblem in the process of document conversion is to understand the structure of a single document page, i.e. which segments of text should be grouped together in a unit. To train models for this task, there are currently two large datasets available to the community, PubLayNet [6] and DocBank [7]. They were introduced in 2019 and 2020 respectively and significantly accelerated the implementation of layout detection and segmentation models due to their sizes of 300K and 500K ground-truth pages. These sizes were achieved by leveraging an automation approach. The benefit of automated ground-truth generation is obvious: one can generate large ground-truth datasets at virtually no cost. However, the automation introduces a constraint on the variability in the dataset, because corresponding structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents such as technical manuals, annual company reports, legal text, government tenders, etc. have very different and partially unique layouts. As a consequence, the layout predictions obtained from models trained on PubLayNet or DocBank is very reasonable when applied on scientific documents. However, for more artistic or free-style layouts, we see sub-par prediction quality from these models, which we demonstrate in Section 5.</text>
<text><loc_44><loc_319><loc_241><loc_366>In this paper, we present the DocLayNet dataset. It provides pageby-page layout annotation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique document pages, of which a fraction carry double- or triple-annotations. DocLayNet is similar in spirit to PubLayNet and DocBank and will likewise be made available to the public 1 in order to stimulate the document-layout analysis community. It distinguishes itself in the following aspects:</text>
<unordered_list><list_item><loc_53><loc_369><loc_241><loc_388>(1) Human Annotation : In contrast to PubLayNet and DocBank, we relied on human annotation instead of automation approaches to generate the data set.</list_item>
<list_item><loc_53><loc_390><loc_240><loc_402>(2) Large Layout Variability : We include diverse and complex layouts from a large variety of public sources.</list_item>
<list_item><loc_53><loc_404><loc_241><loc_423>(3) Detailed Label Set : We define 11 class labels to distinguish layout features in high detail. PubLayNet provides 5 labels; DocBank provides 13, although not a superset of ours.</list_item>
<list_item><loc_53><loc_425><loc_241><loc_437>(4) Redundant Annotations : A fraction of the pages in the DocLayNet data set carry more than one human annotation.</list_item>
</unordered_list>
<footnote><loc_44><loc_442><loc_176><loc_447>1 https://developer.ibm.com/exchanges/data/all/doclaynet</footnote>
<text><loc_279><loc_55><loc_456><loc_67>This enables experimentation with annotation uncertainty and quality control analysis.</text>
<unordered_list><list_item><loc_269><loc_69><loc_457><loc_102>(5) Pre-defined Train-, Test- & Validation-set : Like DocBank, we provide fixed train-, test- & validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.</list_item>
</unordered_list>
<text><loc_259><loc_106><loc_457><loc_139>All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.</text>
<text><loc_260><loc_141><loc_457><loc_194>In Section 5, we will present baseline accuracy numbers for a variety of object detection methods (Faster R-CNN, Mask R-CNN and YOLOv5) trained on DocLayNet. We further show how the model performance is impacted by varying the DocLayNet dataset size, reducing the label set and modifying the train/test-split. Last but not least, we compare the performance of models trained on PubLayNet, DocBank and DocLayNet and demonstrate that a model trained on DocLayNet provides overall more robust layout recovery.</text>
<section_header_level_1><loc_260><loc_203><loc_345><loc_209>2 RELATED WORK</section_header_level_1>
<text><loc_259><loc_219><loc_457><loc_293>While early approaches in document-layout analysis used rulebased algorithms and heuristics [8], the problem is lately addressed with deep learning methods. The most common approach is to leverage object detection models [9-15]. In the last decade, the accuracy and speed of these models has increased dramatically. Furthermore, most state-of-the-art object detection methods can be trained and applied with very little work, thanks to a standardisation effort of the ground-truth data format [16] and common deep-learning frameworks [17]. Reference data sets such as PubLayNet [6] and DocBank provide their data in the commonly accepted COCO format [16].</text>
<text><loc_260><loc_295><loc_457><loc_348>Lately, new types of ML models for document-layout analysis have emerged in the community [18-21]. These models do not approach the problem of layout analysis purely based on an image representation of the page, as computer vision methods do. Instead, they combine the text tokens and image representation of a page in order to obtain a segmentation. While the reported accuracies appear to be promising, a broadly accepted data format which links geometric and textual features has yet to establish.</text>
<section_header_level_1><loc_260><loc_357><loc_390><loc_363>3 THE DOCLAYNET DATASET</section_header_level_1>
<text><loc_260><loc_373><loc_457><loc_426>DocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape of labeled, rectangular boundingboxes. We define 11 distinct labels for layout features, namely Caption , Footnote , Formula List-item , , Page-footer , Page-header , Picture , Section-header , Table , Text , and Title . Our reasoning for picking this particular label set is detailed in Section 4.</text>
<text><loc_260><loc_428><loc_456><loc_447>In addition to open intellectual property constraints for the source documents, we required that the documents in DocLayNet adhere to a few conditions. Firstly, we kept scanned documents</text>
<page_break>
<page_break>
<page_break>
<page_break>
<page_break>
<page_break>
<page_break>
</doctag>
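As the fixture shows, only the two requested pages are serialized in full; every page outside the requested set collapses to a bare <page_break> token, so page positions stay recoverable. A small sketch of that invariant, assuming doc is a loaded multi-page DoclingDocument and page breaks are enabled (the default in this test path):

# With page breaks enabled, consecutive pages are separated by one <page_break>,
# so the count stays len(doc.pages) - 1 even when only a subset of pages is
# exported in full (as in the 9-page fixture above).
doctags = doc.export_to_doctags(pages={1, 2})
assert doctags.count("<page_break>") == len(doc.pages) - 1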
9 changes: 9 additions & 0 deletions test/test_docling_doc.py
@@ -519,6 +519,15 @@ def _test_export_methods(
    dt_min_pred = doc.export_to_doctags(minified=True)
    _verify_regression_test(dt_min_pred, filename=filename, ext="min.dt")

    # Test pages parameter in DocTags export
    if doc.pages:  # Only test if document has pages
        first_page = min(doc.pages.keys())
        second_page = first_page + 1
        if second_page in doc.pages:  # Only test if document has at least 2 pages
            dt_pages_pred = doc.export_to_doctags(pages={first_page, second_page})
            print(dt_pages_pred)
            _verify_regression_test(dt_pages_pred, filename=filename, ext="pages.dt")

    # Test Tables export ...
    for table in doc.tables:
        table.export_to_markdown()
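The new filter also composes with the flags already in the signature; a quick hedged sketch combining it with minified (both parameters appear in the document.py diff above, though the exact output shape here is an assumption):

# Export only the first page in minified mode; the <doctag> envelope seen in
# the fixture above is expected to be preserved.
first = min(doc.pages.keys())
dt_min_first = doc.export_to_doctags(pages={first}, minified=True)
assert dt_min_first.startswith("<doctag>")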