Commit 6b03f6f

Docs fixes
1 parent 67b0d42 commit 6b03f6f

15 files changed: +75 -77 lines changed

dedoc/readers/pdf_reader/data_classes/tables/table_type.py

Lines changed: 5 additions & 4 deletions
@@ -1,9 +1,10 @@
 class TableTypeAdditionalOptions:
 """
-Setting up the table recognizer. The value of the parameter specifies the type of tables recognized when processed by
+Enum for table types of tables for the table recognizer.
+The value of the parameter specifies the type of tables recognized when processed by
 class :class:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer`.

-* Parameter `table_type=wo_external_bounds` - recognize tables without external bounds;
+* Parameter `table_type=wo_external_bounds` - recognize tables without external bounds.

 Example of a table of type `wo_external_bounds`::

@@ -16,7 +17,7 @@ class :class:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_
 text | text | text


-* Parameter `table_type=one_cell_table` - if a document contains a bounding box with text, it will be considered a table;
+* Parameter `table_type=one_cell_table` - if a document contains a bounding box with text, it will be considered a table.

 Example of a page with a table of type `one_cell_table`::

@@ -27,7 +28,7 @@ class :class:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_
 +------+
 ________________________

-* Parameter `table_type=split_last_column` - specified parameter for the merged last column of the table;
+* Parameter `table_type=split_last_column` - specified parameter for the merged last column of the table.

 Example of a table of type `split_last_column`::


dedoc/readers/pdf_reader/pdf_base_reader.py

Lines changed: 2 additions & 1 deletion
@@ -12,7 +12,7 @@
 from dedoc.readers.pdf_reader.data_classes.line_with_location import LineWithLocation
 from dedoc.readers.pdf_reader.data_classes.pdf_image_attachment import PdfImageAttachment
 from dedoc.readers.pdf_reader.data_classes.tables.scantable import ScanTable
-from dedoc.readers.pdf_reader.utils.header_footers_analysis import HeaderFooterDetector
+

 ParametersForParseDoc = namedtuple("ParametersForParseDoc", [
 "is_one_column_document",
@@ -44,6 +44,7 @@ def __init__(self, *, config: Optional[dict] = None, recognized_extensions: Opti
 from dedoc.readers.pdf_reader.pdf_image_reader.paragraph_extractor.scan_paragraph_classifier_extractor import ScanParagraphClassifierExtractor
 from dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.gost_frame_recognizer import GOSTFrameRecognizer
 from dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer import TableRecognizer
+from dedoc.readers.pdf_reader.utils.header_footers_analysis import HeaderFooterDetector
 from dedoc.readers.pdf_reader.utils.line_object_linker import LineObjectLinker
 from dedoc.attachments_extractors.concrete_attachments_extractors.pdf_attachments_extractor import PDFAttachmentsExtractor


dedoc/readers/pdf_reader/pdf_image_reader/table_recognizer/table_recognizer.py

Lines changed: 6 additions & 8 deletions
@@ -21,15 +21,13 @@

 class TableRecognizer:
 """
-The class recognizes tables from document images. This class is internal to the system. It is called from readers such as .
-
-* The class recognizes tables with borders from the document image and returns the class
-(function :meth:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer.recognize_tables_from_image`);
-
-
-* The class also analyzes recognized single-page tables and combines them into multi-page ones
-(function :meth:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer.convert_to_multipages_tables`);
+The class recognizes tables from document images. This class is internal to the system.
+It is called from readers such as :class:`dedoc.readers.PdfTxtlayerReader` or :class:`dedoc.readers.PdfImageReader`.

+* The class recognizes tables with borders from the document image using
+:meth:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer.recognize_tables_from_image`;
+* The class also analyzes recognized single-page tables and combines them into multi-page ones using
+:meth:`~dedoc.readers.pdf_reader.pdf_image_reader.table_recognizer.table_recognizer.TableRecognizer.convert_to_multipages_tables`
 """

 def __init__(self, *, config: dict = None) -> None:

dedoc/readers/pdf_reader/utils/header_footers_analysis.py

Lines changed: 8 additions & 8 deletions
@@ -16,18 +16,18 @@ class HeaderFooterDetector:
 `Lin X. Header and footer extraction by page association //Document Recognition and Retrieval X. – SPIE, 2003. – Т. 5010. – С. 164-171.`

 Algorithm's notes:
-1) For documents of 6 pages or more, lines on even and odd pages of the document are compared to detect alternating footers-headers.
-For documents of less than 6 pages, lines between adjacent pages (between even or odd pages) are compared.
-Therefore, alternating footers-headers will not be detected on documents of less than 6 pages.

-2) The algorithm analyzes the first 4 and last 4 lines on each page of the document and,
-by comparing lines across pages, identifies common footer-header patterns using Levenshtein similarity.
+1. For documents of 6 pages or more, lines on even and odd pages of the document are compared to detect alternating footers-headers.
+For documents of less than 6 pages, lines between adjacent pages (between even or odd pages) are compared.
+Therefore, alternating footers-headers will not be detected on documents of less than 6 pages.

-3) For the algorithm to work, the document must have at least two pages of text.
-It is not an ML algorithm it cannot work with just one page.
+2. The algorithm analyzes the first 4 and last 4 lines on each page of the document and,
+by comparing lines across pages, identifies common footer-header patterns using Levenshtein similarity.

-4) The more pages the better. Remember the parameter `pages` limits the number of pages in a document.
+3. For algorithm work, the document must have at least two pages of text.
+It is not an ML algorithm so it cannot work with just one page.

+4. The more pages the better. Remember the parameter `pages` limits the number of pages in a document.
 """

 def __init__(self) -> None:
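
The algorithm notes in the `HeaderFooterDetector` docstring can be illustrated with a small, self-contained sketch. This is not dedoc's implementation: the function names, the 0.8 similarity threshold, and the position-by-position comparison of adjacent pages are assumptions made for illustration, and the even/odd page comparison used for documents of 6 pages or more is left out.

```python
from typing import List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def find_repeated_lines(pages: List[List[str]], window: int = 4, threshold: float = 0.8) -> Set[Tuple[int, int]]:
    """Return (page_index, line_index) pairs that look like headers or footers.

    Only the first `window` and last `window` lines of each page are compared
    with the line at the same position on the next page. The threshold is an
    illustrative assumption, not a value taken from dedoc.
    """
    if len(pages) < 2:
        return set()  # a single page gives nothing to compare against
    found = set()
    for page_id in range(len(pages) - 1):
        cur, nxt = pages[page_id], pages[page_id + 1]
        for k in range(min(window, len(cur), len(nxt))):
            # k counts from the top (header candidate), -1 - k from the bottom (footer candidate)
            for idx in (k, -1 - k):
                if similarity(cur[idx], nxt[idx]) >= threshold:
                    found.add((page_id, idx % len(cur)))
                    found.add((page_id + 1, idx % len(nxt)))
    return found


pages = [
    ["Annual report", "Introduction", "Some body text", "Page 1"],
    ["Annual report", "Methods", "Different content here", "Page 2"],
]
print(sorted(find_repeated_lines(pages)))  # [(0, 0), (0, 3), (1, 0), (1, 3)]
```

In this toy input the repeated title and the near-identical page numbers are flagged, while the body lines are not. With only one page the function returns an empty set, which mirrors the note that at least two pages of text are required.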

docs/source/changelog.rst

Lines changed: 8 additions & 9 deletions
@@ -1,18 +1,17 @@
 Changelog
 =========
+
 v2.6 (2025-09-19)
 -----------------
 Release note: `v2.6 <https://github.com/ispras/dedoc/releases/tag/v2.6>`_

-* improve table merge algorithm (added check on table layout) `MultiPageTableExtractor`.
-* refactoring table merge `MultiPageTableExtractor`.
-* improve header footer analysis `HeaderFooterDetector`.
-* added header footer analysis support in Tabby.
-* added header footer analysis info (parameter `need_header_footer_analysis`) in documentation (readthedocs).
-* update to python3.10.
-* update to ubuntu22.04.
-* added `Contributing Information` (project rules, how build, how develop) in documentation (readthedocs).
-
+* Improve table merge algorithm (added check on table layout) `MultiPageTableExtractor`.
+* Improve header footer analysis `HeaderFooterDetector`.
+* Added header footer analysis support in `PdfTabbyReader`.
+* Added header footer analysis info (parameter `need_header_footer_analysis`) in documentation.
+* Update to python3.10.
+* Update to ubuntu22.04.
+* Added `Contributing Information` (project rules, how to build, how to develop) in documentation.

 v2.5 (2025-09-05)
 -----------------

docs/source/contributing/check_documentation.rst

Lines changed: 4 additions & 5 deletions
@@ -9,12 +9,11 @@ Check documentation

 pip install .[docs]

-2. Documentation files should be located in the `docs/ <https://github.com/ispras/dedoc/blob/master/docs>`_ directory,
-which must contain the `docs/source/conf.py <https://github.com/ispras/dedoc/blob/master/docs/source/conf.py>`_ (build settings)
-and `docs/source/index.rst <https://github.com/ispras/dedoc/blob/master/docs/source/index.rst>`_ (documentation main page) files.
-
-3. Build documentation into HTML pages is done as follows:
+2. Documentation files should be located in the `docs/ <https://github.com/ispras/dedoc/blob/master/docs>`_ directory.
+Build documentation into HTML pages is done as follows:

 .. code-block:: bash

 python -m sphinx -T -E -W -b html -d docs/_build/doctrees -D language=en docs/source docs/_build
+
+3. After building, the documentation can be checked locally, the main built page ``docs/_build/index.html`` can be opened in the browser.

docs/source/contributing/contributing.rst

Lines changed: 11 additions & 12 deletions
@@ -5,12 +5,12 @@ Support and Contributing

 Support
 -------
-If you are stuck with a problem using Dedoc, please do get in touch at our `Issues <https://github.com/ispras/dedoc/issues>`_ (recommend)
+If you are stuck with a problem using Dedoc, please use our `Issues <https://github.com/ispras/dedoc/issues>`_ (recommended)
 or `Dedoc Chat <https://t.me/dedoc_chat>`_. The developers are willing to help.

 You can save time by following this procedure when reporting a problem:

-* Do try to solve the problem on your own first. Read the documentation, including using the search feature, index and reference documentation.
+* Try to solve the problem on your own first. Read the documentation, including using the search feature, index and reference documentation.

 * Search the issue archives to see if someone else already had the same problem.

@@ -23,7 +23,9 @@ Contributing Rules

 * To add new features to the project repository yourself, you should follow
 the `general contributing rules of github <https://github.com/firstcontributions/first-contributions>`_.
-In your Pull Request, set `develop` as the target branch.
+
+.. note::
+In your Pull Request, set `develop` as the target branch.

 * We recommend using `Pycharm IDE` and `virtualenv` package for development.

@@ -34,16 +36,17 @@ Contributing Rules
 * We strongly recommend using the already used ML library `torch` in development. For example,
 using `tensorflow` library instead of `torch` is justified only in case of extreme necessity.

-* If you add new functionality to dedoc, be sure to add python `unitests` to test the added functionality
-(you can add api tests in `tests/api_tests <https://github.com/ispras/dedoc/blob/master/tests/api_tests>`_,
-you can add unit tests in `tests/unit_tests <https://github.com/ispras/dedoc/blob/master/tests/unit_tests>`_).
+* If you add new functionality to dedoc, be sure to add python `unittest` to test the added functionality
+(you can add api tests in `tests/api_tests <https://github.com/ispras/dedoc/blob/master/tests/api_tests>`_
+or unit tests in `tests/unit_tests <https://github.com/ispras/dedoc/blob/master/tests/unit_tests>`_).
 These tests are run automatically in the Continuous Integration pipeline.
+To run tests locally, you can use docker as described in the `README <https://github.com/ispras/dedoc/blob/master/README.md#4-run-container-with-tests>`_.

 * Before each commit, check the code style using the automatic checker using the `flake8` library.
-Instructions for using flake8 are provided here :ref:using_flake8`.
+Instructions for using flake8 are provided in :ref:`using_flake8`.

 * We recommend setting up pre-commit for convenience and speeding up development according to the instructions :ref:`using_precommit` .
-This will run a style check of the changed code with each commit.
+This will run a style check of the changed code before each commit.

 * In case of any change in the online documentation of the project (for example, when adding a new api parameter),
 be sure to check locally that the changed documentation is successfully built and looks as expected.
@@ -55,7 +58,3 @@ Contributing Rules
 using_flake8
 using_precommit
 check_documentation
-
-
-
-

docs/source/dedoc_api_usage/api.rst

Lines changed: 10 additions & 0 deletions
@@ -279,6 +279,16 @@ Api parameters description
 - false
 - This option is used to **remove** headers and footers of PDF documents from the output result.
 If ``need_header_footer_analysis=false``, header and footer lines will present in the output as well as all other document lines.
+The algorithm is implemented and described in the class :class:`~dedoc.readers.pdf_reader.utils.header_footers_analysis.HeaderFooterDetector`.
+
+* - table_type
+- "", wo_external_bounds, one_cell_table, split_last_column and their combinaton
+- ""
+- Setting up the table recognition method. This option is used for PDF documents which are images with text (PDF without a textual layer).
+It is also used for PDF documents when ``pdf_with_text_layer`` is ``true``, ``false``, ``auto`` or ``auto_tabby``.
+The value of the parameter specifies the type of tables for recognition,
+supported table types are described in :class:`~dedoc.readers.pdf_reader.data_classes.tables.table_type.TableTypeAdditionalOptions`.
+You can use combination of values (for example, ``wo_external_bounds+one_cell_table``).

 * - need_binarization
 - true, false
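
The `table_type` description added in this hunk says that values can be combined with `+`. A minimal sketch of assembling such a value is shown here. The helper `build_table_type` is hypothetical (it is not part of dedoc), and passing the resulting dictionary as form fields alongside the uploaded file is an assumption about how the API is called.

```python
# Option names are taken from the parameter description in the documentation diff.
ALLOWED_TABLE_TYPES = ("wo_external_bounds", "one_cell_table", "split_last_column")


def build_table_type(*options: str) -> str:
    """Join table type options with '+'; calling with no options yields the default ''."""
    unknown = [option for option in options if option not in ALLOWED_TABLE_TYPES]
    if unknown:
        raise ValueError(f"Unsupported table_type option(s): {unknown}")
    return "+".join(options)


# Hypothetical request parameters; the keys come from the documented API parameters.
parameters = {
    "pdf_with_text_layer": "false",
    "need_header_footer_analysis": "true",
    "table_type": build_table_type("wo_external_bounds", "one_cell_table"),
}
print(parameters["table_type"])  # wo_external_bounds+one_cell_table
```

Validating the option names before sending a request is a design choice of this sketch: it surfaces a misspelled option locally instead of relying on the server's handling of unknown values.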

docs/source/getting_started/installation.rst

Lines changed: 1 addition & 1 deletion
@@ -182,4 +182,4 @@ For ``python3.9``:
 .. code-block:: bash

 pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torch-1.11.0a0+git137096a-cp39-cp39-linux_x86_64.whl
-pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torchvision-0.12.0a0%2B9b5a3fe-cp39-cp39-linux_x86_64.whl
+pip install https://github.com/ispras/dedockerfiles/raw/master/wheels/torchvision-0.12.0a0%2B9b5a3fe-cp39-cp39-linux_x86_64.whl

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -278,6 +278,7 @@ This type of structure is configurable (see :ref:`using_patterns`).
 modules/metadata_extractors
 modules/structure_extractors
 modules/structure_constructors
+modules/pdf_parsing


 .. toctree::
