documentation for "table" feature #2599

Closed
wants to merge 7 commits into from
Changes from 4 commits
1 change: 0 additions & 1 deletion changes.txt
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
Change Log
==========


**Changes in version 1.23.0rc2 (2023-08-18)**

* Contains a new "rebased" implementation of PyMuPDF.
44 changes: 44 additions & 0 deletions docs/page.rst
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ In a nutshell, this is what you can do with PyMuPDF:
:meth:`Page.draw_sector` PDF only: draw a circular sector
:meth:`Page.draw_squiggle` PDF only: draw a squiggly line
:meth:`Page.draw_zigzag` PDF only: draw a zig-zagged line
:meth:`Page.find_tables` locate tables on the page
:meth:`Page.get_drawings` get vector graphics on page
:meth:`Page.get_fonts` PDF only: get list of referenced fonts
:meth:`Page.get_image_bbox` PDF only: get bbox and matrix of embedded image
Expand Down Expand Up @@ -347,6 +348,49 @@ In a nutshell, this is what you can do with PyMuPDF:
.. image:: images/img-markers.*
:scale: 100

.. method:: find_tables(clip=None, vertical_strategy="lines", horizontal_strategy="lines", vertical_lines=None, horizontal_lines=None, snap_tolerance=3, snap_x_tolerance=None, snap_y_tolerance=None, join_tolerance=3, join_x_tolerance=None, join_y_tolerance=None, edge_min_length=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=3, intersection_x_tolerance=None, intersection_y_tolerance=None, text_tolerance=3, text_x_tolerance=3, text_y_tolerance=3)

* New in version 1.23.0

Find tables on the page and return an object with related information. Typically, only a few of the many arguments ever need to be specified -- they are mainly tools for handling corner cases.

:arg rect_like clip: specify a region to consider within the page rectangle. Default is the full page.
:arg list horizontal_lines: floats containing the y-coordinates of rows. If provided, there will be no attempt to identify additional table rows.
:arg list vertical_lines: floats containing the x-coordinates of columns. If provided, there will be no attempt to identify additional table columns.
:arg str vertical_strategy: request a search algorithm. The "lines" default looks for vector drawings. If "text" is specified, text positions are used to generate "virtual" column boundaries. Use `min_words_vertical` to set how many words must share an x-coordinate before it is considered as a column boundary.
:arg str horizontal_strategy: request a search algorithm. The "lines" default looks for vector drawings. If "text" is specified, text positions are used to generate "virtual" row boundaries. The "text" choices are recommended when dealing with pages without any vector graphics -- for example, an OCRed page.
:arg int min_words_vertical: relevant for vertical strategy option "text": at least this many words must coincide to establish a virtual column boundary.
:arg int min_words_horizontal: relevant for horizontal strategy option "text": at least this many words must coincide to establish a virtual row boundary.

The remaining parameters are tolerances for merging nearby objects. For instance, two horizontal lines with the same x-coordinates and a vertical distance less than 3 will be merged ("snapped") into one line.
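The "snapping" idea described above can be illustrated with a few lines of plain Python. This is only a sketch of the concept and not PyMuPDF's actual implementation: the helper name `snap_values`, the averaging of each group, and the strict "less than" comparison at the tolerance boundary are all assumptions made for illustration.

```python
def snap_values(values, tolerance=3):
    """Merge coordinates that lie close together (illustrative sketch only).

    Sorted values whose distance to the previous value is less than
    `tolerance` are put into the same group; each group is then replaced
    by its average. The strict "<" comparison is an assumption.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] < tolerance:
            clusters[-1].append(v)  # close enough: same group
        else:
            clusters.append([v])  # too far away: start a new group
    return [sum(c) / len(c) for c in clusters]


# y-coordinates of six horizontal lines, some of them nearly coinciding
ys = [100.0, 101.5, 102.0, 150.0, 151.0, 300.0]
snapped = snap_values(ys, tolerance=3)
print(snapped)  # three values remain, one per group
```

A larger tolerance merges more lines into one, which can help with sloppily drawn table grids but may also fuse rows that are genuinely distinct.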

:returns: a `TableFinder` object that has the following significant attributes:

* **cells:** a list of **all bboxes** on the page that have been identified as table cells (across all tables). Each cell is either a tuple `(x0, y0, x1, y1)` of coordinates or `None`.
* **tables:** a list of `Table` objects. This is `[]` if the page has no tables. Please note that while single tables can be found as items of this list, the `TableFinder` object itself is also a sequence of its tables. This means that if `tabs` is a `TableFinder` object, then table number "n" is delivered by `tabs.tables[n]` as well as by the shorter `tabs[n]`.


* The `Table` object has the following attributes:

* **bbox:** the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
* **cells:** bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
* **extract():** this method returns the text content of each table cell as a list of lists of strings.
* **to_pandas():** this method returns the table as a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_.
* **header:** a `TableHeader` object containing header information of the table.
* **col_count:** an integer containing the number of table columns.
* **row_count:** an integer containing the number of table rows.
* **rows:** a list of `TableRow` objects containing two attributes: *bbox* is the bounding box of the row, and *cells* is a list of table cells contained in this row.

* The `TableHeader` object has the following attributes:

* **bbox:** the bounding box of the header.
* **cells:** a list of bounding boxes containing the name of the respective column.
* **names:** a list of strings containing the text of each of the cell bboxes. They represent the column names -- which can be used when exporting the table to pandas DataFrames or CSV, etc.
* **external:** a bool indicating whether the header bbox is outside the table body (`True`) or not. Table headers are never identified by the `TableFinder` logic. Therefore, if *external* is true, then the header cells are not part of any cell identified by `TableFinder`. If `external == False`, then the first table row is the header.

Please have a look at example `Jupyter notebooks <https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis>`_, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.


.. method:: add_stamp_annot(rect, stamp=0)

PDF only: Add a "rubber stamp" like annotation to e.g. indicate the document's intended use ("DRAFT", "CONFIDENTIAL", etc.).
12 changes: 8 additions & 4 deletions docs/recipes-text.rst
Original file line number Diff line number Diff line change
Expand Up @@ -113,12 +113,16 @@ You can also use the above mentioned `script <https://github.com/pymupdf/PyMuPDF
.. _RecipesText_D:

How to :index:`Extract Table Content <pair: extract; table>` from Documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you see a table in a document, you are not normally looking at something like an embedded Excel or other identifiable object. It usually is just text, formatted to appear as appropriate.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you see a table in a document, you are normally not looking at something like an embedded Excel or other identifiable object. It is usually just ordinary text, formatted to appear as tabular data.

Extracting a tabular data from such a page area therefore means that you must find a way to **(1)** graphically indicate table and column borders, and **(2)** then extract text based on this information.
Extracting tabular data from such a page area therefore means that you must find a way to **identify** the table area (i.e. its bounding box), then **(1)** graphically indicate table and column borders, and **(2)** then extract text based on this information.

The wxPython GUI script `extract.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-table/extract.py>`_ strives to exactly do that. You may want to have a look at it and adjust it to your liking.
This can be a very complex task, depending on details like the presence or absence of lines, rectangles or other supporting vector graphics.

Method :meth:`Page.find_tables` does all that for you, with a high table detection precision. Its great advantage is that it has no external library dependencies and does not need to employ artificial intelligence or machine learning technologies. It also provides an integrated interface to the well-known Python package for data analysis `pandas <https://pypi.org/project/pandas/>`_.

Please have a look at example `Jupyter notebooks <https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis>`_, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.
JorjMcKie marked this conversation as resolved.

----------

31 changes: 31 additions & 0 deletions docs/the-basics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1106,8 +1106,39 @@ The following snippet creates a new :title:`PDF` and encrypts it with separate u

- :meth:`Document.save`

--------------------------



.. _The_Basics_Extracting_Tables:

Extracting Tables from a :title:`Page`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tables can be found and extracted from any document :ref:`Page`.

.. raw:: html

<pre>
<code class="language-python" data-prismjs-copy="Copy">
import fitz
from pprint import pprint

doc = fitz.open("test.pdf")  # open document
page = doc[0]  # get the 1st page of the document
tabs = page.find_tables()  # locate and extract any tables on page
print(f"{len(tabs.tables)} table(s) found on {page}")  # display number of found tables
if tabs.tables:  # at least one table found?
    pprint(tabs[0].extract())  # print content of first table
</code>
</pre>


.. note::

**API reference**

- :meth:`Page.find_tables`


