Table Cell Markdown Support #4555

90 changes: 80 additions & 10 deletions docs/page.rst
@@ -331,7 +331,7 @@ In a nutshell, this is what you can do with PyMuPDF:
|history_end|


.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)

**PDF only**: Remove all **content** contained in any redaction rectangle on the page.

@@ -2338,18 +2338,28 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref

The page number "pno" is a 0-based integer `-∞ < pno < page_count`.

.. note::

Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.

However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.


.. class:: TableFinder

An object always returned by :meth:`Page.find_tables`. Attributes of interest:

... attribute:: tables
.. attribute:: tables

A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
A list of `Table` objects, each of which represents a table found on the page. Empty list if no table found.

... attribute:: page
.. attribute:: page

A reference to the :ref:`Page` object.
A reference (weakref proxy) to the owning :ref:`Page` object.

.. attribute:: cells

A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of all table cells (in any tables) found on the page. Note that cells may also be ``None`` objects, which are created to enforce a complete rows x columns structure for the affected table.


.. class:: Table
@@ -2360,25 +2370,85 @@ The page number "pno" is a 0-based integer `-∞ < pno < page_count`.

The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.



.. attribute:: cells

A list of tuples `(x0, y0, x1, y1)` representing the bounding boxes of the cells in the table. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.

.. attribute:: rows

A list of :ref:`TableRow` objects, each of which represents a row in the table. The order of rows is the same as in the original table. If the table has no rows, this will be an empty list.

.. attribute:: col_count

The number of columns in the table (integer).

.. attribute:: row_count

The number of rows in the table (integer).

.. method:: extract

Returns a (row-major) list of lists representing the plain text of the table cells. Each sublist contains the text of one row, and each item in that sublist is the text of one cell in that row. So, `Table.extract()[i][j]` will return the text of the cell in row ``i`` and column ``j``. If a cell is empty, the corresponding item will be an empty string. If the corresponding bounding box is ``None``, the item will also be ``None``.
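As a minimal sketch of the indexing described above, the following uses a hand-written list in place of a real `Table.extract()` result. The data and the helper `describe` are hypothetical and only illustrate the difference between ``None`` and an empty string.

```python
# Hypothetical stand-in for the result of Table.extract() on a 2 x 3 table.
extracted = [
    ["Name", "Qty", None],  # None: no cell bounding box at this grid position
    ["Bolt", "", "4.50"],   # "": the cell exists but contains no text
]


def describe(rows, i, j):
    """Classify the item at row i, column j of a row-major list of lists."""
    item = rows[i][j]
    if item is None:
        return "missing cell"
    if item == "":
        return "empty cell"
    return f"text: {item}"


print(describe(extracted, 0, 2))  # missing cell
print(describe(extracted, 1, 1))  # empty cell
print(describe(extracted, 1, 2))  # text: 4.50
```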

.. method:: to_markdown(clean=False, fill_empty=True)

Returns a string in `GitHub Markdown format <https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables>`_ representing the table. The string contains a header line with column names, followed by a separator line, and then the rows of the table. The text of each cell is enclosed in pipe characters `|`, and rows are separated by a newline character `\n`. **Line breaks inside a cell** are replaced by the HTML `<br>` tag. Bold, italic, mono-spaced and strikethrough text is styled with the corresponding Markdown syntax:

- Bold text will be enclosed in double asterisks ``"**"``.

- Italic text will be enclosed in single underscores ``"_"``.

- Mono-spaced text will be enclosed in backticks ``"`"``.

- Strikethrough text will be enclosed in double tildes ``"~~"``.

:arg bool clean: if ``True``, any hyphen "-" in the text is replaced by the HTML entity ``"&#45;"``.

:arg bool fill_empty: if ``True``, empty cells will be filled with a copy of neighboring cells in an effort to indicate potential column and row spans.

* For each row and starting with index 1, a cell whose content is ``None`` receives the content of its left neighbor.

* For each column and starting with index 1, a cell whose content is ``None`` receives the content of its upper neighbor.
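The two fill rules above can be sketched in plain Python as follows. This is not the library's code: the function name `fill_none_cells` is hypothetical, and running the row pass before the column pass is an assumption taken from the order of the two bullets.

```python
def fill_none_cells(cells):
    """Return a copy of a row-major grid with None items filled from neighbors."""
    grid = [row[:] for row in cells]  # leave the caller's lists untouched

    # Pass 1 (assumed to run first): within each row, starting at index 1,
    # a None item takes the content of its left neighbor.
    for row in grid:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1]

    # Pass 2: within each column, starting at row index 1,
    # a None item takes the content of its upper neighbor.
    for i in range(1, len(grid)):
        for j in range(len(grid[i])):
            if grid[i][j] is None:
                grid[i][j] = grid[i - 1][j]

    return grid


table = [["A", None, "C"], [None, "x", "y"]]
print(fill_none_cells(table))
# [['A', 'A', 'C'], ['A', 'x', 'y']]
```

A ``None`` in the very first column of the first row has neither a left nor an upper neighbor, so it stays ``None`` in this sketch.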


.. method:: to_pandas()

Returns a `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_ representing the table. The DataFrame offers a wide range of functions, among them conversion to 20+ file formats (CSV, Markdown, JSON, Excel, HDF5 etc.). Where necessary, the table can be refined in multiple ways (e.g. deleting empty rows or columns) and multiple DataFrames can be joined.


.. class:: TableHeader

.. attribute:: names

A list of strings representing the column names of the `Table`. This is usually the text content of the top row cells, but may instead be content identified above the detected table. The respective situation is encoded in the following attribute.

.. attribute:: is_external

Whether the header is part of the originally detected table (``False``) or was identified above the table (``True``). If ``True``, the header is not part of the table, but is used to identify the columns in the table. In this case, the header text will be used as column names in the extracted data.

.. attribute:: bbox

The bounding box of the header given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the header. If the header is not part of the table, this will be the rectangle that contains all cells of the header text, otherwise it is equal to the top row's bounding box.

.. attribute:: cells

A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in the header. Note that cells may also be ``None``, which will happen to prevent any gaps in a rows x columns structure. If the header is not part of the table, this will be the bounding boxes of the header text.


.. class:: TableRow

An object defining a row in a `Table` found on the page. Attributes of interest:

.. attribute:: bbox

The bounding box of the row given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the row.

.. attribute:: cells

A list of bounding box tuples `(x0, y0, x1, y1)` of the cells in this row. Note that cells may also be ``None`` objects, which will happen to prevent gaps in a rows x columns structure.
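Because a row's cell list may contain ``None`` entries, code that walks the cells has to skip them. A minimal sketch, assuming plain tuples rather than real `TableRow` objects (the helper `union_bbox` and the coordinates are hypothetical), shows how a row's bounding box relates to its cells:

```python
def union_bbox(cells):
    """Smallest (x0, y0, x1, y1) rectangle containing all non-None cells."""
    boxes = [c for c in cells if c is not None]  # skip placeholder cells
    if not boxes:
        return None  # a row made up of placeholders has no extent
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


row_cells = [(10, 20, 60, 40), None, (110, 20, 160, 42)]
print(union_bbox(row_cells))  # (10, 20, 160, 42)
```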

.. note::

Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.

However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.

.. rubric:: Footnotes

130 changes: 127 additions & 3 deletions src/table.py
@@ -89,18 +89,130 @@
Matrix,
TEXTFLAGS_TEXT,
TEXT_FONT_BOLD,
TEXT_FONT_ITALIC,
TEXT_FONT_MONOSPACED,
TEXT_FONT_SUPERSCRIPT,
TEXT_COLLECT_STYLES,
TOOLS,
EMPTY_RECT,
sRGB_to_pdf,
Point,
message,
mupdf,
)

EDGES = [] # vector graphics from PyMuPDF
CHARS = [] # text characters from PyMuPDF
TEXTPAGE = None
TEXT_BOLD = mupdf.FZ_STEXT_BOLD
TEXT_STRIKEOUT = mupdf.FZ_STEXT_STRIKEOUT
FLAGS = TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES

white_spaces = set(string.whitespace) # for checking white space only cells


def extract_cells(textpage, cell, markdown=False):
    """Extract text from a rect-like 'cell' as plain or MD style text.

    This function should ultimately be used to extract text from a table cell.
    Markdown output will only work correctly if extraction flag bit
    TEXT_COLLECT_STYLES is set.

    Args:
        textpage: A PyMuPDF TextPage object. Must have been created with
            TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES.
        cell: A tuple (x0, y0, x1, y1) defining the cell's bbox.
        markdown: If True, return text formatted for Markdown.

    Returns:
        A string with the text extracted from the cell.
    """
    text = ""
    for block in textpage.extractRAWDICT()["blocks"]:
        if block["type"] != 0:
            continue
        block_bbox = block["bbox"]
        if (
            0
            or block_bbox[0] > cell[2]
            or block_bbox[2] < cell[0]
            or block_bbox[1] > cell[3]
            or block_bbox[3] < cell[1]
        ):
            continue # skip block outside cell
        line_count = len(block["lines"])
        for line in block["lines"]:
            lbbox = line["bbox"]
            if (
                0
                or lbbox[0] > cell[2]
                or lbbox[2] < cell[0]
                or lbbox[1] > cell[3]
                or lbbox[3] < cell[1]
            ):
                continue # skip line outside cell

            if text: # must be a new line in the cell
                text += "<br>" if markdown else "\n"

            # strikeout detection only works with horizontal text
            horizontal = line["dir"] == (0, 1) or line["dir"] == (1, 0)

            for span in line["spans"]:
                sbbox = span["bbox"]
                if (
                    0
                    or sbbox[0] > cell[2]
                    or sbbox[2] < cell[0]
                    or sbbox[1] > cell[3]
                    or sbbox[3] < cell[1]
                ):
                    continue # skip spans outside cell

                # only include chars with more than 50% bbox overlap
                span_text = ""
                for char in span["chars"]:
                    bbox = Rect(char["bbox"])
                    if abs(bbox & cell) > 0.5 * abs(bbox):
                        span_text += char["c"]

                if not span_text:
                    continue # skip empty span

                if not markdown: # no MD styling
                    text += span_text
                    continue

                prefix = ""
                suffix = ""
                if horizontal and span["char_flags"] & TEXT_STRIKEOUT:
                    prefix += "~~"
                    suffix = "~~" + suffix
                if span["char_flags"] & TEXT_BOLD:
                    prefix += "**"
                    suffix = "**" + suffix
                if span["flags"] & TEXT_FONT_ITALIC:
                    prefix += "_"
                    suffix = "_" + suffix
                if span["flags"] & TEXT_FONT_MONOSPACED:
                    prefix += "`"
                    suffix = "`" + suffix

                if len(span["chars"]) > 2:
                    span_text = span_text.rstrip()

                # if span continues previous styling: extend cell text
                if (ls := len(suffix)) and text.endswith(suffix):
                    text = text[:-ls] + span_text + suffix
                else: # append the span with new styling
                    if not span_text.strip():
                        text += " "
                    else:
                        text += prefix + span_text + suffix

    return text.strip()
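The way `extract_cells` nests the style markers and merges consecutive spans that share the same styling can be isolated in a small standalone sketch. The helper names `style_markers` and `append_span` are hypothetical, boolean arguments stand in for the flag tests on `span["flags"]` and `span["char_flags"]`, and the trailing-whitespace trimming of longer spans is left out.

```python
def style_markers(strikeout=False, bold=False, italic=False, mono=False):
    """Build the opening and closing Markdown markers for one span.

    Markers open in the order strikeout, bold, italic, monospaced
    and close in the reverse order, so they nest properly.
    """
    prefix = ""
    suffix = ""
    for active, mark in ((strikeout, "~~"), (bold, "**"), (italic, "_"), (mono, "`")):
        if active:
            prefix += mark
            suffix = mark + suffix
    return prefix, suffix


def append_span(text, span_text, prefix, suffix):
    """Append one span to the cell text, merging with identical prior styling."""
    if suffix and text.endswith(suffix):
        # Same styling as the text just before: reopen the closing markers.
        return text[: -len(suffix)] + span_text + suffix
    if not span_text.strip():
        return text + " "  # whitespace-only spans are never wrapped in markers
    return text + prefix + span_text + suffix


prefix, suffix = style_markers(bold=True, mono=True)
text = append_span("", "Bold-", prefix, suffix)
text = append_span(text, "monospaced", prefix, suffix)
print(text)  # **`Bold-monospaced`**
```

The merge step keeps two adjacent bold spans from being emitted as `**a****b**`, which Markdown renderers do not display as one bold run.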


# -------------------------------------------------------------------
# End of PyMuPDF interface code
# -------------------------------------------------------------------
@@ -1382,7 +1494,18 @@ def to_markdown(self, clean=False, fill_empty=True):
output = "|"
rows = self.row_count
cols = self.col_count
cells = self.extract()[:] # make local copy of table text content

# cell coordinates
cell_boxes = [[c for c in r.cells] for r in self.rows]

# cell text strings
cells = [[None for i in range(cols)] for j in range(rows)]
for i, row in enumerate(cell_boxes):
for j, cell in enumerate(row):
if cell is not None:
cells[i][j] = extract_cells(
TEXTPAGE, cell_boxes[i][j], markdown=True
)

if fill_empty: # fill "None" cells where possible

@@ -1420,7 +1543,8 @@ def to_markdown(self, clean=False, fill_empty=True):
for i, cell in enumerate(row):
# replace None cells with empty string
# use HTML line break tag
cell = "" if not cell else cell.replace("\n", "<br>")
if cell is None:
cell = ""
if clean: # remove sensitive syntax
cell = html.escape(cell.replace("-", "&#45;"))
line += cell + "|"
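For reference, the `clean` branch shown in this hunk can be tried in isolation with the standard library alone. The wrapper name `clean_cell` is hypothetical; the expression inside it is copied from the line above. Since `html.escape` runs after the hyphen replacement, the ampersand of the inserted entity is escaped as well.

```python
import html


def clean_cell(cell):
    """Apply the same expression as the `clean` branch of to_markdown."""
    return html.escape(cell.replace("-", "&#45;"))


print(clean_cell("a-b"))    # a&amp;#45;b
print(clean_cell("x < y"))  # x &lt; y
```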
@@ -1944,7 +2068,7 @@ def make_chars(page, clip=None):
page_number = page.number + 1
page_height = page.rect.height
ctm = page.transformation_matrix
TEXTPAGE = page.get_textpage(clip=clip, flags=TEXTFLAGS_TEXT)
TEXTPAGE = page.get_textpage(clip=clip, flags=FLAGS)
blocks = page.get_text("rawdict", textpage=TEXTPAGE)["blocks"]
doctop_base = page_height * page.number
for block in blocks:
Binary file added tests/resources/test-styled-table.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions tests/test_tables.py
@@ -423,3 +423,13 @@ def test_4017():
["Weighted Average Life", "4.83", "<=", "9.00", "", "PASS", "4.92"],
]
assert tables[-1].extract() == expected_b


def test_md_styles():
    """Test output of table with MD-styled cells."""
    filename = os.path.join(scriptdir, "resources", "test-styled-table.pdf")
    doc = pymupdf.open(filename)
    page = doc[0]
    tabs = page.find_tables()[0]
    text = """|Column 1|Column 2|Column 3|\n|---|---|---|\n|Zelle (0,0)|**Bold (0,1)**|Zelle (0,2)|\n|~~Strikeout (1,0), Zeile 1~~<br>~~Hier kommt Zeile 2.~~|Zelle (1,1)|~~Strikeout (1,2)~~|\n|**`Bold-monospaced`**<br>**`(2,0)`**|_Italic (2,1)_|**_Bold-italic_**<br>**_(2,2)_**|\n|Zelle (3,0)|~~**Bold-strikeout**~~<br>~~**(3,1)**~~|Zelle (3,2)|\n\n"""
    assert tabs.to_markdown() == text