Multiple Rows being merged into one #4562

1504168 · 2025-06-16T12:06:42Z

SPARE-PART-LIST-Chef-Cream-130-Quick-i-400V_3ph_50Hz-NO-PRICES.pdf

I have the PDF file attached above.
I want to extract the table on page 26, which has visible grid lines. When I run the code below:

import pymupdf

doc = pymupdf.open("SPARE-PART-LIST-Chef-Cream-130-Quick-i-400V_3ph_50Hz-NO-PRICES.pdf")
page = doc[25]  # page 26 (page indices are 0-based)

# Clip from 2 pt above the first "PZZ" hit down to 20 pt above the page bottom.
top = page.search_for("PZZ")[0].y0 - 2
clip = pymupdf.Rect(0, top, page.rect.x1, page.rect.y1 - 20)

tables = page.find_tables(
    clip=clip,
    horizontal_strategy="lines_strict",
    vertical_strategy="lines_strict",
).tables
print(len(tables))
print(tables[0].to_pandas())

In the output, multiple rows are merged into a single row. I also tried the snap_y_tolerance parameter, but it has no effect on the result.

Page 10 is another interesting case: it contains a table, but neither the lines nor the lines_strict strategy finds it.

JorjMcKie · 2025-06-16T16:17:47Z

Maintainer

I don't know what is going wrong here yet. But at least I know how to get the right results ...
The visible grid lines in your PDF are technically implemented as fill-only slim rectangles - called "line-like" rectangles.

The following script looks at line-like "f"-type rectangles and adds them again as helper bboxes (add_boxes).

import pymupdf

doc = pymupdf.open("test.pdf")
for page in doc:
    paths = page.get_drawings()
    rects = []  # collect f-type / line-like rectangles
    for p in paths:
        if p["type"] == "f" and (p["rect"].width <= 3 or p["rect"].height <= 3):
            rects.append(p["rect"])
    tabs = page.find_tables(
        strategy="lines_strict",  # ignore fill-only rect-like areas
        paths=paths,  # do not extract vector graphics again
        add_boxes=rects,  # use converted line-like rectangles
    )
    for tab in tabs.tables:
        for cell in tab.cells:
            if cell:
                page.draw_rect(cell, color=(1, 0, 0))
doc.ez_save("test-out.pdf")

The output PDF demonstrates that all expected table cells are indeed detected.
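To try the filter condition without a PDF at hand, it can be isolated into a small helper. This is only a sketch: the names is_line_like and line_like_fills and the max_thickness parameter are made up for illustration, and the sample data merely imitates the "type" and "rect" keys of page.get_drawings() output, with plain tuples standing in for Rect objects.

```python
def is_line_like(rect, max_thickness=3.0):
    """Return True if the rectangle is thin enough to act as a grid line.

    `rect` is any 4-sequence (x0, y0, x1, y1); hypothetical helper.
    """
    x0, y0, x1, y1 = rect
    return abs(x1 - x0) <= max_thickness or abs(y1 - y0) <= max_thickness


def line_like_fills(paths, max_thickness=3.0):
    """Collect the rects of fill-only ("f") paths that look like lines."""
    return [
        p["rect"]
        for p in paths
        if p["type"] == "f" and is_line_like(p["rect"], max_thickness)
    ]


# Stand-in data imitating get_drawings() entries (assumed, not from the PDF).
sample_paths = [
    {"type": "f", "rect": (50, 100, 550, 100.8)},  # thin horizontal fill
    {"type": "f", "rect": (50, 100, 50.8, 700)},   # thin vertical fill
    {"type": "f", "rect": (50, 100, 550, 700)},    # large filled area
    {"type": "s", "rect": (0, 0, 200, 0.5)},       # stroked path, skipped
]

grid_rects = line_like_fills(sample_paths)
print(len(grid_rects), "line-like fill rectangles")
```

With a real page, the returned list would be what gets passed as add_boxes; the threshold of 3 points is taken from the script above and may need tuning for other documents.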

5 replies

JorjMcKie Jun 16, 2025
Maintainer

I am continuing to search for the root cause of the problem.

JorjMcKie Jun 17, 2025
Maintainer

I have developed a fix in the table module. This processes all tables in your example correctly.

1504168 Jun 17, 2025
Author

Thank you for this example code; it works perfectly. I was using version 1.25, and after upgrading to 1.26 it works. I would love to know more about how you analyze this kind of problem. For example, do you extract the drawing information and work out from it why detection does not behave as expected? I am asking so that in the future I can do the initial debugging myself.
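One possible starting point for that kind of initial debugging is to tally what the page's vector graphics consist of. This is a sketch, not the maintainer's documented workflow: summarize_drawings is a hypothetical helper, and the sample list only imitates the "type" and "items" keys found in page.get_drawings() output.

```python
from collections import Counter


def summarize_drawings(paths):
    """Count paths by (path type, kinds of drawing items they contain).

    `paths` is a list of dicts shaped like page.get_drawings() entries.
    """
    counts = Counter()
    for path in paths:
        # The first element of each item names its kind, e.g. "l" or "re".
        kinds = ",".join(sorted({item[0] for item in path.get("items", [])}))
        counts[(path.get("type"), kinds)] += 1
    return counts


# Stand-in data (assumed): two filled rectangles and one stroked line.
sample = [
    {"type": "f", "items": [("re", (0, 0, 100, 1))]},
    {"type": "f", "items": [("re", (0, 0, 1, 50))]},
    {"type": "s", "items": [("l", (0, 0), (10, 0))]},
]

summary = summarize_drawings(sample)
for key, n in summary.most_common():
    print(key, n)
```

On a real page one would call summarize_drawings(page.get_drawings()). A tally dominated by fill-only rectangles with few or no stroked lines would suggest that the visible grid is not drawn as lines at all, which is the situation described earlier in this thread.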

1504168 Jun 17, 2025
Author

Is this fix going to be added in the next version release? When can we use it?

JorjMcKie Jun 17, 2025
Maintainer

Yes, this will become part of the next release.
The change affects the file table.py exclusively and has no implications outside of it. If you want to hack your Python environment, you can already test with it today. Just drop me a note.
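For anyone who wants to see which file such a hack would touch, the sketch below looks for table.py inside the installed package directory. It assumes the module lives next to the package's __init__.py, which may not hold for every installation; find_table_module is a hypothetical helper, and it only reports a path without modifying anything.

```python
import importlib.util
from pathlib import Path


def find_table_module(package="pymupdf"):
    """Return the path of table.py inside `package`, or None if not found."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    # Assumption: table.py sits in the same directory as __init__.py.
    candidate = Path(spec.origin).parent / "table.py"
    return candidate if candidate.exists() else None


location = find_table_module()
print(location if location else "table.py not found in this environment")
```

Backing up the original file before replacing it makes it easy to revert once the official release is available.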
