I don't know yet what is going wrong here, but at least I know how to get the right results. The following script looks for line-like "f"-type (fill-only) rectangles and passes them in again as helper bboxes (the add_boxes argument):

    import pymupdf

    doc = pymupdf.open("test.pdf")
    for page in doc:
        paths = page.get_drawings()
        rects = []  # collect f-type / line-like rectangles
        for p in paths:
            # a fill-only rectangle at most 3 points wide or high is really a line
            if p["type"] == "f" and (p["rect"].width <= 3 or p["rect"].height <= 3):
                rects.append(p["rect"])
        tabs = page.find_tables(
            strategy="lines_strict",  # ignore fill-only rect-like areas
            paths=paths,  # do not extract vector graphics again
            add_boxes=rects,  # use converted line-like rectangles
        )
        # outline every detected cell in red for visual verification
        for tab in tabs.tables:
            for cell in tab.cells:
                if cell:  # skip empty (None) cells
                    page.draw_rect(cell, color=(1, 0, 0))
    doc.ez_save("test-out.pdf")

The output PDF demonstrates that all expected table cells are indeed detected.
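If the 3-point threshold needs tuning for another file, the "line-like" test can be pulled out and tried on plain coordinates first. This is only a minimal sketch: the helper name `is_line_like` and its `max_thickness` parameter are made up for illustration and are not part of PyMuPDF.

```python
def is_line_like(x0, y0, x1, y1, max_thickness=3):
    """Return True if the rectangle is thin enough to count as a drawn line.

    Hypothetical helper: mirrors the width/height <= 3 check in the script above.
    """
    width = abs(x1 - x0)
    height = abs(y1 - y0)
    return width <= max_thickness or height <= max_thickness


# Illustrative coordinates only, not taken from the PDF in question.
horizontal_rule = (50, 100, 500, 101)   # 450 wide, 1 high
vertical_rule = (50, 100, 51.5, 700)    # 1.5 wide, 600 high
shaded_cell = (50, 100, 200, 120)       # 150 wide, 20 high

for name, rect in [("horizontal_rule", horizontal_rule),
                   ("vertical_rule", vertical_rule),
                   ("shaded_cell", shaded_cell)]:
    print(name, is_line_like(*rect))
```

With a real page, the same check would be applied to `p["rect"]` of each path returned by `page.get_drawings()`, exactly as in the script.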
SPARE-PART-LIST-Chef-Cream-130-Quick-i-400V_3ph_50Hz-NO-PRICES.pdf
I have this particular PDF file.
I want to extract data from page 26, which contains a table with grid lines. When I use the code below:
I get the output below:

The interesting lines are highlighted. Multiple rows are being snapped into one row. I tried the snap_y_tolerance parameter, but it has no effect on the result.
Another interesting page is page 10. It has a table, but neither the lines nor the lines_strict strategy finds it.
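A quick way to compare strategies on a single page might look like the sketch below. The helper `count_tables_by_strategy` is a made-up name, and including the "text" strategy is just an assumption about what could be worth trying; it is not a claim that it finds the table on page 10.

```python
def count_tables_by_strategy(page, strategies=("lines", "lines_strict", "text")):
    """Return {strategy: number of tables found} for one page.

    Hypothetical helper: `page` is expected to be a PyMuPDF page object.
    """
    counts = {}
    for strategy in strategies:
        finder = page.find_tables(strategy=strategy)
        counts[strategy] = len(finder.tables)
    return counts


# Usage (assumes PyMuPDF is installed and the PDF is in the working directory):
# import pymupdf
# doc = pymupdf.open("SPARE-PART-LIST-Chef-Cream-130-Quick-i-400V_3ph_50Hz-NO-PRICES.pdf")
# print(count_tables_by_strategy(doc[9]))  # page 10 has index 9
```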