get_drawings() can't detect line #926
-
Describe the bug (mandatory)
get_drawings() can't detect lines.

To Reproduce (mandatory)
Here is the code you wrote in a past issue for recognizing tables by their lines. After I updated PyMuPDF, I no longer get any lines, only rects.

    from itertools import groupby
    from pprint import pprint

    import fitz  # PyMuPDF


    def get_table_location(page: fitz.Page) -> list[fitz.Rect]:
        """
        Get the location of tables in page
        by finding horizontal lines with same length

        Parameters
        ----------
        page: page object of pdf

        Returns
        -------
        table_rects: rectangles that contain tables
        """
        # make a list of horizontal lines
        # each line is represented by y and length
        hor_lines = []
        paths = page.get_drawings()
        pprint(paths)  # debug output: shows what kind of items each path holds
        for p in paths:
            for item in p["items"]:
                if item[0] == "l":  # this is a line item
                    p1 = item[1]  # start point
                    p2 = item[2]  # stop point
                    if p1.y == p2.y:  # line horizontal?
                        hor_lines.append((p1.y, p2.x - p1.x))  # potential table delimiter

        # a table is assumed wherever at least 3 lines share the same length
        table_rects = []
        # sort by length so that groupby sees equal keys next to each other
        hor_lines.sort(key=lambda x: x[1])
        # get the top-left and bottom-right points of each table
        for k, g in groupby(hor_lines, key=lambda x: x[1]):
            g = list(g)
            if len(g) >= 3:  # a table always has at least 3 lines
                g.sort(key=lambda x: x[0])  # sort by y value
                top_left = fitz.Point(0, g[0][0])
                bottom_right = fitz.Point(page.rect.width, g[-1][0])
                table_rects.append(fitz.Rect(top_left, bottom_right))
        return table_rects

This is the sample file.

Expected behavior (optional)
Detect lines and detect tables.

Your configuration (mandatory)
Replies: 5 comments 5 replies
-
Please tell me which PyMuPDF version processed this file the way you expected.
-
In addition, looking at the second page for example:
-
If I draw the rectangles found on the second page in red, this picture comes out:
-
Unbelievable! They are rectangles.
-
For reasons which remain his secret, the PDF creator decided to draw thin horizontal bars, instead of lines.
You can easily find out yourself:
- If a path (an item in the list returned by page.get_drawings()) is a rectangle, then there is only one item in path["items"], which then looks like ("re", fitz.Rect(...)).
- Check path["items"].
- Check path["rect"], which is the rectangle enveloping the complete drawing that this path represents.
- Check path["rect"].height: path["rect"] is a fitz.Rect.
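The checks above could be turned into a small helper that treats thin horizontal bars like lines. This is only a sketch, not part of PyMuPDF: the name `horizontal_segments` and the `max_bar_height` threshold (in points) are assumptions chosen for illustration, and the function relies only on the item layout described in this thread, i.e. ("l", p1, p2) for lines and ("re", rect) as the first two elements of a rectangle item.

```python
def horizontal_segments(paths, max_bar_height=2.0):
    """Collect (y, length) pairs for horizontal lines and thin horizontal bars.

    `paths` is expected to be the list returned by page.get_drawings().
    `max_bar_height` is an assumed threshold in points, not a PyMuPDF setting.
    """
    segments = []
    for path in paths:
        for item in path["items"]:
            kind = item[0]
            if kind == "l":  # a real line: item[1] and item[2] are its end points
                p1, p2 = item[1], item[2]
                if p1.y == p2.y:  # horizontal?
                    segments.append((p1.y, abs(p2.x - p1.x)))
            elif kind == "re":  # a rectangle: item[1] is the rectangle itself
                r = item[1]
                height = abs(r.y1 - r.y0)
                width = abs(r.x1 - r.x0)
                if height <= max_bar_height and width > height:
                    # treat the thin bar as a line through its vertical centre
                    segments.append(((r.y0 + r.y1) / 2, width))
    return segments
```

With this in place, `hor_lines = horizontal_segments(page.get_drawings())` could replace the line-collecting loop in `get_table_location`. The threshold probably needs tuning per document, since the bar thickness depends on how the PDF creator drew the table rules.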