Description
Problem Summary: I am using pymupdf4llm.to_markdown() to extract content from a PDF, but I noticed that the "graphics" attribute in the metadata remains empty, even though a vector graphic is expected.
According to the API documentation, the attribute is described as follows: "graphics" - a list of vector graphics rectangles on the page. This is a list of boundary boxes of clustered vector graphics as delivered by method Page.cluster_drawings(). When I call this PyMuPDF method directly (see code below), I am able to retrieve the expected vector graphic.
Expected Behavior: The "graphics" attribute in the Markdown metadata should list the detected vector graphics, consistent with what Page.cluster_drawings() returns.
Actual Behavior: Despite the presence of vector graphics, the "graphics" field remains empty in the Markdown output.
Versions:
- PyMuPDF: 1.25.5
- pymupdf4llm: 0.0.19
Code:
import pymupdf
import pymupdf4llm
file_path = "../assets/TestDoc.pdf"
doc = pymupdf.open(file_path) # Explicitly open with PyMuPDF
page = doc[1]
vector_dic_paths = page.get_drawings()
clusters = page.cluster_drawings(drawings=vector_dic_paths, x_tolerance=3, y_tolerance=3)
print(f"Manually found {len(clusters)} clustered vector graphics.") # Finding one clustered vector on second page of PDF.
md_text = pymupdf4llm.to_markdown(
    doc,
    write_images=True,
    image_path="./extracted_images",
    page_chunks=True,
    force_text=True,
    extract_words=False,
)
print(md_text) # Empty "graphics" attribute for the second page of the PDF.
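Rather than reading the full printed output, the affected pages can be listed with a small helper. The following is only a sketch: it assumes that with page_chunks=True each chunk is a dict with a "graphics" key and a "metadata" dict holding a 1-based "page" number. The helper name pages_with_empty_graphics and the sample chunks are hypothetical and stand in for the real return value of to_markdown().

```python
def pages_with_empty_graphics(chunks):
    """Return the page numbers of chunks whose "graphics" list is empty or missing."""
    empty_pages = []
    for index, chunk in enumerate(chunks, start=1):
        # Assumption: metadata["page"] is the 1-based page number;
        # fall back to the chunk's position if the key is absent.
        page_no = chunk.get("metadata", {}).get("page", index)
        if not chunk.get("graphics"):
            empty_pages.append(page_no)
    return empty_pages


# Hypothetical sample data shaped like the assumed page-chunk output.
sample_chunks = [
    {"metadata": {"page": 1}, "graphics": [{"bbox": (10, 10, 100, 100)}], "text": "..."},
    {"metadata": {"page": 2}, "graphics": [], "text": "..."},
]

print(pages_with_empty_graphics(sample_chunks))  # prints [2]
```

With the script above, calling pages_with_empty_graphics(md_text) would report which pages have an empty "graphics" field, which can then be compared against the clusters found manually for the same page.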