Empty "graphics" attribute in to_markdown() function #241

@tuv-jan-hery

Problem Summary: I am using pymupdf4llm.to_markdown() to extract content from a PDF, but the "graphics" attribute in the metadata remains empty even though the page contains a vector graphic.

According to the API documentation, the attribute is described as follows: "graphics" - a list of vector graphics rectangles on the page. This is a list of boundary boxes of clustered vector graphics as delivered by method Page.cluster_drawings(). When I call this PyMuPDF method directly (see code below), I can retrieve the expected vector graphic.

Expected Behavior: The "graphics" attribute in the Markdown metadata should list the detected vector graphics, consistent with what Page.cluster_drawings() returns.

Actual Behavior: Despite the presence of vector graphics, the "graphics" field remains empty in the Markdown output.

Versions:

  • PyMuPDF: 1.25.5
  • pymupdf4llm: 0.0.19

Code:

import pymupdf
import pymupdf4llm

file_path = "../assets/TestDoc.pdf"

doc = pymupdf.open(file_path)  # Explicitly open with PyMuPDF
page = doc[1]
vector_dic_paths = page.get_drawings()
clusters = page.cluster_drawings(drawings=vector_dic_paths, x_tolerance=3, y_tolerance=3)
print(f"Manually found {len(clusters)} clustered vector graphics.")  # Finds one cluster on the second page of the PDF.

md_text = pymupdf4llm.to_markdown(
    doc,
    write_images=True,
    image_path="./extracted_images",
    page_chunks=True,
    force_text=True,
    extract_words=False
)
print(md_text)  # "graphics" is empty for the second page of the PDF.
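
To make the mismatch easier to spot per page, the comparison could be sketched roughly as below. This is only an illustration, not part of either library: `pages_missing_graphics` is a hypothetical helper, and it assumes that with page_chunks=True each page chunk is a dict carrying a "graphics" list, as the API documentation describes. The sample data is a stand-in, not real output from TestDoc.pdf.

```python
from typing import Any, Sequence


def pages_missing_graphics(
    chunks: Sequence[dict[str, Any]],
    cluster_counts: Sequence[int],
) -> list[int]:
    """Return 0-based page indexes where clusters exist but "graphics" is empty.

    Hypothetical helper: `chunks` is assumed to be the per-page list returned
    with page_chunks=True, and `cluster_counts` the number of clusters found
    manually on each page.
    """
    missing = []
    for index, (chunk, expected) in enumerate(zip(chunks, cluster_counts)):
        reported = chunk.get("graphics") or []  # treat a missing key as empty
        if expected > 0 and not reported:
            missing.append(index)
    return missing


# Stand-in data shaped like the page chunks (assumed structure, not real output).
sample_chunks = [
    {"text": "page one", "graphics": []},
    {"text": "page two", "graphics": []},
]
sample_cluster_counts = [0, 1]  # one manually found cluster on the second page

print(pages_missing_graphics(sample_chunks, sample_cluster_counts))
```

With the real document, `cluster_counts` would presumably be built as `[len(page.cluster_drawings()) for page in doc]` and `chunks` would be the list returned by to_markdown(); a non-empty result would point at the pages affected by this issue.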

TestDoc.pdf
