Description
Description of the bug
Based on my research, the MediaBox defines the size of the PDF page, and the CropBox defines the rectangle of the page that PDF viewers display. A Pixmap renders the intersection of the CropBox and the clip. From my understanding and tests, the coordinates of text retrieved by functions such as Page.get_text() are relative to the CropBox and unrotated. (PyMuPDF should have better documentation regarding the coordinates.) I recently discovered that after calling Page.find_tables() on a rotated page whose CropBox is smaller than its MediaBox, the CropBox value changes, which causes the coordinates returned by subsequent Page.get_text() calls to change.
Can someone please take a look? Thank you so much!
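To make this kind of side effect easy to spot, a small helper along the following lines can report which page attributes change across a call. This is only a sketch: `changed_attributes` and `FakePage` are hypothetical names of mine, not PyMuPDF APIs, and `FakePage` merely mimics the values reported in this issue so the snippet runs without PyMuPDF or the test PDF.

```python
def changed_attributes(obj, action, names=("cropbox", "cropbox_position")):
    """Run action(obj) and return {name: (before, after)} for attributes that changed."""
    # Snapshot as tuples so later in-place mutation cannot alter the snapshot.
    before = {name: tuple(getattr(obj, name)) for name in names}
    action(obj)
    after = {name: tuple(getattr(obj, name)) for name in names}
    return {n: (before[n], after[n]) for n in names if before[n] != after[n]}


class FakePage:
    """Hypothetical stand-in for a page; NOT a PyMuPDF class."""

    def __init__(self):
        self.cropbox = (36.0, 36.0, 559.0, 805.9)
        self.cropbox_position = (36.0, 36.0)

    def find_tables(self):
        # Mimics the reported side effect: the CropBox is reset to the MediaBox.
        self.cropbox = (0.0, 0.0, 595.3, 841.9)
        self.cropbox_position = (0.0, 0.0)


demo = changed_attributes(FakePage(), lambda p: p.find_tables())
print(demo)
```

With a real page, the same check would presumably be `changed_attributes(page, lambda p: p.find_tables())`; an empty dict would mean the call left the CropBox alone.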
How to reproduce the bug
Please download the test pdf attached.
Install the following PyMuPDF version in a clean virtual environment:
python -m venv venv
pip install pymupdf=="1.25.3"
Run the following code
```python
import fitz
# Read the pdf file
pdf_document = fitz.open("test.pdf")
page = pdf_document.load_page(0)
print("Before: ", page.cropbox, page.cropbox_position, page.rotation_matrix)
print("Before: ", page.search_for("第七章"))
print("\n")
# Find the tables
tables = page.find_tables()
print("After: ", page.cropbox, page.cropbox_position, page.rotation_matrix)
print("After: ", page.search_for("第七章"))
```
The result will look something like this. The page's CropBox and its related properties change after running page.find_tables():
```
Before: Rect(36.0, 36.0, 559.0, 805.9000244140625) Point(36.0, 36.0) Matrix(0.0, 1.0, -1.0, 0.0, 769.9000244140625, 0.0)
Before: [Rect(194.8800048828125, 38.02104568481445, 237.0097198486328, 52.66114807128906)]
After: Rect(0.0, 0.0, 595.2999877929688, 841.9000244140625) Point(0.0, 0.0) Matrix(0.0, 1.0, -1.0, 0.0, 841.9000244140625, 0.0)
After: [Rect(230.8800048828125, 74.02104187011719, 273.00970458984375, 88.66114807128906)]
```
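Comparing the two search results, the hit moves by roughly 36 points on every coordinate, which matches the original `cropbox_position` of Point(36.0, 36.0). Here is a quick check of that arithmetic using only the numbers printed above (the variable names are just for illustration):

```python
# Rect coordinates copied from the output above, as (x0, y0, x1, y1).
before = (194.8800048828125, 38.02104568481445, 237.0097198486328, 52.66114807128906)
after = (230.8800048828125, 74.02104187011719, 273.00970458984375, 88.66114807128906)

# cropbox_position reported before find_tables() was called.
cropbox_origin = (36.0, 36.0)

# Per-coordinate shift between the two search results.
shift = tuple(a - b for a, b in zip(after, before))

# The origin applies to both corners of the rect: (x, y, x, y).
expected = cropbox_origin * 2
matches = all(abs(s - e) < 1e-3 for s, e in zip(shift, expected))
print(shift, matches)
```

This suggests that after find_tables() the coordinates are reported relative to the MediaBox rather than the CropBox, although I have not confirmed that in the PyMuPDF source.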
PyMuPDF version
1.25.3
Operating system
macOS
Python version
3.12