Invalid size of TextPage and bbox with newest version 1.21.0 #2048
Comments
Thanks for this report and the reproducer. I've just pushed a change so that This fixes the failure of your test programme, and will be in the next release. (Note that your test programme fails later on because
Thank you for the fast fix!
I tried again and did not get this issue: there are 3 elements in the list of spans when I try locally. However, the bboxes of all these spans are also very large.
Just to make it clear again, there are two issues:
@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:
PyMuPDF's The PyMuPDF-specific logic to validate character bboxes can be switched off via Anyway, if doing
@jn-chrn - just encountered a spot in the code where character bbox calculation goes wrong if the font ascender / descender take on max C float values, which is the case here.
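To illustrate why extreme ascender / descender values blow up a character bbox, here is a minimal sketch. It is not the actual MuPDF or PyMuPDF code: the height formula is a simplification, and `sane_metrics`, its `limit`, and its `fallback` values are hypothetical choices made only for this example.

```python
import math

# Largest finite value of a 32-bit C float (FLT_MAX).
FLT_MAX = 3.4028234663852886e38


def glyph_height(fontsize: float, ascender: float, descender: float) -> float:
    """Simplified glyph height: font size scaled by the ascender-descender span."""
    return fontsize * (ascender - descender)


def sane_metrics(ascender, descender, limit=10.0, fallback=(0.8, -0.2)):
    """Return (ascender, descender), or a fallback pair if either looks bogus.

    `limit` and `fallback` are assumptions for this sketch, not library values.
    """
    values = (ascender, descender)
    if any(not math.isfinite(v) or abs(v) > limit for v in values):
        return fallback
    return values


# A font reporting FLT_MAX metrics yields an absurdly tall glyph box ...
broken = glyph_height(12, FLT_MAX, -FLT_MAX)
# ... while clamped metrics give a height close to the font size.
fixed = glyph_height(12, *sane_metrics(FLT_MAX, -FLT_MAX))
print(broken > 1e38, fixed)
```

Falling back to fixed metrics is only one possible policy; a real fix might instead derive the values from the font's bounding box.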
Regarding the PDF file itself being unusual, it was created from a much larger file using An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them. The following code returns, for the bboxes with a width higher than
import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]  # the only page of the document
page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# Count the spans whose recovered quad is implausibly wide.
counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1
print(counter)

So this small PDF has many very large bboxes with the latest version, but none with the older version. The issue only started to occur after 1.19.6.
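The same check can be sketched as a reusable function that works on the dictionary alone, which makes it easy to try without the PDF. This is an assumption-laden variant, not the reporter's code: `oversized_spans` is a hypothetical helper, it measures the span `bbox` width rather than the recovered quad (so it ignores text rotation), and the sample data is made up in the shape of the `extractDICT()` output.

```python
from typing import Iterator


def oversized_spans(text_dict: dict, max_width: float = 1e6) -> Iterator[dict]:
    """Yield every span whose bbox is wider than max_width."""
    for block in text_dict.get("blocks", []):
        # Blocks without text have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                x0, _y0, x1, _y1 = span["bbox"]
                if abs(x1 - x0) > max_width:
                    yield span


# Hand-made sample mimicking the dict layout; all values are invented.
sample = {
    "blocks": [
        {
            "lines": [
                {
                    "dir": (1.0, 0.0),
                    "spans": [
                        {"text": "ok", "bbox": (10.0, 10.0, 40.0, 22.0)},
                        {"text": "huge", "bbox": (10.0, 30.0, 3.0e8, 42.0)},
                    ],
                }
            ]
        },
        {"type": 1},  # a block without "lines"
    ]
}

bad = list(oversized_spans(sample))
print(len(bad))
```

With a real document, the output of `text_page.extractDICT()` could be passed in place of `sample`.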
I have submitted a related bug in MuPDF's issue system.
Thanks for the insight, and the fast answer (as always)!
(I had to defend my poor little stupidly made PDF 😄 )
Fixed in PyMuPDF-1.21.1.
Describe the bug
Reading some text from PDF files using textpage.extractDICT() returns invalid dimensions with version 1.21.0.
To Reproduce
To reproduce, please use this piece of code which:
TextPage from the only page of the document
TextPage
Attached PDF: crop.pdf
Expected behavior
With PyMuPDF version 1.19.6, the extracted bboxes were very small. With the newest version (1.21.0), they became far too large, by a factor of 1e8.
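One way to express the expected behavior as a check is to test whether a bbox stays inside the page rectangle. This is only a heuristic sketch: `bbox_within_page` and its `tolerance` are hypothetical, rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples, and text in a cropped PDF can legitimately extend slightly past the page edge.

```python
def bbox_within_page(bbox, page_rect, tolerance: float = 1.0) -> bool:
    """Return True if bbox lies inside page_rect, allowing a small tolerance."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    return (
        x0 >= px0 - tolerance
        and y0 >= py0 - tolerance
        and x1 <= px1 + tolerance
        and y1 <= py1 + tolerance
    )


page_rect = (0.0, 0.0, 595.0, 842.0)  # made-up A4-sized page in points
print(bbox_within_page((10.0, 10.0, 40.0, 22.0), page_rect))
print(bbox_within_page((10.0, 30.0, 3.0e8, 42.0), page_rect))
```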
Your configuration
PyMuPDF was installed using pip install pymupdf.