Invalid size of TextPage and bbox with newest version 1.21.0 #2048

jn-chrn · 2022-11-15T07:51:24Z

Describe the bug

Reading text from PDF files with textpage.extractDICT() returns invalid dimensions in version 1.21.0.

To Reproduce

To reproduce, please use this piece of code which:

opens the attached PDF
gets a TextPage from the only page of the document
computes the size of the page for comparison
gets the width and height of the TextPage
- the size of the TextPage is clearly invalid
gets the bbox of the first span inside the first line of the first block
- the bbox dimensions are clearly invalid

import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# The file's size is about 47.4 x 14.0
assert abs(page_rect.width - 47.4) < 0.1
assert abs(page_rect.height - 14.0) < 0.1

# WRONG HERE ALREADY:
# The returned size of the page is '4294967168.0 x 4294967168.0'
assert abs(texts_as_dict["width"] - 47.4) < 0.1
assert abs(texts_as_dict["height"] - 14.0) < 0.1

first_span = texts_as_dict["blocks"][0]["lines"][0]["spans"][0]
bbox = first_span["bbox"]

# The bbox returned with version 1.19.6 is:
# '(29.58..., 2.87..., 35.07..., 10.60...)'
assert bbox[2] < 50  # ERROR: returned value '1044369984.0'
assert bbox[3] < 50  # ERROR: returned value '13269935104.0'

Attached PDF: crop.pdf

Expected behavior

With PyMuPDF version 1.19.6, the extracted bbox was small and fit within the page. With the newest version, it is far too large (by a factor of about 1e8).

Your configuration

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.6 (main, Oct  7 2022, 20:19:58) [GCC 11.2.0] 
 linux 
 
PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on linux (64-bit).

PyMuPDF was installed using pip install pymupdf.

julian-smith-artifex-com · 2022-11-15T13:08:37Z

Thanks for this report and the reproducer.

I've just pushed a change so that get_textpage() (and therefore extractDICT()) defaults to setting the rect to the page's rect, unless a clip rect is explicitly passed in.

This fixes the failure of your test programme, and will be in the next release.

(Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.)
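For context, the reported width and height of 4294967168.0 are what one would expect if the text page's rectangle had been left "infinite" rather than set to the page rect. The following is a minimal sketch of that arithmetic, assuming MuPDF's infinite-rectangle sentinels are -0x80000000 and 0x7fffff80. These constants are not stated in this thread and are an assumption; the helper name is hypothetical.

```python
# Assumed sentinel coordinates of MuPDF's "infinite" rectangle.
# These values are an assumption, not taken from this thread.
FZ_MIN_INF_RECT = -0x80000000  # -2147483648
FZ_MAX_INF_RECT = 0x7FFFFF80   # 2147483520


def infinite_rect_extent() -> float:
    """Width (and height) of a rectangle spanning both sentinels."""
    return float(FZ_MAX_INF_RECT - FZ_MIN_INF_RECT)


print(infinite_rect_extent())  # 4294967168.0
```

Under that assumption, the value matches the '4294967168.0 x 4294967168.0' size quoted in the report.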

jn-chrn · 2022-11-15T13:20:49Z

Thank you for the fast fix!

Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.

I tried again and did not get this issue: there are 3 elements in the list of spans when I try locally. The bboxes of all these spans are also very large.

jn-chrn · 2022-11-15T14:57:50Z

Just to make it clear again, there are two issues:

at the top level of the dictionary of extracted text (with text_page.extractDICT()), the width and height are invalid
at the level of "span" elements, the bbox is invalid on some PDF files we have, and is invalid on the first span in the attached file

JorjMcKie · 2022-11-16T15:13:00Z

@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:

the MediaBox does not start at (0,0) but at (1063.9544, 1001.37216). The CropBox is identical to the MediaBox.
the relevant fonts are Type3 with invalid font bboxes, fitz.Rect(0,0,0,0). And the critical values for character geometry computations, font.ascender / font.descender are unusable, namely equal to the max. C float value - which is the direct reason for computing infinite bboxes.

PyMuPDF's get_text("dict",...) method computes span / line / block boundary boxes as the rectangle unions of the single characters contained therein (which is inevitable for technical reasons). So this explains those infinite rectangles.
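A minimal sketch of why a single bad character box is enough to spoil a whole span, assuming bboxes are plain (x0, y0, x1, y1) tuples. The function name and the sample coordinates are made up for illustration and are not PyMuPDF's internal code.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def union_bbox(boxes: Iterable[BBox]) -> BBox:
    """Smallest rectangle that contains every box in `boxes`."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hypothetical character boxes: two sane ones and one with a huge right edge.
chars = [
    (29.6, 2.9, 31.0, 10.6),
    (31.0, 2.9, 33.0, 10.6),
    (33.0, 2.9, 1e9, 10.6),
]
print(union_bbox(chars))      # the huge x1 propagates to the span
print(union_bbox(chars[:2]))  # without the bad glyph the span is sane
```

Because the union takes the extreme coordinate on each side, one out-of-range glyph determines the bbox of its span, and then of the enclosing line and block.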

The PyMuPDF-specific logic to validate character bboxes can be switched off via fitz.TOOLS.unset_quad_corrections(True) in which case the original MuPDF computations will prevail.
In this case, this remedy won't work either: The bboxes are no longer infinite, but still crazy enough.

Anyway, calling get_text(<any-option>, clip=page.rect) will deliver no text at all.

JorjMcKie · 2022-11-17T15:14:53Z

@jn-chrn - just encountered a spot in the code, where character bbox calculation will go wrong if font ascender / descender take on max C float values - which is the case here.
I am making progress and will be right back once the situation is clarified.

JorjMcKie · 2022-11-18T16:04:48Z

As mentioned before, it's the fault of those peculiar Type3 fonts. Because they deliver nonsense values for data that are required for bbox computation, some ersatz assumptions must be made. The best result I have achieved so far looks like this for your case:

The block/line/span bbox (black border) has these values (the blue boxes are single characters):

'bbox': (22.474653244018555,
           3.4806418418884277,
           34.903072357177734,
           8.929698944091797),

To achieve this, the script must use fitz.TOOLS.set_small_glyph_heights(True) to enforce corrective bbox / character quad computations ...

jn-chrn · 2022-11-21T16:10:46Z

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them.

The following code returns, for the bboxes with a width higher than 10^6:

a count of 309 bboxes with version 1.21.0
a count of 0 bboxes with version 1.19.6

import fitz

document: fitz.Document = fitz.open(
    "crop.pdf"
)
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1

print(counter)

So this small PDF has many very large bboxes with the latest version, but none with the older version. This issue only started to occur after 1.19.6.
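The same check can be sketched without a PDF at hand, assuming a dictionary shaped like the output of extractDICT(). This is a simplification: it measures the span bbox directly instead of calling fitz.recover_quad(), and the helper name and sample data are hypothetical.

```python
from typing import Any, Dict, List


def oversized_spans(text_dict: Dict[str, Any], limit: float = 1e6) -> List[Dict[str, Any]]:
    """Return spans whose bbox is wider or taller than `limit`."""
    found = []
    for block in text_dict.get("blocks", []):
        # Blocks without a "lines" key (e.g. image blocks) are skipped.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                x0, y0, x1, y1 = span["bbox"]
                if (x1 - x0) > limit or (y1 - y0) > limit:
                    found.append(span)
    return found


# Made-up data in the shape of extractDICT() output.
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"text": "a", "bbox": (29.6, 2.9, 35.1, 10.6)},
            {"text": "b", "bbox": (29.6, 2.9, 1e9, 10.6)},
        ]}]},
    ]
}
print([s["text"] for s in oversized_spans(sample)])  # ['b']
```

Returning the offending spans rather than a bare count makes it easier to see which fonts or text runs are affected.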

JorjMcKie · 2022-11-21T17:22:15Z

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

Don't take my comment personal 😉.
You are right, that page obviously is being "cut out" from a much larger one.
There is some problem within the code creating the TextPage (in MuPDF). In the most current version, the Type3 font is no longer interpreted correctly.
This leads to those crazy large bboxes and character widths. I have developed corrective code in PyMuPDF, which delivers reasonable results, when following this coding pattern:

import fitz
import sys

vsn = f"-{sys.version_info[0]}-{sys.version_info[1]}"

# following ensures using PyMuPDF corrections:
fitz.TOOLS.set_small_glyph_heights(True)

doc = fitz.open("crop.pdf")
page = doc[0]
page.clean_contents()  # make sure page.draw_rect() lands in right place

blocks = page.get_text(
    "dict",
    clip=page.rect,  # only look at visible page
    flags=fitz.TEXTFLAGS_TEXT,  # only look at text
)["blocks"]
for b in blocks:
    page.draw_rect(b["bbox"], width=0.2, color=fitz.pdfcolor["green"])
    for l in b["lines"]:
        for s in l["spans"]:
            print(s["text"])
doc.ez_save(f"zdict{vsn}.pdf")

Output:

py testdict.py
km

1.6

Internally, I also had to change the rule that decides whether a character is regarded as inside the "clip": from "bbox is completely inside clip" to "character origin is inside clip".
Here, "origin" is the bottom left point of a character (glyph), where drawing of it starts.
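The difference between the two rules can be sketched as follows, assuming rectangles are (x0, y0, x1, y1) tuples and the origin is an (x, y) point. The function names and coordinates are hypothetical and do not reproduce PyMuPDF's actual implementation.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]               # (x, y)


def bbox_inside_clip(bbox: Rect, clip: Rect) -> bool:
    """Old rule: the whole character bbox must lie within the clip."""
    return (clip[0] <= bbox[0] and clip[1] <= bbox[1]
            and bbox[2] <= clip[2] and bbox[3] <= clip[3])


def origin_inside_clip(origin: Point, clip: Rect) -> bool:
    """New rule: only the character's origin must lie within the clip."""
    x, y = origin
    return clip[0] <= x <= clip[2] and clip[1] <= y <= clip[3]


clip = (0.0, 0.0, 47.4, 14.0)          # roughly the page size from the report
bad_bbox = (29.6, -1e9, 35.0, 1e9)     # made-up bbox with absurd height
origin = (29.6, 9.0)                   # made-up origin on the page

print(bbox_inside_clip(bad_bbox, clip))    # False
print(origin_inside_clip(origin, clip))    # True
```

With the bbox rule, a glyph whose box is inflated by bad font metrics is dropped even though it is drawn on the page, which is consistent with clip=page.rect returning no text; the origin rule does not depend on the font's ascender or descender.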

JorjMcKie · 2022-11-22T15:27:42Z

I have submitted a related bug in MuPDF's issue system.

jn-chrn · 2022-11-24T08:59:43Z

Thanks for the insight, and the fast answer (as always)!

Don't take my comment personal

(I had to defend my poor little stupidly made PDF 😄 )

julian-smith-artifex-com · 2022-12-13T14:33:12Z

Fixed in PyMuPDF-1.21.1.

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Nov 15, 2022

fitz/fitz.i: pymupdf#2048: in get_textpage(), use page rect if clip n…

b1aced4

…ot specified.

julian-smith-artifex-com added a commit that referenced this issue Nov 15, 2022

fitz/fitz.i: #2048: in get_textpage(), use page rect if clip not spec…

824be2e

…ified.

jn-chrn closed this as completed Nov 15, 2022

jn-chrn reopened this Nov 15, 2022

JorjMcKie added the upstream bug bug outside this package label Nov 20, 2022

JorjMcKie added the Fixed in next release label Dec 7, 2022

julian-smith-artifex-com removed the Fixed in next release label Dec 13, 2022

julian-smith-artifex-com closed this as completed Dec 13, 2022

julian-smith-artifex-com mentioned this issue Feb 20, 2023

Bug - can not extract data from file in the newest version 1.21.1 #2238

Closed
