White space BBOX is Wrong #823

Closed
mailsnathaniel opened this issue Jan 12, 2021 · 5 comments
@mailsnathaniel

mailsnathaniel commented Jan 12, 2021

Hi,

The bbox of white space spans is wrong. I have even used the ascender/descender values to get the actual ymin and ymax.

I have attached the input and output (span chunks are marked in red outline).

FYI - This input PDF was created using ABBYY OCR.

Configurations:

  • Ubuntu
  • Python3.6
  • PyMuPDF 1.18.6

Thanks
spaces_bbox
cheesecake-20191221_003.pdf

@JorjMcKie
Collaborator

Found the reason:
As you know, the fontsize of a span plays a pivotal role when creating the small span bboxes. MuPDF returns large fontsize values for most of the spans that have spaces in their text. See, for example, the very large bbox after "Sub Total:" here:

      {
       "size":10.201054573059082,
       "flags":0,
       "font":"ArialMT",
       "color":0,
       "ascender":1.0750000476837158,
       "descender":-0.29899999499320984,
       "text":"Total:",
       "origin":[
        129.0,
        228.2010040283203
       ],
       "bbox":[
        129.0,
        217.96078491210938,
        156.18267822265625,
        231.04920959472656
       ]
      },
      {
       "size":29.90174102783203,  # <===  look at this! Compare to bbox height!
       "flags":0,
       "font":"ArialMT",
       "color":0,
       "ascender":1.0750000476837158,
       "descender":-0.29899999499320984,
       "text":" ",
       "origin":[
        156.0,
        228.2010040283203
       ],
       "bbox":[  # but the bbox height is just over 13, less than 50% of fontsize!
        156.0,
        217.98611450195312,
        182.1584930419922,
        231.0421600341797
       ]
      },

Whatever the reason for this may be: all you can do is reject the "reduced" span bbox if its height is not smaller than that of the original.
I will adjust the recipe in the documentation accordingly, and also the code behind fitz.TOOLS.set_small_bbox_heights().
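
The rejection rule described above can be sketched as follows. This is a minimal illustration, not the code shipped in PyMuPDF: the helper name `reduced_bbox` is hypothetical, and it assumes the reduced bbox gets a height equal to the span's fontsize, placed around the baseline in proportion to ascender and descender. The two span dictionaries are trimmed copies of the ones shown above.

```python
def reduced_bbox(span):
    """Return a fontsize-based bbox, or the original one if that is not smaller.

    Hypothetical helper: assumes the reduced height equals span["size"].
    """
    x0, y0, x1, y1 = span["bbox"]
    asc, desc, size = span["ascender"], span["descender"], span["size"]
    baseline_y = span["origin"][1]

    # Reject the reduction if it would not make the bbox any smaller.
    if size >= y1 - y0:
        return (x0, y0, x1, y1)

    # Place the part below the baseline in proportion to the descender.
    new_y1 = baseline_y - size * desc / (asc - desc)
    new_y0 = new_y1 - size
    return (x0, new_y0, x1, new_y1)


total = {
    "size": 10.201054573059082,
    "ascender": 1.0750000476837158,
    "descender": -0.29899999499320984,
    "origin": [129.0, 228.2010040283203],
    "bbox": [129.0, 217.96078491210938, 156.18267822265625, 231.04920959472656],
}
space = {
    "size": 29.90174102783203,  # larger than the bbox height of about 13
    "ascender": 1.0750000476837158,
    "descender": -0.29899999499320984,
    "origin": [156.0, 228.2010040283203],
    "bbox": [156.0, 217.98611450195312, 182.1584930419922, 231.0421600341797],
}

print(reduced_bbox(total))  # height shrinks to the fontsize
print(reduced_bbox(space))  # original bbox is kept
```

With these inputs, the "Total:" span gets a shorter bbox, while the space span keeps its original bbox because its reported fontsize exceeds the bbox height.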

@JorjMcKie
Collaborator

Once this is done, the result will look like this, and that's the end of it:
[image]

@JorjMcKie
Collaborator

Actually, this ridiculous fontsize of 29.9 is contained in the PDF itself, so ABBYY is to blame, not (Py-) MuPDF.

One could even argue that coming up with a reasonable bbox under these circumstances is a pretty good job!

JorjMcKie added the enhancement label and removed the bug label Jan 12, 2021
@JorjMcKie
Collaborator

Fixed by v1.18.7, currently being uploaded.

@mailsnathaniel
Author

mailsnathaniel commented Feb 5, 2021 via email
