get_texttrace returned an incorrect character bbox #2533

Closed
little-polka-dot opened this issue Jul 11, 2023 · 8 comments

little-polka-dot commented Jul 11, 2023

Describe the bug (mandatory)

get_texttrace() returns an incorrect character bbox.

To Reproduce (mandatory)

盐城高新区投资集团有限公司2023年度第四期超短期融资券募集说明书.pdf

import fitz  # PyMuPDF

d = fitz.Document('temp/盐城高新区投资集团有限公司2023年度第四期超短期融资券募集说明书.pdf')
page = d[250]
# Each char is a tuple: (unicode, glyph id, origin, bbox).
for span in page.get_texttrace():
    for char in span['chars']:
        if chr(char[0]) == '民':
            print(char)

The above code will print this message:

(27665, 8775, (114.0, 634.72998046875), (114.0, 633.2604370117188, 124.44999694824219, 643.71044921875))

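The origin y (634.72998046875) is only about 1.47 points below the bbox y0 (633.2604370117188), although the bbox is about 10.45 points high. Most of the box therefore lies below the baseline, which is obviously not right.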
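
To make this observation concrete, here is a minimal sketch that measures where the baseline sits inside a character bbox. It assumes PyMuPDF's top-left coordinate system (y grows downward) and horizontal text, so for a normal glyph most of the box should lie above the origin. The helper name `baseline_fraction` and the 0.5 threshold are illustrative assumptions, not part of PyMuPDF.

```python
def baseline_fraction(origin, bbox):
    """Return the share of the bbox height that lies above the baseline.

    Assumes y grows downward (PyMuPDF page coordinates) and horizontal text.
    """
    _, origin_y = origin
    _, y0, _, y1 = bbox
    height = y1 - y0
    if height <= 0:
        raise ValueError("bbox has no positive height")
    return (origin_y - y0) / height


# Values reported in this issue for the character '民'.
char = (
    27665,
    8775,
    (114.0, 634.72998046875),
    (114.0, 633.2604370117188, 124.44999694824219, 643.71044921875),
)
fraction = baseline_fraction(char[2], char[3])
# Hypothetical threshold: a glyph box normally extends mostly above the baseline.
suspicious = fraction < 0.5
print(f"{fraction:.3f} of the bbox is above the baseline; suspicious={suspicious}")
```

For the reported tuple, only about 14% of the box height lies above the baseline, so the check flags it. The same helper could be applied to every entry of span['chars'] to look for other affected characters.
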

Screenshot of the incorrect character bbox (red).

Screenshot code:

clip = fitz.Rect(114.0, 633.2604370117188, 124.44999694824219, 643.71044921875)
page.get_pixmap(matrix=fitz.Matrix(2, 2), alpha=False, clip=clip).save('test.png')

Your configuration (mandatory)

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
win32

PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.7 on win32 (64-bit).

@little-polka-dot
Author

little-polka-dot commented Jul 13, 2023

Is this a bug or just a special case?

JorjMcKie added the bug label Jul 13, 2023
@JorjMcKie
Collaborator

It is a bug! I am working on this already.

@little-polka-dot
Author

😀 Haha, thank you for the answer.
Looking forward to your good news.

@JorjMcKie
Collaborator

Your PDF nevertheless IS a special case in the way it was made, which is why this bug was revealed.
Thank you!

JorjMcKie added a commit that referenced this issue Jul 14, 2023
In some situations, we were computing a wrong character bbox in the Page method get_texttrace() because we falsely assumed an up-down flip of the text.
This is being corrected here.
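
As a rough illustration of what a mistaken up-down flip does to a bbox, the sketch below mirrors the bbox reported in this issue about its baseline. This is not the code of the actual fix. The helper `mirror_about_baseline` is hypothetical, and it assumes horizontal text in PyMuPDF's top-left coordinates (y grows downward).

```python
def mirror_about_baseline(origin, bbox):
    """Reflect a bbox vertically about the horizontal line through the origin."""
    _, origin_y = origin
    x0, y0, x1, y1 = bbox
    # Reflection maps y to 2 * origin_y - y, so the top and bottom edges swap roles.
    return (x0, 2 * origin_y - y1, x1, 2 * origin_y - y0)


# Origin and bbox reported in this issue for the character '民'.
origin = (114.0, 634.72998046875)
reported = (114.0, 633.2604370117188, 124.44999694824219, 643.71044921875)
mirrored = mirror_about_baseline(origin, reported)

above = origin[1] - mirrored[1]  # part of the mirrored box above the baseline
below = mirrored[3] - origin[1]  # part of the mirrored box below the baseline
print(mirrored, above, below)
```

With the values from this issue, the mirrored box lies mostly above the baseline (roughly 9 points above and 1.5 points below), which is the shape one would expect for a glyph like '民'. That is consistent with the reported bbox being a vertically flipped version of the correct one.
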
@JorjMcKie
Collaborator

Just implemented a fix. Should roll out in one of the next versions.

@little-polka-dot
Author

Haha, nice😀

@JorjMcKie
Collaborator

We no longer support Python 3.7 because it was retired at the end of June.
So you need to upgrade, and I recommend doing this now. We will not generate wheels for Python 3.7 for the next version.

@little-polka-dot
Author

little-polka-dot commented Jul 14, 2023

I see, thanks
