Bug - can not extract data from file in the newest version 1.21.1 #2238
Comments
Thanks for the clear report; I have reproduced the issue. I suspect it's a change in MuPDF rather than PyMuPDF itself; I will see what the MuPDF people think next week.
It looks like this is not caused by a change in MuPDF after all. Instead it's caused by PyMuPDF's fix for #2048, which defaults to clipping to the page mediabox. Unfortunately PyMuPDF's text clipping only includes glyphs whose bounding boxes are entirely contained in the clip rect. Even though the […] A workaround is to specify an infinite cliprect when calling […]
[In the next release we might look into supporting 'overlap' semantics as well as, or instead of, the current 'contained' semantics.]
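To make the 'contained' behaviour described above concrete, here is a minimal pure-Python sketch. It is not PyMuPDF's code: the tuple layout `(x0, y0, x1, y1)`, the helper names `rect_contains` and `clip_chars_contained`, and the sample glyph boxes are all assumptions made for illustration.

```python
import math

def rect_contains(outer, inner):
    """True if `inner` lies entirely within `outer`; rects are (x0, y0, x1, y1)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1

def clip_chars_contained(chars, clip):
    """Keep only characters whose bounding box is fully inside `clip`."""
    return "".join(ch for ch, bbox in chars if rect_contains(clip, bbox))

# Hypothetical page: an A4-sized mediabox and two glyphs, one of which
# pokes 2 units beyond the top edge of the page.
mediabox = (0, 0, 595, 842)
chars = [
    ("A", (10, -2, 20, 10)),  # bounding box partly outside the mediabox
    ("B", (20, 0, 30, 10)),   # bounding box fully inside the mediabox
]

# With 'contained' semantics the partly-outside glyph is silently dropped.
clipped = clip_chars_contained(chars, mediabox)

# An unbounded clip rectangle keeps everything, which is the idea behind
# the infinite-cliprect workaround.
infinite = (-math.inf, -math.inf, math.inf, math.inf)
unclipped = clip_chars_contained(chars, infinite)
```

Under these assumptions `clipped` holds only the glyph that is fully inside the page, while `unclipped` holds both, which matches the symptom reported in this issue.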
Your workaround works, thanks 👍.
It would be nice not to need the workaround, but at least it works now. I would also be happy for you to use the provided file in your test suite.
…xtracting text. Also fixed Story.draw() to handle exceptions e.g. from fz_draw_story().
…xtracting text. We now include chars that overlap with the clipbox, instead of only those that are entirely contained within the clipbox.
My tree now uses 'overlap' semantics rather than 'contains', which fixes the problem. [But I haven't yet pushed to GitHub.] Thanks for the offer to use your file in the test suite; I've done so in my tree.
…xtracting text. We now include chars that overlap with the clipbox, instead of only those that are entirely contained within the clipbox. Note that new fn JM_rects_overlap() still returns true if one of the rects is empty. This allows things to work with ligatures, where component glyphs can have zero width.
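The 'overlap' test described in this commit message can be sketched in plain Python as follows. This is a literal reading of the message, not the actual C implementation of `JM_rects_overlap()`: the names `rect_is_empty` and `rects_overlap`, the tuple layout `(x0, y0, x1, y1)`, and the sample rectangles are assumptions.

```python
def rect_is_empty(r):
    """A rect (x0, y0, x1, y1) is empty if it has no width or no height."""
    x0, y0, x1, y1 = r
    return x1 <= x0 or y1 <= y0

def rects_overlap(a, b):
    """Overlap test that treats an empty rect as overlapping.

    Returning True for empty rects mirrors the behaviour the commit message
    describes, so that zero-width ligature components are not dropped.
    """
    if rect_is_empty(a) or rect_is_empty(b):
        return True
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Two non-empty rects overlap when they intersect on both axes.
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

clipbox = (0, 0, 595, 842)

partly_outside = (10, -2, 20, 10)      # crosses the top edge of the clipbox
fully_outside = (600, 0, 610, 10)      # entirely to the right of the clipbox
zero_width_glyph = (15, 0, 15, 10)     # e.g. a component of a ligature

results = {
    "partly_outside": rects_overlap(partly_outside, clipbox),
    "fully_outside": rects_overlap(fully_outside, clipbox),
    "zero_width_glyph": rects_overlap(zero_width_glyph, clipbox),
}
```

In this sketch a glyph that merely crosses the clipbox edge is kept, a glyph entirely outside is dropped, and a zero-width glyph is kept because of the empty-rect rule.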
Bug description
Since version 1.21.1, I have had a problem extracting data from files that have some content before the header or after EOF. In version 1.21.0 and all older versions there was no problem. Firefox, for example, has no issue opening the file.
To Reproduce
test.py
PDF to download
Example file: test.pdf
How to run:
To reproduce this problem, run the program above with the file test.pdf.
My configuration