page.get_text() returns hexadecimal text for some characters #3197
Comments
Thanks for the report. It looks like PyMuPDF built with the latest MuPDF master branch does not include these control characters in the text, so this looks like a MuPDF issue. I'll ask the MuPDF people what has changed on MuPDF master relative to PyMuPDF's default MuPDF-1.23.10.
MuPDF master has support for ActualText, which fixes this problem. We expect MuPDF to move to the new 1.24.x release branch in the next few weeks, which will include ActualText support, so the problem will be fixed in PyMuPDF shortly afterwards.
This is test for #3197. Fixed in MuPDF 1.24.
Fixed in 1.24.0.
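To check whether an environment already has the fix, one option is to compare the installed version against 1.24.0. The sketch below assumes that "1.24.0" refers to the PyMuPDF release and that the package is installed under the distribution name "PyMuPDF". The helper names `parse_version`, `has_fix`, and `installed_pymupdf_has_fix` are made up for illustration and are not part of PyMuPDF.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.23.25' into (1, 23, 25); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version_text, fixed_in="1.24.0"):
    """True if version_text is at or above the release named in this thread."""
    return parse_version(version_text) >= parse_version(fixed_in)


def installed_pymupdf_has_fix():
    """Check the installed package; returns None if PyMuPDF is not installed.

    Assumption: the distribution is registered as 'PyMuPDF'.
    """
    try:
        return has_fix(version("PyMuPDF"))
    except PackageNotFoundError:
        return None
```

With the version from the report, `has_fix("1.23.25")` is false, so upgrading would be the first thing to try.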
Description of the bug
get_text() extracts numbers in the Cash Flow table in this document as hexadecimal characters. Copy/paste from the page and pdftotext extract the correct text.
How to reproduce the bug
Ford Motor Company (F) Cash Flow - Yahoo Finance - Yahoo Finance.pdf
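A minimal way to spot the problem in extracted text might look like the sketch below. It assumes the unexpected output consists of control characters, as the maintainer's comment suggests. `find_control_chars` and `report_page` are hypothetical helper names, and the page number is a placeholder because the report does not say which page holds the table.

```python
import unicodedata


def find_control_chars(text):
    """Return (index, repr) pairs for control characters other than common whitespace."""
    allowed = {"\n", "\r", "\t"}
    return [
        (i, repr(ch))
        for i, ch in enumerate(text)
        if ch not in allowed and unicodedata.category(ch) == "Cc"
    ]


def report_page(pdf_path, page_number=0):
    """Extract one page with PyMuPDF and list suspicious characters.

    PyMuPDF is imported here so that find_control_chars works without it.
    page_number=0 is an assumption; adjust it to the page with the table.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        text = doc[page_number].get_text()
    return find_control_chars(text)
```

Calling `report_page` on the attached PDF should return an empty list once the numbers are extracted as normal text.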
Expected behavior (optional)
I expect the numbers in the table to be returned as normal text, similar to pdftotext.
PyMuPDF version
1.23.25
Operating system
Linux
Python version
3.10