Invalid characters in versions >= 1.22 #2553
Comments
Issue also confirmed for Windows. Will be fixed in the coming version. Background: This fix removes that difference, causing all output from variants "text", "words", "blocks", "dict"/"json" and "rawdict"/"rawjson" to make use of Python's
For text extraction with `get_text("words")`, or extractWORDS, words are defined as strings not containing white space. This change allows up to 64 additional characters to function as delimiters as well. This makes it possible, for instance, to separate words from punctuation or to decompose an e-mail address into its components.

Other changes:

- Fixing #2522: correcting the typo.
- Remove some unnecessary setting of flags when creating annotations.
- Fixing #2553: Adjust plain text extraction to use the same approach as the other variants. This entails using Unicode escape strings on output instead of the output of fz_chartorune. Another consequence is that standard text output is directed to a fz_buffer instead of a fz_output.
- Fixing #2556: Check for the existence of path dictionaries at every possible place. Includes an additional test function.
- Add functions JM_ignore_rect / JM_ignore_irect, which return a bool. They return True if the rectangle should be ignored. This is the case for infinite and empty rectangles, but also for any rectangle that has a common edge with the infinite rectangle.
- Support variable setting of character border widths for insert_text() / insert_textbox(). The value is a factor to be multiplied with the font size. Default is 0.05 (read: 5% of the font size). It is relevant for text rendering modes 1 and 2 only.
- Fixing #2637: In Page.insert_textbox, when the last word of a line did not fit in the line buffer, the line position was not increased. This is now handled correctly.
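To illustrate the delimiter idea described above, here is a minimal pure-Python sketch. It is not PyMuPDF's implementation: the function name `split_words` and its `delimiters` argument are hypothetical, and the sketch assumes that delimiter characters are dropped from the output in the same way white space is.

```python
import string


def split_words(text, delimiters=None):
    """Split text at white space and at any extra delimiter characters.

    Hypothetical helper for illustration only; it mirrors the limit of
    64 extra delimiter characters mentioned in the release notes.
    """
    extra = set(delimiters or "")
    if len(extra) > 64:
        raise ValueError("at most 64 extra delimiter characters are supported")
    words, current = [], []
    for ch in text:
        if ch.isspace() or ch in extra:
            # A separator ends the current word (if one is in progress).
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


sample = "Mail john.doe@example.com, please!"
print(split_words(sample))
# ['Mail', 'john.doe@example.com,', 'please!']
print(split_words(sample, delimiters=string.punctuation))
# ['Mail', 'john', 'doe', 'example', 'com', 'please']
```

With white space only, punctuation stays glued to the neighboring word. Adding punctuation as delimiters splits the e-mail address into its components.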
Previously, plain text output converted characters via fz_chartorune, whereas "words", "blocks", "dict" and "rawdict" handled character conversion differently, using Python raw unicode decoding. A somewhat different behavior again was used in page.get_textbox(), which is plain text extraction from within a rectangle, independent of using a clip. This fix ensures that all plain text extraction (including textbox) delivers the same output. This is checked by comparing the set of characters produced in each of the cases.
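The character-set comparison mentioned above could look roughly like the following sketch. It is not the project's actual test: the function name `charset_differences` and the sample strings are hypothetical, and white space is ignored on the assumption that the variants lay out line breaks differently.

```python
def charset_differences(outputs, reference="text"):
    """Compare the set of non-space characters of each variant with a reference.

    `outputs` maps a variant name to the text it produced. Returns a dict of
    variant name -> (extra characters, missing characters) for every variant
    whose character set differs from the reference variant.
    """
    ref_chars = {c for c in outputs[reference] if not c.isspace()}
    diffs = {}
    for name, text in outputs.items():
        chars = {c for c in text if not c.isspace()}
        if chars != ref_chars:
            diffs[name] = (sorted(chars - ref_chars), sorted(ref_chars - chars))
    return diffs


# Hypothetical outputs of three extraction variants for the same page.
outputs = {
    "text": "E = mc\ufffd",
    "words": "E = mc\udc52",
    "textbox": "E =\nmc\ufffd",
}
# ascii() keeps the report printable even if it contains lone surrogates.
print(ascii(charset_differences(outputs)))
```

An empty result means every variant produced the same set of characters as the reference.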
Thanks for getting a fix in. Any estimate on when you'll cut the next release?
Fixed in 1.23.6.
Hello everyone. It seems this issue still occurs with one specific PDF: Universal_Registration_Document_2023_VUK.pdf. With version 1.21.1, the text is extracted correctly. With version 1.23.6, however, the invalid characters appear. Could you confirm this, please? Sample code attached below.
@vicent4no there has been a bug report for the underlying library MuPDF for exactly this problem. Background:
Thank you for your time and thorough explanation, I appreciate it.
Ensure that this build contains MuPDF fix https://bugs.ghostscript.com/show_bug.cgi?id=707045
Fixed in 1.23.7.
Describe the bug (mandatory)
`get_text()` in versions >= 1.22 produces � characters in some cases, usually related to LaTeX. This was not an issue in v1.21.1, and other PDF libraries extract the text just fine (though `pdfplumber` appears to miss a few characters).

Additionally, `get_text(sort=True)` converts the � to `\udc52`, which creates other issues, e.g. it causes `print()` to fail with the error `UnicodeEncodeError: 'utf-8' codec can't encode character '\udc52' in position 35: surrogates not allowed`.
1001.2481.pdf
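For readers hitting the `UnicodeEncodeError` described above, the following sketch shows why a lone surrogate such as `\udc52` cannot be encoded as UTF-8, and one possible workaround on the caller's side. The helper names and the sample string are hypothetical, and this only masks the symptom; it does not restore the characters that the extraction lost.

```python
import re

# Code points U+D800..U+DFFF are surrogates. A Python str never needs them
# for valid text, so any occurrence is a "lone" surrogate.
_LONE_SURROGATES = re.compile("[\ud800-\udfff]")


def has_lone_surrogates(text):
    """Return True if the text contains at least one surrogate code point."""
    return _LONE_SURROGATES.search(text) is not None


def replace_lone_surrogates(text, replacement="\ufffd"):
    """Replace every surrogate code point, by default with U+FFFD."""
    return _LONE_SURROGATES.sub(replacement, text)


extracted = "energy \udc52 level"  # hypothetical extraction result

try:
    extracted.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

safe = replace_lone_surrogates(extracted)
print(safe.encode("utf-8"))  # encoding now succeeds
```

Replacing with U+FFFD keeps the string length and character positions unchanged, which matters if the offsets are used elsewhere.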
To Reproduce (mandatory)
PyMuPDF v1.21.1
Expected behavior (optional)
I expect the text to be extracted like it was in v1.21.1. If there are invalid characters, I'd also expect the sort to keep the characters the same.
Your configuration (mandatory)
Version date: 2023-06-21 00:00:01.
Built for Python 3.10 on linux (64-bit).