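Description of the bug
The `pymupdf.get_tessdata()` function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).
How to reproduce the bug
I haven't looked into the details yet, but I think the problem lies here (PyMuPDF/src/__init__.py, lines 18093 to 18099 in eca7066):
```python
# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break
```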
I have the latest Debian with Tesseract OCR 5.3.0, whose language data is installed in `/usr/share/tesseract-ocr/5/tessdata/`.
The function `get_tessdata()` expects it in `/usr/share/tesseract-ocr/4.00/tessdata`; otherwise it searches for it with `whereis tesseract-ocr`.
However, it then calls `iterdir()` on the subprocess response, which is a list of `bytes` objects rather than a `Path`, and that raises the error.
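The failure described here can presumably be reproduced in isolation. The sketch below assumes that `response` is the captured stdout of `whereis tesseract-ocr`, stripped and split on whitespace; the byte string is a made-up stand-in for that output, not a captured log.

```python
# Hypothetical stand-in for the stdout of `whereis tesseract-ocr`.
stdout = b"tesseract-ocr: /usr/share/tesseract-ocr\n"

# Assumption: the response is the stripped, whitespace-split stdout,
# which gives a plain list of bytes objects, not a pathlib.Path.
response = stdout.strip().split()

try:
    response.iterdir()  # lists have no iterdir() method
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)  # 'list' object has no attribute 'iterdir'
```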
I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with `tessdata`, and should find it in the second element of `response`. So I guess something like this should work?
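For illustration only, a version-independent search along these lines could be sketched as follows. This is neither the reporter's snippet nor the fix that went into PyMuPDF: `find_tessdata` and its `whereis_output` parameter are hypothetical names, and the code assumes the `whereis` output has the form `tesseract-ocr: /usr/share/tesseract-ocr` with no spaces inside the paths.

```python
from pathlib import Path
from typing import Optional


def find_tessdata(whereis_output: bytes) -> Optional[Path]:
    """Return the first 'tessdata' folder below any path listed by whereis.

    Hypothetical helper: `whereis_output` stands for the raw stdout of
    `whereis tesseract-ocr`.
    """
    # Drop the leading "tesseract-ocr:" label; the rest are candidate paths.
    tokens = whereis_output.decode().split()[1:]
    for token in tokens:
        root = Path(token)  # convert to a Path so directory methods exist
        if not root.is_dir():
            continue
        # Search at any depth, so the version folder ("4.00", "5", ...)
        # does not need to be known in advance.
        for candidate in sorted(root.rglob("tessdata")):
            if candidate.is_dir():
                return candidate
    return None
```

Using `rglob` instead of two nested `iterdir()` loops also covers layouts where `tessdata` is not exactly two levels below the reported path.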
Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now, and no longer seems to be in the Debian repos, I guess it wouldn't harm to handle this case (unless the 5.0 is unsupported)?
Thanks for developing PyMuPDF! :)
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.11
MuPDF contains Tesseract 4.0 code to perform the OCR; it is an integral part of the MuPDF binary.
The MuPDF team has stated that the behavior of release 5.0 is less stable and predictable than MuPDF's purposes require. The details of this assessment are best discussed with the team directly, e.g. on this Discord channel.
So the only thing PyMuPDF's OCR actually needs is the tessdata (language support) folder.
I cannot say whether a 5.0 tessdata folder has a format compatible with that of release 4.0.
But I would definitely suggest using either the environment variable or the `tessdata` parameter.
Independently of that, we should correct the behavior of the pymupdf function.
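The two workarounds mentioned above could look roughly like this. It is a sketch, not tested against a version 5 tessdata folder (the compatibility question raised above stays open): the path is the one from the report, `ocr_first_page` is a hypothetical helper name, and the only PyMuPDF calls used are `pymupdf.open`, `Page.get_textpage_ocr` and `Page.get_text`.

```python
import os

# Path taken from the report above; adjust to the local installation.
TESSDATA = "/usr/share/tesseract-ocr/5/tessdata"

# Option 1: set the environment variable before PyMuPDF looks for it.
os.environ["TESSDATA_PREFIX"] = TESSDATA


def ocr_first_page(pdf_path: str, tessdata: str = TESSDATA) -> str:
    """Option 2: pass the folder explicitly via the tessdata parameter."""
    # Imported late so that the variable above is already set.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        page = doc[0]
        # OCR the page; the explicit folder bypasses automatic detection.
        textpage = page.get_textpage_ocr(language="eng", tessdata=tessdata)
        return page.get_text(textpage=textpage)
```

Either option avoids the installation detector entirely, so the error in `get_tessdata()` is never reached.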
No problem. I made the tesseract installation detector version-independent.
But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.