Cannot get Tessdata with Tesseract-OCR 5 #3767

rezemika · 2024-08-10T13:06:28Z

Description of the bug

The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')

How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here:

PyMuPDF/src/__init__.py, lines 18093 to 18099 in eca7066:

    # determine tessdata via iteration over subfolders
    tessdata = None
    for sub_response in response.iterdir():
        for sub_sub in sub_response.iterdir():
            if str(sub_sub).endswith("tessdata"):
                tessdata = sub_sub
                break

I have the latest Debian with Tesseract OCR 5.3.0, installed in /usr/share/tesseract-ocr/5/tessdata/.
The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.

However, it then calls iterdir() on the subprocess response, which is a list of bytes objects rather than a pathlib.Path, and that raises the error.

>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>> 
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
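
For illustration, the parsing step can be isolated in a small helper that turns the raw whereis output into pathlib.Path objects, which do have iterdir(). This is only a sketch, not PyMuPDF's code: the name parse_whereis is hypothetical, and it assumes the output format shown above (a label ending in a colon, followed by whitespace-separated paths).

```python
import pathlib


def parse_whereis(stdout: bytes) -> list[pathlib.Path]:
    """Return the paths listed in `whereis` output as Path objects.

    Hypothetical helper: assumes output shaped like b'name: /path1 /path2\n'.
    """
    parts = stdout.decode("utf-8").split()
    # parts[0] is the label (e.g. 'tesseract-ocr:'); the rest are paths.
    # If whereis found nothing, only the label is present and the list is empty.
    return [pathlib.Path(p) for p in parts[1:]]


paths = parse_whereis(b"tesseract-ocr: /usr/share/tesseract-ocr\n")
print(paths)
```

Returning an empty list when nothing was found lets the caller decide what to do, instead of failing with an IndexError on response[1].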

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the second element of response. So I guess something like this should work?

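import pathlib
import subprocess

cp = subprocess.run('whereis tesseract-ocr', shell=True, capture_output=True, check=False)
response = cp.stdout.strip().split()
# response[0] is the label b'tesseract-ocr:', response[1] is the first path found
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    if not sub_dir.is_dir():  # skip plain files, iterdir() would fail on them
        continue
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
    if tessdata:  # also leave the outer loop once a folder was found
        break
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'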

Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now, and no longer seems to be in the Debian repos, I guess it wouldn't harm to handle this case (unless the 5.0 is unsupported)?

Thanks for developing PyMuPDF! :)

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.11


JorjMcKie · 2024-08-11T09:00:35Z

MuPDF contains Tesseract 4.0 code to perform the OCR; it is an integral part of the MuPDF binary.

The MuPDF team has stated that release 5.0 behavior is far less stable and predictable than MuPDF's purposes require. The details of this assessment are best discussed with the team directly, e.g. on this Discord channel.

So what PyMuPDF's OCR actually needs is exclusively the tessdata (language support) folder.
I cannot say whether a 5.0 tessdata has a format compatible with that of release 4.0.
But I would definitely suggest using either the environment variable or the tessdata parameter.
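
The two routes suggested here can be sketched as follows. This is a minimal illustration rather than PyMuPDF's implementation: resolve_tessdata is a hypothetical helper, /opt/tessdata is a made-up path, and the precedence shown (explicit argument first, then TESSDATA_PREFIX) is an assumption made for the sketch.

```python
import os
from typing import Mapping, Optional


def resolve_tessdata(explicit: Optional[str] = None,
                     env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Pick a tessdata folder without probing the file system.

    Hypothetical helper: an explicit path wins, then TESSDATA_PREFIX, else None.
    """
    if explicit:
        return explicit
    return env.get("TESSDATA_PREFIX") or None


# Route 1: environment variable, simulated here with a plain dict.
fake_env = {"TESSDATA_PREFIX": "/usr/share/tesseract-ocr/5/tessdata"}
print(resolve_tessdata(env=fake_env))

# Route 2: an explicit path (made up for this example), which could then be
# handed to the tessdata parameter of an OCR call such as
# page.get_textpage_ocr(tessdata=...).
print(resolve_tessdata("/opt/tessdata", env=fake_env))
```

Either route avoids the whereis lookup entirely, so the result does not depend on where the distribution installs Tesseract.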

Independently of the aforementioned, we should correct the behavior of the pymupdf function.

rezemika · 2024-08-12T10:17:54Z

Oh my bad, thanks for these details!

JorjMcKie · 2024-08-12T10:52:06Z

No problem. I made the tesseract installation detector version-independent.
But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.
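
A version-independent search could look roughly like the sketch below. It is not the code that went into PyMuPDF: find_tessdata is a hypothetical name, and the sketch assumes the layout described in this issue, where tessdata sits one level below a version folder such as 5 or 4.00.

```python
import pathlib
from typing import Optional


def find_tessdata(base: pathlib.Path) -> Optional[str]:
    """Return the first '<base>/<version>/tessdata' folder, or None.

    Hypothetical sketch: the version folder name is not inspected, so
    '4.00', '5', or anything else is accepted.
    """
    if not base.is_dir():
        return None
    # sorted() makes the result deterministic when several versions exist.
    for candidate in sorted(base.glob("*/tessdata")):
        if candidate.is_dir():
            return str(candidate)
    return None


print(find_tessdata(pathlib.Path("/usr/share/tesseract-ocr")))
```

Finding the folder says nothing about whether its contents are compatible with the Tesseract 4.00 code inside MuPDF; that question stays open.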

julian-smith-artifex-com · 2024-09-02T16:43:36Z

Fixed in 1.24.10.

JorjMcKie added the labels bug, fix developed, and release schedule to be determined on Aug 11, 2024

julian-smith-artifex-com closed this as completed Sep 2, 2024
