Needs current version of tesseract #2

Closed
ghost opened this issue Dec 7, 2013 · 4 comments

@ghost

ghost commented Dec 7, 2013

I tried running pypdfocr on a Raspberry Pi. The native tesseract version in the Raspbian repositories is 3.02.01-6. The hOCR HTML generated by this version has slightly different headers than the current version's output. This leads to a nasty bug: everything seems to work, but the parser in pypdfocr never adds text to the PDF, and it never finds any usable text for filing.

The problem can be fixed by installing the current version of tesseract (3.02.02).

Suggested Fix: Add the tesseract version to the requirements in the Readme.md file.

@virantha
Owner

virantha commented Dec 8, 2013

Didn't realize tesseract had changes to the output format between those minor releases. I know it worked fine in 3.01, and I must have jumped to 3.02.02 without hitting the intermediate change. Thanks for the heads-up. I'll add the version number, and maybe a check at invocation.
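
A minimal sketch of what such a check at invocation could look like. This is not pypdfocr's actual code: the function names and `MIN_VERSION` are hypothetical, and it assumes the installed binary accepts `--version` and prints a line of the form `tesseract 3.02.02` (the output stream can differ between builds, so both are read).

```python
import re
import subprocess

# Hypothetical minimum, taken from the version reported to work in this thread.
MIN_VERSION = (3, 2, 2)  # i.e. 3.02.02


def parse_tesseract_version(text):
    """Extract a version tuple from text such as 'tesseract 3.02.02'."""
    match = re.search(r"tesseract\s+(\d+(?:\.\d+)*)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_tesseract_version(binary="tesseract"):
    """Ask the binary for its version.

    Assumption: the binary supports --version. stdout and stderr are
    both captured because the stream used may vary between builds.
    """
    result = subprocess.run(
        [binary, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout + result.stderr)


def is_supported(version, minimum=MIN_VERSION):
    """True if a version tuple was found and is at least the minimum."""
    return version is not None and version >= minimum
```

Since 3.01 was also reported to work, a strict minimum like this would be better used to print a warning than to abort the run.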

@ghost
Author

ghost commented Dec 8, 2013

Hi,
I just tried to reproduce the error. Originally, the tesseract from the Raspbian repositories produced output for me that did not contain the search string http://www.w3.org/1999/xhtml. Now it seems to produce the same output as my self-compiled version. I am still trying to find out what caused the change. My guess is that while installing the prerequisites for compiling tesseract, I updated a library that is used for writing the HTML. I will try again from a clean Raspbian install sometime this week and report back.

@ghost
Author

ghost commented Dec 8, 2013

Okay, I can reproduce the difference in the HTML on a clean Raspbian install. Excerpt from the headers using the Raspbian version of tesseract on a clean Pi:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>

Now the same from the self-compiled version on the Pi (it looks similar on my Ubuntu and Mac boxes):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.02.02' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
 </head>
 <body>

I will try to find out more about what makes the two outputs differ. Currently, on my first Pi, where I installed everything needed to compile tesseract myself, I get the same (correct) output as in the second snippet, even from the tesseract in the Raspbian repositories.

Only on the freshly installed Raspbian do I get the not-so-correct output from the first snippet. So the cause may not be the tesseract version but some library that tesseract uses. I will report back when I know more.
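
Because the version number alone may not predict which header is written, a parser could inspect the hOCR header itself before looking for text. The following is only a sketch under that assumption: `describe_hocr_header` is a hypothetical helper, not part of pypdfocr, and it relies only on the two header excerpts quoted above.

```python
import re

# The search string mentioned earlier in this thread.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Matches the ocr-system meta tag in either of the two header styles above.
OCR_SYSTEM_RE = re.compile(
    r"<meta\s+name=['\"]ocr-system['\"]\s+content=['\"]([^'\"]*)['\"]"
)


def describe_hocr_header(hocr_text):
    """Report whether the XHTML namespace is present and what ocr-system says."""
    match = OCR_SYSTEM_RE.search(hocr_text)
    return {
        "has_xhtml_namespace": XHTML_NS in hocr_text,
        "ocr_system": match.group(1) if match else None,
    }


# Shortened versions of the two headers quoted in this comment.
old_style = "<html>\n<head>\n<meta name='ocr-system' content='tesseract'/>\n</head>"
new_style = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
    " <head>\n  <meta name='ocr-system' content='tesseract 3.02.02' />\n </head>"
)

for label, header in (("old", old_style), ("new", new_style)):
    print(label, describe_hocr_header(header))
```

A caller could use `has_xhtml_namespace` to emit a clear warning instead of silently producing a PDF without a text layer.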

@virantha
Owner

Closing this as this is probably not an issue for most people.
