Needs current version of tesseract #2

ghost · 2013-12-07T22:00:25Z

I tried running pypdfocr on a Raspberry Pi. The native tesseract version in the Raspbian repositories is 3.02.01-6. The hOCR HTML generated by this version has slightly different headers from those of the current version. This leads to a nasty bug: everything seems to work, but the parser in pypdfocr never adds text to the PDF, nor does it find any usable text for filing.

The problem can be fixed by installing the current version of tesseract (3.02.02).

Suggested Fix: Add the tesseract version to the requirements in the Readme.md file.

virantha · 2013-12-08T20:38:13Z

Didn't realize tesseract had changed the output format between those minor releases. I know it worked fine in 3.01, and I must have jumped to 3.02.02 without hitting the intermediate change. Thanks for the heads-up. I'll add the version number, and maybe a check to the invocation.
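
A check like the one mentioned above could look roughly like the following. This is only a sketch, not pypdfocr's actual code: the names `MIN_VERSION`, `parse_tesseract_version`, `meets_minimum` and `installed_tesseract_version` are made up for illustration, and it assumes that `tesseract --version` prints a banner starting with `tesseract <version>` (older releases write it to stderr, so both streams are captured).

```python
import re
import subprocess

# 3.02.02 is the version reported to work in this thread.
MIN_VERSION = (3, 2, 2)


def parse_tesseract_version(banner):
    """Turn a banner such as 'tesseract 3.02.02' into a tuple of ints, or None."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum=MIN_VERSION):
    """True if a parsed version tuple is at least the required minimum."""
    return version is not None and version >= minimum


def installed_tesseract_version(executable="tesseract"):
    """Run the executable and parse its version banner (assumed output format)."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)
```

Calling `meets_minimum(installed_tesseract_version())` before the OCR step would at least turn a silent failure into an explicit warning. It only compares version numbers, so it cannot catch differences that come from somewhere other than the tesseract release itself.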

ghost · 2013-12-08T23:22:07Z

Hi,
I just tried to reproduce the error. Originally, the tesseract from the Raspbian repositories produced output that did not match the search string http://www.w3.org/1999/xhtml. Now it seems to produce the same output as my self-compiled version. I am still trying to find out what caused the change. My guess is that while installing the prerequisites for compiling tesseract, I updated a library that is used for writing the HTML. I will try again from a clean Raspbian sometime this week and report back.

ghost · 2013-12-08T23:49:12Z

Okay, I can reproduce the difference in html on a clean raspbian. Excerpt from headers using raspbian version of tesseract on clean pi:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>

Now the same from the self-compiled version on the Pi (on my Ubuntu and Mac boxes it looks similar):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.02.02' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
 </head>
 <body>

I will try to find out more about what makes the output of those two different. Currently, on my first Pi, where I installed everything needed to compile tesseract myself, I get the same (correct) output as in the second snippet, even with the tesseract from the Raspbian repositories.

Only on the installed-from-scratch Raspbian do I get the not-so-correct output from the first snippet. So it may not be the tesseract version but some library used by tesseract. I will report back when I know more.
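
For illustration, a parser can be made indifferent to the header difference between the two snippets by matching elements on their local tag name instead of the XHTML namespace. The sketch below is not taken from pypdfocr; `declares_xhtml_namespace`, `local_name` and `extract_words` are hypothetical helpers, and it assumes the hOCR output is well-formed XML and that words are marked up as `span` elements with class `ocrx_word`, as the `ocr-capabilities` header above suggests.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def declares_xhtml_namespace(hocr_text):
    """True if the document mentions the XHTML namespace (second snippet style)."""
    return XHTML_NS in hocr_text


def local_name(tag):
    """Strip a '{namespace}' prefix: '{...xhtml}span' and 'span' both give 'span'."""
    return tag.rsplit("}", 1)[-1]


def extract_words(hocr_text):
    """Collect the text of every ocrx_word span, with or without a namespace."""
    root = ET.fromstring(hocr_text)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "span":
            continue
        if "ocrx_word" in element.get("class", "").split():
            words.append("".join(element.itertext()).strip())
    return words
```

With this approach both header variants yield the same word list, and `declares_xhtml_namespace` can still be used to log which variant was encountered.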

virantha · 2013-12-31T22:10:13Z

Closing this as this is probably not an issue for most people.

virantha closed this as completed Dec 31, 2013
