Extract HOCR from searchable PDF #117

Open
thwfqecj opened this issue Aug 1, 2017 · 12 comments

Comments

@thwfqecj

thwfqecj commented Aug 1, 2017

Thank you so much for your great work!

I wonder whether it is possible to extract hOCR from a searchable PDF, that is, a PDF that already has the hOCR text layer combined into it. I haven't found any tool that does this for me...

@stweil
Collaborator

stweil commented Aug 1, 2017

I don't know of such a tool either. The tool pdftohtml can extract XML from a PDF, and there is an issue for ocr-fileformats to convert that XML to hOCR, so the two tools combined would do the job.
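To illustrate the idea, here is a minimal sketch of the second half of that chain. It assumes the XML was produced with `pdftohtml -xml` and has the usual `<page>` and `<text>` elements with `top`, `left`, `width` and `height` attributes. The function name `pdf2xml_to_hocr` is hypothetical and is not part of ocr-fileformats. Each `<text>` run is emitted as an `ocr_line`, since pdftohtml reports text runs rather than individual words, and the coordinates stay in pdftohtml's own units, which may not match the pixel grid of the page image.

```python
import html
import xml.etree.ElementTree as ET


def pdf2xml_to_hocr(xml_text):
    """Convert `pdftohtml -xml` output into a minimal hOCR document (sketch)."""
    root = ET.fromstring(xml_text)
    parts = ['<html><head><meta charset="utf-8"/></head><body>']
    for page in root.iter("page"):
        # Page size as reported by pdftohtml (assumed numeric attributes).
        width = int(float(page.get("width")))
        height = int(float(page.get("height")))
        parts.append(
            f'<div class="ocr_page" title="bbox 0 0 {width} {height}">'
        )
        for text in page.iter("text"):
            left = int(float(text.get("left")))
            top = int(float(text.get("top")))
            right = left + int(float(text.get("width")))
            bottom = top + int(float(text.get("height")))
            # itertext() also collects text inside nested <b>/<i>/<a> tags.
            content = html.escape("".join(text.itertext()))
            parts.append(
                f'<span class="ocr_line" '
                f'title="bbox {left} {top} {right} {bottom}">{content}</span>'
            )
        parts.append("</div>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

The input would come from something like `pdftohtml -xml test.pdf test`, with the resulting XML file read into a string and passed to the function.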

@jsbien

jsbien commented Aug 1, 2017 via email

@thwfqecj
Author

thwfqecj commented Sep 5, 2017

@stweil @jsbien

Thanks for your comment!
I'm trying to use pdftohtml. Actually, I want to make my searchable PDF slimmer. It seems I have to convert every page to its own HTML file and then convert those back to PDF... But it works!

As for ocrodjvu,

ocrodjvu is a wrapper for OCR systems, that allows you to perform OCR on DjVu files.

It looks like an OCR tool, which doesn't fit my needs. But thank you.

@jsbien

jsbien commented Sep 5, 2017

ocrodjvu is distributed with djvu2hocr, which may be what you need.

@giancarlobi

Many thanks @jsbien, pdf2djvu + djvu2hocr works like a charm!

@thwfqecj
Author

@jsbien Thank you for your kind reply! Sorry that I didn't see ocrodjvu on its official site... I guess it should work!

@jsbien

jsbien commented Oct 18, 2017 via email

@JensHumrich

Hey,
I stumbled upon this old thread. I can confirm that the solution works...

pdf2djvu -o test.djvu test.pdf
python2 /mnt/mem/temp/ocrodjvu/ocrodjvu test.djvu -o ocrfile
python2 /mnt/mem/temp/ocrodjvu/djvu2hocr ocrfile > output.hocr

@jsbien

jsbien commented Jan 31, 2019

For a searchable PDF, the second step (running ocrodjvu) should be skipped; otherwise you get the result of a new OCR run instead of the original text.
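With that correction, the chain shrinks to two commands: pdf2djvu, then djvu2hocr applied directly to the DjVu file. Below is a minimal Python sketch of that two-step chain. It assumes both `pdf2djvu` and `djvu2hocr` are on the `PATH` (the comment above calls djvu2hocr through `python2` from a local checkout instead). The helper names `build_commands` and `pdf_to_hocr` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, djvu_path):
    """Return the two command lines; no OCR step is involved."""
    return [
        # Step 1: convert the PDF (including its text layer) to DjVu.
        ["pdf2djvu", "-o", djvu_path, pdf_path],
        # Step 2: dump the DjVu hidden text as hOCR on stdout.
        ["djvu2hocr", djvu_path],
    ]


def pdf_to_hocr(pdf_path, hocr_path):
    """Run both steps and write the hOCR output to hocr_path."""
    djvu_path = str(Path(pdf_path).with_suffix(".djvu"))
    convert, extract = build_commands(pdf_path, djvu_path)
    subprocess.run(convert, check=True)
    with open(hocr_path, "wb") as out:
        # djvu2hocr writes to stdout, so redirect it into the file.
        subprocess.run(extract, check=True, stdout=out)
```

Calling `pdf_to_hocr("test.pdf", "output.hocr")` would then correspond to the first and third commands of the earlier comment, with the ocrodjvu step left out.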

@JensHumrich

Wow. Thanks a lot. This is really important information.

@mattdeeperinsights

mattdeeperinsights commented Oct 12, 2021

I would recommend using the Python package pdftotree to get the hOCR automatically; it's very easy.

Get requirements:

Install the package with pip3 install pdftotree, and then it's as simple as this:

import pdftotree
hocr_result = pdftotree.parse('path/to/your.pdf')

Enjoy.

@rmast

rmast commented Jan 8, 2022

Hey, I stumbled upon this old thread. I can confirm that the solution works...

pdf2djvu -o test.djvu test.pdf
python2 /mnt/mem/temp/ocrodjvu/ocrodjvu test.djvu -o ocrfile
python2 /mnt/mem/temp/ocrodjvu/djvu2hocr ocrfile > output.hocr

I can now say that it doesn't work for either a PDF or a DjVu with searchable text coming from gscan2pdf.
The resulting ocrfile is still big and contains pages that are mentioned by djvu2hocr; however, the resulting output.hocr contains nothing. The contents of ocrfile look like a DjVu, and when renamed to .djvu the file can be opened in a DjVu viewer. It shows no hidden text.

To get the hOCR from the searchable DjVu, just apply djvu2hocr to the DjVu file and skip ocrodjvu.
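A quick way to tell an empty result from a usable one is to count the hOCR elements and the text they carry. The sketch below uses only the Python standard library. The class `HocrStats` and function `hocr_stats` are hypothetical names, and it assumes the file uses the common hOCR class names such as `ocr_page`, `ocr_line` and `ocrx_word`.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrStats(HTMLParser):
    """Count hOCR class names and visible text characters in <body>."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()
        self.in_body = False
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        # A class attribute may hold several space-separated names.
        for name in (dict(attrs).get("class") or "").split():
            self.classes[name] += 1

    def handle_data(self, data):
        if self.in_body:
            self.text_chars += len(data.strip())


def hocr_stats(hocr_text):
    """Return (Counter of class names, number of text characters)."""
    parser = HocrStats()
    parser.feed(hocr_text)
    parser.close()
    return parser.classes, parser.text_chars
```

If the character count is zero while `ocr_page` elements are present, the pages were converted but no hidden text came through, which matches the symptom described above.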
