Recognize the page content of a PDF as text using Tesseract and Ghostscript.
- Install Visual Studio 2015 Runtime (both x86 & x64)
- Install Ghostscript (x86 or x64, depending on your computer)
- Clone or download this repository.
- Open the solution in Visual Studio and run `Install-Package Tesseract -Version 3.0.2` from the Package Manager Console.
- Download the language data files for Tesseract 3.04 from the tessdata repository and add them to the `tessdata` folder of your project. Set `Copy to output directory` to `Always` for all the copied files. You can copy only the language files you are interested in (e.g. all the files that start with `eng` for English).
| Setting | Variable name | Default | Description |
|---|---|---|---|
| Input PDF file | `inputPdfFile` | `test.pdf`, included in the repository | The PDF file whose selected page's content will be recognized as text. |
| Page number | `pageNumber` | `1` | The number of the page whose content will be recognized as text. |
| Recognition language | `ocrLanguage` | `"eng"` | The language Tesseract uses to recognize text. When you change this value, make sure you add the matching language data files to the `tessdata` folder. See the Installation section. |
| DPI for converting the PDF page to an image | `pdfToImageDPI` | `150` | Tesseract can't recognize text directly from PDF pages, which is why the PDF page has to be converted to an image first. This property sets the DPI used for that conversion. |
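
To show how the four settings fit together, here is a minimal sketch of the two-step flow: render one PDF page to an image, then run OCR on that image. It is not necessarily how this repository performs the conversion. It assumes the Ghostscript console executable is called `gswin64c.exe` and is on your `PATH`, and it writes the intermediate image to a hypothetical `page.png` in the temp folder. The class name `PdfPageOcrSketch` is also just illustrative.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Tesseract;

class PdfPageOcrSketch
{
    static void Main()
    {
        // The four settings described in the table, with their defaults.
        string inputPdfFile = "test.pdf";
        int pageNumber = 1;
        string ocrLanguage = "eng";
        int pdfToImageDPI = 150;

        // Assumptions for this sketch: executable name and temp image path.
        string ghostscriptExe = "gswin64c.exe";
        string pageImage = Path.Combine(Path.GetTempPath(), "page.png");

        // Step 1: ask Ghostscript to render only the selected page as a PNG
        // at the requested resolution.
        string args = string.Format(
            "-dNOPAUSE -dBATCH -dSAFER -sDEVICE=png16m -r{0} " +
            "-dFirstPage={1} -dLastPage={1} -sOutputFile=\"{2}\" \"{3}\"",
            pdfToImageDPI, pageNumber, pageImage, inputPdfFile);

        var startInfo = new ProcessStartInfo(ghostscriptExe, args)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var ghostscript = Process.Start(startInfo))
        {
            ghostscript.WaitForExit();
            if (ghostscript.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "Ghostscript exited with code " + ghostscript.ExitCode);
            }
        }

        // Step 2: run Tesseract on the rendered image. The language data
        // files for ocrLanguage must be present in the tessdata folder.
        using (var engine = new TesseractEngine(@"./tessdata", ocrLanguage, EngineMode.Default))
        using (var image = Pix.LoadFromFile(pageImage))
        using (var page = engine.Process(image))
        {
            Console.WriteLine(page.GetText());
        }
    }
}
```

A higher `pdfToImageDPI` generally gives Tesseract more detail to work with, at the cost of a larger image and slower processing, so it is worth experimenting if small text is recognized poorly at 150.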
If you need more information on Tesseract usage, please visit its own repository.