pdf-ocr-parser

I ran into a problem: I needed to read information written in Kannada (one of the widely spoken languages of southern India) from a scanned or image-based PDF.

This project solves that problem using three approaches:

using JS + tesseract.js + Ollama (Qwen2.5), which interacts with Tesseract, inside v1/
using shell + Tesseract (a C++ package that must be installed on your system first), inside v2/
using Python + the googletrans library (not recommended), inside v3/

Licenses

Installation

There are three ways to run this application: v1, v2, and v3. Each has its own setup procedure. First, clone the repository:

  git clone https://github.com/abhaysinghs772/pdf-ocr-parser.git

Move into the cloned folder:

  cd pdf-ocr-parser/

v1 [ using JS and the Ollama (Qwen2.5) model ]

  cd v1/

v2 [ using raw shell scripts and Tesseract (C++ package) ]

  cd v2/

v3 [ using the googletrans Python library (NOT RECOMMENDED) ]

  cd v3/
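
As a rough illustration of the shell + Tesseract idea behind v2, here is a minimal sketch. It is not the script shipped in v2/. It assumes `pdftoppm` (from poppler-utils) and the `tesseract` CLI with the Kannada (`kan`) language data are installed; the helper names `run` and `ocr_pdf` and the `DRY_RUN` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: OCR a scanned PDF in Kannada with the Tesseract CLI.
# Assumes pdftoppm (poppler-utils) and tesseract with "kan" data are installed.
set -euo pipefail

# Run a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# ocr_pdf <input.pdf> <output-dir>  (hypothetical helper name)
ocr_pdf() {
  local pdf="$1" out="$2"
  run mkdir -p "$out"
  # Rasterize each page to a 300 dpi PNG: page-1.png, page-2.png, ...
  # (pdftoppm zero-pads the page number for longer documents).
  run pdftoppm -r 300 -png "$pdf" "$out/page"
  # OCR each page image; tesseract appends .txt to the output base name.
  local img
  for img in "$out"/page-*.png; do
    [ -e "$img" ] || continue
    run tesseract "$img" "${img%.png}" -l kan
  done
}
```

After sourcing the file, something like `ocr_pdf scan.pdf out/` should leave one `.txt` file per page in `out/`. Setting `DRY_RUN=1` prints the commands instead of running them, which helps when checking paths. If the Kannada data is not in Tesseract's default location, the `--tessdata-dir` option can point at the directory containing `kan.traineddata`.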

Acknowledgements

Feedback

If you have any feedback or run into an issue, please open an issue on the repository or reach out to me directly at abhaysinghs772@gmail.com

