PDF OCR Engine

Convert image-only PDFs (scanned documents) to searchable PDFs with selectable text using OCR.

Features

Local Processing - All OCR happens server-side with Tesseract.js (no external APIs needed)
Preserves Layout - Original images retained, invisible text layer added on top
High Accuracy - Word-level bounding boxes ensure precise text positioning
Modern Stack - Next.js 16, React 19, TypeScript, Tailwind CSS, shadcn/ui
Modular Design - Core OCR engine can be used independently of the web app

How It Works

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. UPLOAD      │     │  2. OCR         │     │  3. DOWNLOAD    │
│  Scanned PDF    │ ──► │  Extract text   │ ──► │  Searchable PDF │
│  (images only)  │     │  + positions    │     │  (text layer)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Technical Flow

PDF to Images - Each page rendered as PNG at 1.5x scale using pdf-to-img
OCR Processing - Tesseract.js extracts text with word-level bounding boxes
Coordinate Mapping - Image coordinates transformed to PDF coordinate system
Text Layer - Invisible text placed exactly over visible content using pdf-lib
Output - Original PDF preserved with searchable text layer embedded

Getting Started

Prerequisites

Node.js 20+ (required for Next.js 16)
npm or pnpm

Installation

# Clone the repo
git clone https://github.com/ajjucoder/pdf-ocr-engine.git
cd pdf-ocr-engine

# Install dependencies
npm install

# Run dev server
npm run dev

Open http://localhost:3000 to use the app.

Available Scripts

npm run dev          # Start development server
npm run build        # Build for production
npm run start        # Start production server
npm run lint         # Run ESLint
npm test             # Run Vitest tests
npm run test:watch   # Run tests in watch mode

Project Structure

pdf-ocr-engine/
├── src/
│   ├── app/
│   │   ├── api/convert/route.ts    # POST /api/convert endpoint
│   │   ├── layout.tsx              # Root layout
│   │   └── page.tsx                # Landing page
│   ├── components/
│   │   ├── ui/                     # shadcn/ui components
│   │   └── pdf-uploader.tsx        # Main upload component
│   └── lib/
│       └── ocr/                    # Core OCR engine
│           ├── index.ts            # Main orchestration
│           ├── types.ts            # TypeScript interfaces
│           ├── ocr.ts              # Tesseract.js integration
│           ├── builder.ts          # PDF construction
│           └── extractor.ts        # PDF metadata extraction
├── BUGFIXES.md                     # Detailed bug fix history
└── package.json

API Documentation

POST /api/convert

Convert an image-only PDF to a searchable PDF with OCR-extracted text layer.

Request:

Content-Type: multipart/form-data

Parameters:
  pdf      (File, required)   - The PDF file to process
  language (string, optional) - OCR language code (default: "eng")

Response (Success - 200):

Content-Type: application/pdf
Content-Disposition: attachment; filename="searchable-{original-name}.pdf"

Body: Binary PDF data

Response (Error - 400/500):

{ "error": "Error message" }

Example:

const formData = new FormData()
formData.append("pdf", pdfFile)
formData.append("language", "eng")

const response = await fetch("/api/convert", {
  method: "POST",
  body: formData,
})

if (response.ok) {
  const blob = await response.blob()
  // Download the searchable PDF
}

Using the OCR Engine Programmatically

The OCR engine is modular and can be used independently:

import { convertPdfToSearchable } from "@/lib/ocr"

const pdfBuffer = Buffer.from(await file.arrayBuffer())
const imageBuffers = [/* PNG buffers from pdf-to-img */]

const result = await convertPdfToSearchable(
  pdfBuffer,
  imageBuffers,
  {
    language: "eng",
    preserveImages: true,
  },
  (progress) => {
    console.log(`${progress.stage}: ${progress.percentage}%`)
  }
)

if (result.success) {
  const searchablePdf = result.outputBuffer
  // Save or return the PDF
}

Tech Stack

Category	Technology
Framework	Next.js 16 with App Router
UI	React 19, shadcn/ui, Tailwind CSS 4
OCR	Tesseract.js 7
PDF Processing	pdf-lib, pdf-to-img, Sharp
Language	TypeScript 5 (strict mode)
Testing	Vitest, Testing Library

Architecture

OCR Pipeline

PDF Buffer
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  pdf-to-img (scale: 1.5)                                │
│  Renders each page as PNG image                         │
└─────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  Tesseract.js (with { blocks: true })                   │
│  Extracts: text, confidence, bounding boxes             │
│  Structure: blocks → paragraphs → lines → words         │
└─────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  Coordinate Transformation                              │
│  • Scale: image pixels → PDF points                     │
│  • Y-axis flip: top-origin → bottom-origin              │
│  • Baseline offset: +20% for text alignment             │
└─────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  pdf-lib                                                │
│  • Copy original pages (preserves images)               │
│  • Draw invisible text at calculated positions          │
│  • Save searchable PDF                                  │
└─────────────────────────────────────────────────────────┘

Key Design Decisions

Singleton Worker - Tesseract worker created once, reused for all pages (64x faster)
Invisible Text Layer - Text with opacity: 0 placed over images for selection
Baseline Offset - 20% adjustment because PDF drawText() uses baseline, not bbox bottom
Defensive Checks - Validation for dimensions, text width, edge cases

Recent Bug Fixes

See BUGFIXES.md for detailed history. Key fixes:

Issue	Root Cause	Solution
Text not positioned correctly	Tesseract.js v7 API change	Enable `{ blocks: true }`, extract from nested structure
Text shifted downward	Used bbox bottom instead of baseline	Added 20% baseline offset
Invisible text layer	Hardcoded `width: 0, height: 0`	Use Sharp for actual dimensions

Future Improvements

MCP server for Claude/OpenCode integration
Multi-language OCR support
Progress streaming via Server-Sent Events
Batch processing for multiple PDFs
Cloud OCR fallback for better accuracy
Memory leak fix for Blob URLs
Rate limiting and security hardening

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
public		public
src		src
.gitignore		.gitignore
BUGFIXES.md		BUGFIXES.md
README.md		README.md
components.json		components.json
eslint.config.mjs		eslint.config.mjs
next.config.ts		next.config.ts
package-lock.json		package-lock.json
package.json		package.json
postcss.config.mjs		postcss.config.mjs
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts
vitest.setup.ts		vitest.setup.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF OCR Engine

Features

How It Works

Technical Flow

Getting Started

Prerequisites

Installation

Available Scripts

Project Structure

API Documentation

POST /api/convert

Using the OCR Engine Programmatically

Tech Stack

Architecture

OCR Pipeline

Key Design Decisions

Recent Bug Fixes

Future Improvements

License

About

Uh oh!

Releases

Packages

Languages

ajjucoder/pdf-ocr-engine

Folders and files

Latest commit

History

Repository files navigation

PDF OCR Engine

Features

How It Works

Technical Flow

Getting Started

Prerequisites

Installation

Available Scripts

Project Structure

API Documentation

POST /api/convert

Using the OCR Engine Programmatically

Tech Stack

Architecture

OCR Pipeline

Key Design Decisions

Recent Bug Fixes

Future Improvements

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages