Convert image-only PDFs (scanned documents) to searchable PDFs with selectable text using OCR.
- Local Processing - All OCR happens server-side with Tesseract.js (no external APIs needed)
- Preserves Layout - Original images retained, invisible text layer added on top
- High Accuracy - Word-level bounding boxes ensure precise text positioning
- Modern Stack - Next.js 16, React 19, TypeScript, Tailwind CSS, shadcn/ui
- Modular Design - Core OCR engine can be used independently of the web app
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 1. UPLOAD │ │ 2. OCR │ │ 3. DOWNLOAD │
│ Scanned PDF │ ──► │ Extract text │ ──► │ Searchable PDF │
│ (images only) │ │ + positions │ │ (text layer) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
- PDF to Images - Each page rendered as PNG at 1.5x scale using pdf-to-img
- OCR Processing - Tesseract.js extracts text with word-level bounding boxes
- Coordinate Mapping - Image coordinates transformed to PDF coordinate system
- Text Layer - Invisible text placed exactly over visible content using pdf-lib
- Output - Original PDF preserved with searchable text layer embedded
- Node.js 20+ (required for Next.js 16)
- npm or pnpm
# Clone the repo
git clone https://github.com/ajjucoder/pdf-ocr-engine.git
cd pdf-ocr-engine
# Install dependencies
npm install
# Run dev server
npm run dev
Open http://localhost:3000 to use the app.
npm run dev # Start development server
npm run build # Build for production
npm run start # Start production server
npm run lint # Run ESLint
npm test # Run Vitest tests
npm run test:watch # Run tests in watch mode
pdf-ocr-engine/
├── src/
│ ├── app/
│ │ ├── api/convert/route.ts # POST /api/convert endpoint
│ │ ├── layout.tsx # Root layout
│ │ └── page.tsx # Landing page
│ ├── components/
│ │ ├── ui/ # shadcn/ui components
│ │ └── pdf-uploader.tsx # Main upload component
│ └── lib/
│ └── ocr/ # Core OCR engine
│ ├── index.ts # Main orchestration
│ ├── types.ts # TypeScript interfaces
│ ├── ocr.ts # Tesseract.js integration
│ ├── builder.ts # PDF construction
│ └── extractor.ts # PDF metadata extraction
├── BUGFIXES.md # Detailed bug fix history
└── package.json
Convert an image-only PDF to a searchable PDF with an OCR-extracted text layer.
Request:
Content-Type: multipart/form-data
Parameters:
pdf (File, required) - The PDF file to process
language (string, optional) - OCR language code (default: "eng")
Response (Success - 200):
Content-Type: application/pdf
Content-Disposition: attachment; filename="searchable-{original-name}.pdf"
Body: Binary PDF data
Response (Error - 400/500):
{ "error": "Error message" }
Example:
const formData = new FormData()
formData.append("pdf", pdfFile)
formData.append("language", "eng")
const response = await fetch("/api/convert", {
method: "POST",
body: formData,
})
if (response.ok) {
const blob = await response.blob()
// Download the searchable PDF
}
The OCR engine is modular and can be used independently:
import { convertPdfToSearchable } from "@/lib/ocr"
const pdfBuffer = Buffer.from(await file.arrayBuffer())
const imageBuffers = [/* PNG buffers from pdf-to-img */]
const result = await convertPdfToSearchable(
pdfBuffer,
imageBuffers,
{
language: "eng",
preserveImages: true,
},
(progress) => {
console.log(`${progress.stage}: ${progress.percentage}%`)
}
)
if (result.success) {
const searchablePdf = result.outputBuffer
// Save or return the PDF
}

| Category | Technology |
|---|---|
| Framework | Next.js 16 with App Router |
| UI | React 19, shadcn/ui, Tailwind CSS 4 |
| OCR | Tesseract.js 7 |
| PDF Processing | pdf-lib, pdf-to-img, Sharp |
| Language | TypeScript 5 (strict mode) |
| Testing | Vitest, Testing Library |
PDF Buffer
│
▼
┌─────────────────────────────────────────────────────────┐
│ pdf-to-img (scale: 1.5) │
│ Renders each page as PNG image │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Tesseract.js (with { blocks: true }) │
│ Extracts: text, confidence, bounding boxes │
│ Structure: blocks → paragraphs → lines → words │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Coordinate Transformation │
│ • Scale: image pixels → PDF points │
│ • Y-axis flip: top-origin → bottom-origin │
│ • Baseline offset: +20% for text alignment │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ pdf-lib │
│ • Copy original pages (preserves images) │
│ • Draw invisible text at calculated positions │
│ • Save searchable PDF │
└─────────────────────────────────────────────────────────┘
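The coordinate transformation stage in the diagram above can be sketched as a single pure function. This is a minimal illustration, not the project's actual code: the names `PixelBox`, `PdfTextPlacement`, and `imageBBoxToPdf` are hypothetical, and it assumes the 20% baseline offset is applied upward from the flipped bounding-box bottom, proportional to the word height.

```typescript
// Hypothetical types for illustration; they are not taken from the project source.
interface PixelBox {
  x0: number // left edge, image pixels
  y0: number // top edge, image pixels (origin at top-left)
  x1: number // right edge
  y1: number // bottom edge
}

interface PdfTextPlacement {
  x: number // PDF points, origin at bottom-left
  y: number // baseline position in PDF points
  width: number
  fontSize: number
}

function imageBBoxToPdf(
  bbox: PixelBox,
  imageWidth: number,
  imageHeight: number,
  pageWidth: number,
  pageHeight: number,
  baselineRatio = 0.2, // assumed meaning of the "+20%" offset
): PdfTextPlacement {
  if (imageWidth <= 0 || imageHeight <= 0) {
    throw new Error("Image dimensions must be positive")
  }
  // Scale: image pixels -> PDF points
  const scaleX = pageWidth / imageWidth
  const scaleY = pageHeight / imageHeight
  const height = (bbox.y1 - bbox.y0) * scaleY
  // Y-axis flip: image rows grow downward, PDF coordinates grow upward
  const bboxBottom = pageHeight - bbox.y1 * scaleY
  return {
    x: bbox.x0 * scaleX,
    // Baseline offset: lift the text so its baseline sits above the bbox bottom
    y: bboxBottom + height * baselineRatio,
    width: (bbox.x1 - bbox.x0) * scaleX,
    fontSize: height,
  }
}
```

For a US Letter page (612 × 792 points) rendered at 1.5x, the image is 918 × 1188 pixels, so a word box 30 pixels tall maps to a 20-point height, and its baseline lands 4 points above the flipped box bottom.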
- Singleton Worker - Tesseract worker created once, reused for all pages (64x faster)
- Invisible Text Layer - Text with opacity: 0 placed over images for selection
- Baseline Offset - 20% adjustment because PDF drawText() uses baseline, not bbox bottom
- Defensive Checks - Validation for dimensions, text width, edge cases
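The singleton worker idea from the list above can be sketched with a generic lazy-initialization helper. This is an assumption about how such a pattern might look, not the project's implementation: `lazySingleton` is a hypothetical name, and the Tesseract.js call appears only in a comment.

```typescript
// Generic lazy singleton: the factory runs at most once, and concurrent
// callers share the same in-flight promise instead of creating duplicates.
function lazySingleton<T>(factory: () => Promise<T>): () => Promise<T> {
  let instance: Promise<T> | null = null
  return () => {
    if (instance === null) {
      instance = factory().catch((err: unknown) => {
        instance = null // allow a retry after a failed initialization
        throw err
      })
    }
    return instance
  }
}

// Hypothetical usage with Tesseract.js (assumes `createWorker` is imported
// from "tesseract.js"):
//
//   const getWorker = lazySingleton(() => createWorker("eng"))
//   const worker = await getWorker() // created on first call, reused afterwards
```

Caching the promise rather than the resolved worker means two pages processed at the same time still trigger only one initialization.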
See BUGFIXES.md for detailed history. Key fixes:
| Issue | Root Cause | Solution |
|---|---|---|
| Text not positioned correctly | Tesseract.js v7 API change | Enable { blocks: true }, extract from nested structure |
| Text shifted downward | Used bbox bottom instead of baseline | Added 20% baseline offset |
| Invisible text layer | Hardcoded width: 0, height: 0 | Use Sharp for actual dimensions |
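
The first fix in the table, extracting words from the nested blocks → paragraphs → lines → words structure, could look roughly like the sketch below. The interfaces are simplified assumptions modeled on the fields named in this README (text, confidence, bounding boxes), and `collectWords` is a hypothetical helper rather than a function from the project.

```typescript
// Simplified, assumed shapes of the nested OCR result.
interface BBox { x0: number; y0: number; x1: number; y1: number }
interface OcrWord { text: string; confidence: number; bbox: BBox }
interface OcrLine { words: OcrWord[] }
interface OcrParagraph { lines: OcrLine[] }
interface OcrBlock { paragraphs: OcrParagraph[] }

// Flatten the hierarchy into a list of words, skipping blank entries and
// anything below the confidence threshold. A missing `blocks` value (for
// example when block output was not requested) yields an empty list.
function collectWords(
  blocks: OcrBlock[] | null | undefined,
  minConfidence = 0,
): OcrWord[] {
  const words: OcrWord[] = []
  for (const block of blocks ?? []) {
    for (const paragraph of block.paragraphs) {
      for (const line of paragraph.lines) {
        for (const word of line.words) {
          if (word.text.trim() !== "" && word.confidence >= minConfidence) {
            words.push(word)
          }
        }
      }
    }
  }
  return words
}
```

Treating a null `blocks` value as "no words" keeps the caller from crashing when the nested output is absent.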
- MCP server for Claude/OpenCode integration
- Multi-language OCR support
- Progress streaming via Server-Sent Events
- Batch processing for multiple PDFs
- Cloud OCR fallback for better accuracy
- Memory leak fix for Blob URLs
- Rate limiting and security hardening
MIT