A comprehensive PDF content extraction library with support for text, images, and structured data.
- 📝 Text Extraction - High-quality text extraction with positioning
- 🖼️ Image Detection - Detect and reference images in PDF content
- 💾 Image File Extraction - Extract actual image files from PDFs
- 🎨 Image Optimization - Optional Sharp/Imagemin optimization with quality control
- 🔄 JP2 Conversion - Automatic JPEG 2000 to JPG conversion for compatibility
- 🚀 Parallel Processing - 1.5-3x faster with configurable concurrency (Phase 1)
- ⚡ Async I/O - Non-blocking file operations for better performance (Phase 2)
- 🧵 Worker Threads - True multi-threading for CPU-intensive operations (Phase 3)
- 🌊 Streaming API - Process large PDFs with 10-100x lower memory usage (Phase 4)
- 📄 Page to Image - Convert PDF pages to images (PNG, JPG, WebP) (Phase 5 - NEW!)
- 🎯 Format Preservation - Preserves original image formats (JPG, PNG) and full quality
- 🔧 TypeScript Support - Full TypeScript definitions included
- 🛡️ Robust Validation - Comprehensive input validation and error handling
# Using pnpm (recommended)
pnpm add pdf-plus
# Using npm
npm install pdf-plus
# Using yarn
yarn add pdf-plus

import { extractPdfContent } from "pdf-plus";
// Extract both text and images
const result = await extractPdfContent("document.pdf", {
extractText: true,
extractImages: true,
verbose: true,
});
console.log(
`Extracted ${result.images.length} images from ${result.document.pages} pages`
);
console.log(`Text content: ${result.cleanText.substring(0, 100)}...`);

For large PDFs, use the streaming API for lower memory usage and real-time progress:
import { extractPdfStream } from "pdf-plus";
const stream = extractPdfStream("large-document.pdf", {
extractImageFiles: true,
imageOutputDir: "./images",
streamMode: true,
});
for await (const event of stream) {
if (event.type === "page") {
console.log(`Page ${event.pageNumber}/${event.totalPages} complete`);
} else if (event.type === "progress") {
console.log(`Progress: ${event.percentComplete.toFixed(1)}%`);
} else if (event.type === "complete") {
console.log(`Done! ${event.totalImages} images extracted`);
}
}

Benefits:
- 📉 10-100x lower memory usage for large PDFs
- ⚡ 100x faster time to first result
- 📊 Real-time progress tracking
- 🛑 Cancellation support
See PHASE4-STREAMING.md for complete streaming API documentation.
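
One way to model the events consumed in the loop above is a discriminated union. The sketch below is an assumption about the event shapes, inferred only from the fields the example reads (`pageNumber`, `totalPages`, `percentComplete`, `totalImages`). The names `StreamEvent`, `describeEvent`, `mockStream`, and `collectLog` are hypothetical and are not part of pdf-plus; the mock generator stands in for a real stream so the sketch runs without a PDF.

```typescript
// Hypothetical event shapes, inferred from the fields used in the example above.
type StreamEvent =
  | { type: "page"; pageNumber: number; totalPages: number }
  | { type: "progress"; percentComplete: number }
  | { type: "complete"; totalImages: number };

// Turn one event into a log line; the switch is exhaustive over `type`.
function describeEvent(ev: StreamEvent): string {
  switch (ev.type) {
    case "page":
      return `Page ${ev.pageNumber}/${ev.totalPages} complete`;
    case "progress":
      return `Progress: ${ev.percentComplete.toFixed(1)}%`;
    case "complete":
      return `Done! ${ev.totalImages} images extracted`;
  }
}

// Stand-in for a real stream: emits a page and a progress event per page,
// then a final "complete" event.
async function* mockStream(totalPages: number): AsyncGenerator<StreamEvent> {
  for (let page = 1; page <= totalPages; page++) {
    yield { type: "page", pageNumber: page, totalPages };
    yield { type: "progress", percentComplete: (page / totalPages) * 100 };
  }
  yield { type: "complete", totalImages: 0 };
}

// Consume any async iterable of events and collect the log lines.
async function collectLog(stream: AsyncIterable<StreamEvent>): Promise<string[]> {
  const lines: string[] = [];
  for await (const ev of stream) {
    lines.push(describeEvent(ev));
  }
  return lines;
}
```

Typing the events this way lets the compiler check that each branch only touches fields that exist on that event type.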
Render PDF pages to high-quality images with a simple function call:
import { generatePageImages } from "pdf-plus";
// Simple - render all pages to JPG images
const imagePaths = await generatePageImages(
"document.pdf", // PDF file path
"./page-images" // Output directory where images will be saved
);
console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/path/to/page-images/jpg/page-001.jpg', '/path/to/page-images/jpg/page-002.jpg', ...]

With Options:
const imagePaths = await generatePageImages("document.pdf", "./page-images", {
pageImageFormat: "jpg", // 'jpg', 'png', or 'webp'
pageImageDpi: 150, // DPI quality (72, 150, 300, 600)
pageRenderEngine: "poppler", // 'poppler' (recommended) or 'pdfjs'
specificPages: [1, 2, 3], // Optional: only render specific pages
parallelProcessing: true, // Parallel rendering (default: true)
maxConcurrentPages: 10, // Max parallel pages (default: 10)
verbose: true, // Show progress
});

Features:
- 🎨 Multiple formats - JPG, PNG, WebP
- 📐 Quality control - Adjustable DPI (72, 150, 300, 600)
- 📄 Page selection - Render specific pages or all pages
- 🚀 Parallel rendering - Fast multi-page processing
- 📁 Returns file paths - Array of absolute paths to generated images
- 🔧 Two engines - Poppler (best quality) or PDF.js
Output Structure:
page-images/
└── jpg/
├── page-001.jpg
├── page-002.jpg
└── page-003.jpg
See PAGE-TO-IMAGE-FEATURE.md for complete page-to-image documentation.
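
The output layout above (a subdirectory per format, zero-padded page numbers) can be reproduced when you need to predict a page's file name before rendering. This is a minimal sketch: `pageImagePath` is a hypothetical helper, not a pdf-plus export, and the three-digit padding is assumed from the sample file names. It returns a path relative to `outputDir`, whereas `generatePageImages` is documented to return absolute paths.

```typescript
type PageImageFormat = "jpg" | "png" | "webp";

// Hypothetical helper: builds <outputDir>/<format>/page-NNN.<format>,
// mirroring the directory tree shown above.
function pageImagePath(
  outputDir: string,
  pageNumber: number,
  format: PageImageFormat = "jpg"
): string {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError(`pageNumber must be a positive integer, got ${pageNumber}`);
  }
  // Assumption: page numbers are padded to at least three digits.
  const padded = String(pageNumber).padStart(3, "0");
  // Drop trailing slashes so the joined path has no doubled separator.
  const dir = outputDir.replace(/\/+$/, "");
  return `${dir}/${format}/page-${padded}.${format}`;
}
```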
import { extractText } from "pdf-plus";
const text = await extractText("document.pdf");
console.log(`Extracted ${text.length} characters`);

import { extractImageFiles } from "pdf-plus";
// Extract and save embedded images from PDF
const imagePaths = await extractImageFiles(
"document.pdf",
"./extracted-images" // Output directory for embedded images
);
console.log(`Extracted ${imagePaths.length} embedded images`);

import { generatePageImages } from "pdf-plus";
// Render PDF pages to image files
const imagePaths = await generatePageImages(
"document.pdf",
"./page-images" // Output directory for page images
);
console.log(`Generated ${imagePaths.length} page images`);
// Each page becomes an image: page-001.jpg, page-002.jpg, etc.

import { extractPdfContent } from "pdf-plus";
const result = await extractPdfContent("document.pdf", {
extractImageFiles: true,
imageOutputDir: "./images",
// Enable optimization
optimizeImages: true,
imageOptimizer: "auto", // or 'sharp', 'imagemin'
imageQuality: 80,
imageProgressive: true,
// Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true).
// imageQuality is set once above; an object literal cannot repeat a key.
// When imageQuality is not set, JP2 conversion defaults to 100 (max quality).
convertJp2ToJpg: true,
verbose: true,
});
// Check optimization results
result.images.forEach((img) => {
console.log(`${img.filename}: Optimized and saved`);
});

import { extractPdfContent } from "pdf-plus";
// BASIC: Parallel processing (enabled by default)
const result = await extractPdfContent("document.pdf", {
extractImageFiles: true,
imageOutputDir: "./images",
parallelProcessing: true, // 1.5-3x faster
});
// ADVANCED: With worker threads for CPU-intensive operations
const advancedResult = await extractPdfContent("large-document.pdf", {
extractImageFiles: true,
imageOutputDir: "./images",
// Enable parallel processing (default: true)
parallelProcessing: true,
// Enable worker threads for true multi-threading (default: false)
useWorkerThreads: true, // 2.5-3.2x additional speedup!
autoScaleWorkers: true, // Auto-adjust based on system resources
maxWorkerThreads: 8, // Max worker threads (default: CPU cores - 1)
// Fine-tune concurrency for your workload
maxConcurrentPages: 20, // Process up to 20 pages simultaneously
maxConcurrentImages: 50, // Extract up to 50 images per page in parallel
maxConcurrentConversions: 5, // Convert up to 5 JP2 files simultaneously
maxConcurrentOptimizations: 5, // Optimize up to 5 images simultaneously
verbose: true,
});
// Performance gains (tested on Art Basel PDF, 54 images):
// - Baseline (sequential): 140ms
// - Parallel processing: 47ms (2.96x faster)
// - Parallel + Workers: 44ms (3.23x faster) 🚀

Performance Recommendations:
| PDF Size | Images | Recommended Settings |
|---|---|---|
| Small | <20 | parallelProcessing: true (default settings) |
| Medium | 20-50 | parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20 |
| Large | 50+ | parallelProcessing: true, useWorkerThreads: true, maxConcurrentPages: 20, maxConcurrentImages: 50 |
| Huge | 200+ | parallelProcessing: true, useWorkerThreads: true, maxWorkerThreads: 8, maxConcurrentPages: 30, maxConcurrentImages: 100 |
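
The table above can be expressed as a small lookup keyed on image count. This is only a sketch: `recommendSettings` and `ConcurrencySettings` are hypothetical names, not pdf-plus exports. The table's ranges overlap at exactly 50 images (Medium is "20-50", Large is "50+"), so the sketch assumes 50 counts as Medium and that Huge starts at 200.

```typescript
// Subset of ExtractionOptions that the recommendations table touches.
interface ConcurrencySettings {
  parallelProcessing: boolean;
  useWorkerThreads?: boolean;
  maxWorkerThreads?: number;
  maxConcurrentPages?: number;
  maxConcurrentImages?: number;
}

// Hypothetical helper: map an image count to the row of the table it falls in.
function recommendSettings(imageCount: number): ConcurrencySettings {
  if (imageCount >= 200) {
    // Huge
    return {
      parallelProcessing: true,
      useWorkerThreads: true,
      maxWorkerThreads: 8,
      maxConcurrentPages: 30,
      maxConcurrentImages: 100,
    };
  }
  if (imageCount > 50) {
    // Large (assumption: exactly 50 images is treated as Medium)
    return {
      parallelProcessing: true,
      useWorkerThreads: true,
      maxConcurrentPages: 20,
      maxConcurrentImages: 50,
    };
  }
  if (imageCount >= 20) {
    // Medium
    return { parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20 };
  }
  // Small: library defaults
  return { parallelProcessing: true };
}
```

The returned object could be spread into the options passed to `extractPdfContent`, for example `{ extractImageFiles: true, ...recommendSettings(54) }`.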
Worker Threads Benefits:
- ✅ True multi-threading (runs on separate CPU cores)
- ✅ 2.5-3.2x faster for CPU-intensive operations (JP2 conversion, optimization)
- ✅ Auto-scaling based on memory and CPU usage
- ✅ Opt-in (default: false) - no breaking changes
See PERFORMANCE.md and PHASE3-WORKERS.md for detailed benchmarks and optimization guide.
import { extractPdfContent } from "pdf-plus";
const result = await extractPdfContent("document.pdf", {
imageRefFormat: "📷 Image {index} on page {page}",
extractImageFiles: true,
useImagePaths: true,
});
// Text will contain: "📷 Image 1 on page 1" instead of "[IMAGE:img_1]"

import { PDFExtractor } from "pdf-plus";
const extractor = new PDFExtractor();
const result = await extractor.extract("large-document.pdf", {
extractText: true,
extractImages: true,
extractImageFiles: true,
imageOutputDir: "./extracted-images",
memoryLimit: "1GB",
batchSize: 10,
progressCallback: (progress) => {
console.log(
`Processing page ${progress.currentPage}/${progress.totalPages}`
);
},
});

import { extractPdfContent } from "pdf-plus";
import path from "path";
async function extractAcademicPaper(pdfPath: string) {
const result = await extractPdfContent(pdfPath, {
extractText: true,
extractImages: true,
extractImageFiles: true,
imageOutputDir: "./paper-images",
imageRefFormat: "Figure {index}: {name}",
verbose: true,
});
// Save text content
const fs = await import("fs");
fs.writeFileSync("./paper-text.txt", result.cleanText);
// Log extraction summary
console.log(`📄 Extracted from ${result.document.filename}:`);
console.log(` 📝 Text: ${result.document.textLength} characters`);
console.log(` 🖼️ Images: ${result.images.length} found`);
console.log(` 📊 Pages: ${result.document.pages}`);
return result;
}

import path from "path"; // needed for path.basename() below
import { PDFExtractor } from "pdf-plus";
import { glob } from "glob";
async function batchProcessPDFs(pattern: string) {
const extractor = new PDFExtractor("./cache"); // Enable caching
const pdfFiles = await glob(pattern);
const results = [];
for (const pdfFile of pdfFiles) {
console.log(`Processing: ${pdfFile}`);
try {
const result = await extractor.extract(pdfFile, {
extractText: true,
extractImages: true,
imageOutputDir: `./output/${path.basename(pdfFile, ".pdf")}`,
batchSize: 5, // Process 5 pages at a time
verbose: false,
});
results.push({
file: pdfFile,
success: true,
pages: result.document.pages,
images: result.images.length,
textLength: result.document.textLength,
});
} catch (error) {
console.error(`Failed to process ${pdfFile}:`, error);
results.push({
file: pdfFile,
success: false,
error: error instanceof Error ? error.message : String(error),
});
}
}
return results;
}

extractPdfContent(pdfPath, options)

Extract complete content from a PDF file.
Parameters:
- pdfPath (string) - Path to the PDF file
- options (ExtractionOptions) - Extraction configuration
Returns: Promise<ExtractionResult>

extractText(pdfPath)

Extract only text content (optimized for speed).
Returns: Promise<string>

extractImages(pdfPath)

Extract only image references.
Returns: Promise<ImageItem[]>

extractImageFiles(pdfPath, outputDir, options)

Extract and save embedded image files from a PDF.
Parameters:
- pdfPath - Path to the PDF file
- outputDir - Output directory path where embedded images will be saved
- options - Optional extraction options
Returns: Promise<string[]> - Array of saved file paths

generatePageImages(pdfPath, outputDir, options)

Render PDF pages to image files (page-to-image conversion).
Parameters:
- pdfPath - Path to the PDF file
- outputDir - Output directory path where page images will be saved
- options - Optional rendering options (pageImageFormat, pageImageDpi, pageRenderEngine, etc.)
Returns: Promise<string[]> - Array of absolute paths to generated page images
Example:
import { generatePageImages } from "pdf-plus";
const imagePaths = await generatePageImages("document.pdf", "./page-images", {
pageImageFormat: "jpg",
pageImageDpi: 150,
pageRenderEngine: "poppler",
});
console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/absolute/path/to/page-images/jpg/page-001.jpg', ...]

interface ExtractionOptions {
// Basic extraction options
extractText?: boolean; // Extract text content (default: true)
extractImages?: boolean; // Extract image references (default: true)
extractImageFiles?: boolean; // Save actual image files (default: false)
useImagePaths?: boolean; // Use file paths in references (default: false)
imageOutputDir?: string; // Directory for image files (default: './extracted-images')
imageRefFormat?: string; // Custom reference format (default: '[IMAGE:{id}]')
baseName?: string; // Base name for output files
verbose?: boolean; // Show detailed progress (default: false)
memoryLimit?: string; // Memory limit (e.g., '512MB', '1GB')
batchSize?: number; // Pages per batch (1-100)
progressCallback?: (progress: ProgressInfo) => void;
// Image optimization options
optimizeImages?: boolean; // Enable image optimization (default: false)
imageOptimizer?: "auto" | "sharp" | "imagemin"; // Optimizer to use (default: 'auto')
imageQuality?: number; // Image quality 1-100 (default: 80, JP2 conversion: 100)
imageProgressive?: boolean; // Progressive JPEG (default: true)
convertJp2ToJpg?: boolean; // Convert JP2 to JPG (default: true)
// Performance options (NEW!)
parallelProcessing?: boolean; // Enable parallel processing (default: true)
maxConcurrentPages?: number; // Max pages in parallel (default: 10)
maxConcurrentImages?: number; // Max images per page in parallel (default: 20)
maxConcurrentConversions?: number; // Max JP2 conversions in parallel (default: 5)
maxConcurrentOptimizations?: number; // Max optimizations in parallel (default: 5)
// Worker thread options (NEW! 🚀)
useWorkerThreads?: boolean; // Enable worker threads (default: false)
autoScaleWorkers?: boolean; // Auto-scale workers (default: true)
maxWorkerThreads?: number; // Max worker threads (default: CPU cores - 1)
minWorkerThreads?: number; // Min worker threads (default: 1)
memoryThreshold?: number; // Memory threshold 0-1 (default: 0.8)
cpuThreshold?: number; // CPU threshold 0-1 (default: 0.9)
workerTaskTimeout?: number; // Task timeout ms (default: 30000)
workerIdleTimeout?: number; // Idle timeout ms (default: 60000)
workerMemoryLimit?: number; // Memory per worker MB (default: 512)
enableWorkerForConversion?: boolean; // Workers for JP2 (default: true)
enableWorkerForOptimization?: boolean; // Workers for optimization (default: true)
enableWorkerForDecoding?: boolean; // Workers for decoding (default: true)
}

Performance Options Explained:
Parallel Processing:
- parallelProcessing: Enable/disable parallel processing. Enabled by default for 1.5-3x speedup.
- maxConcurrentPages: How many pages to process simultaneously. Higher values = faster for multi-page PDFs, but more memory usage.
- maxConcurrentImages: How many images per page to extract in parallel. Increase for pages with many images.
- maxConcurrentConversions: How many JP2→JPG conversions to run simultaneously. Keep moderate (5-10) to avoid memory issues.
- maxConcurrentOptimizations: How many image optimizations to run simultaneously. Keep moderate (5-10) as optimization is CPU-intensive.
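
To see what a limit such as maxConcurrentPages means in practice, here is a minimal sketch of bounded concurrency. It is a common general pattern, not pdf-plus's actual implementation, and `mapWithLimit` is a hypothetical name. A fixed number of "lanes" pull items from a shared index, so no more than `limit` tasks are in flight at any time.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order regardless of completion order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  // Each lane claims the next item, processes it, and repeats until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A higher limit keeps more work in flight (faster on multi-page PDFs) at the cost of holding more intermediate data in memory, which is the trade-off the option descriptions above refer to.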
Worker Threads (NEW! 🚀):
- useWorkerThreads: Enable true multi-threading using Node.js worker threads. Provides 2.5-3.2x additional speedup for CPU-intensive operations. Default: false (opt-in).
- autoScaleWorkers: Automatically adjust worker count based on system memory and CPU usage. Default: true.
- maxWorkerThreads: Maximum number of worker threads. Default: CPU cores - 1.
- minWorkerThreads: Minimum number of worker threads to keep alive. Default: 1.
- memoryThreshold: Memory usage threshold (0-1) before scaling down workers. Default: 0.8 (80%).
- cpuThreshold: CPU usage threshold (0-1) before scaling down workers. Default: 0.9 (90%).
- workerTaskTimeout: Maximum time (ms) for a worker task before timeout. Default: 30000 (30 seconds).
- workerIdleTimeout: Time (ms) before idle workers are terminated. Default: 60000 (60 seconds).
- workerMemoryLimit: Memory limit (MB) per worker thread. Default: 512MB.
- enableWorkerForConversion: Use workers for JP2 conversion. Default: true.
- enableWorkerForOptimization: Use workers for image optimization. Default: true.
- enableWorkerForDecoding: Use workers for image decoding. Default: true.
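
The interaction between memoryThreshold, cpuThreshold, and the min/max worker bounds can be illustrated with a small decision function. The policy below is an assumption made for illustration (shed one worker when either threshold is exceeded, add one when more tasks are queued than there are workers); the library's real auto-scaling logic is not documented here and may differ. `nextWorkerCount`, `ScalingConfig`, and `SystemLoad` are hypothetical names.

```typescript
// Option names mirror the worker-thread settings listed above.
interface ScalingConfig {
  minWorkerThreads: number;
  maxWorkerThreads: number;
  memoryThreshold: number; // fraction 0-1
  cpuThreshold: number; // fraction 0-1
}

// Hypothetical input shape: current system load as fractions between 0 and 1.
interface SystemLoad {
  memoryUsage: number;
  cpuUsage: number;
}

// Assumed policy: scale down by one under pressure, up by one when work is
// queued, and always stay within [minWorkerThreads, maxWorkerThreads].
function nextWorkerCount(
  current: number,
  load: SystemLoad,
  cfg: ScalingConfig,
  pendingTasks: number
): number {
  const clamp = (n: number): number =>
    Math.min(cfg.maxWorkerThreads, Math.max(cfg.minWorkerThreads, n));
  const underPressure =
    load.memoryUsage > cfg.memoryThreshold || load.cpuUsage > cfg.cpuThreshold;
  if (underPressure) return clamp(current - 1);
  if (pendingTasks > current) return clamp(current + 1);
  return clamp(current);
}
```

With the documented defaults (0.8 memory, 0.9 CPU), a pool of 4 workers at 85% memory usage would drop to 3 under this policy, while a pool already at minWorkerThreads would stay where it is.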
Use these placeholders in imageRefFormat:
- {id} - Unique image ID (e.g., img_1)
- {name} - Original image name from PDF
- {page} - Page number
- {index} - Global image index
- {path} - File path (when extractImageFiles is true)
Examples:
- [IMAGE:{id}] → [IMAGE:img_1]
- 📷 Image {index} → 📷 Image 1
- {name} on page {page} → artwork_1 on page 5
- <img src="{path}"> → <img src="./images/img_1.jpg">
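
The substitution shown in the examples above can be sketched as a single regular-expression replace. This illustrates the placeholder semantics only and is not the library's own code: `formatImageRef` and `ImageRefValues` are hypothetical names, and leaving `{path}` untouched when no path is available is an assumption.

```typescript
// Values available for substitution; `path` is only known when files are saved.
interface ImageRefValues {
  id: string;
  name: string;
  page: number;
  index: number;
  path?: string;
}

// Replace each known {placeholder} with its value. Unknown placeholders, and
// {path} when no path is set, are left in the output unchanged (assumption).
function formatImageRef(template: string, values: ImageRefValues): string {
  return template.replace(
    /\{(id|name|page|index|path)\}/g,
    (match: string, key: keyof ImageRefValues) => {
      const value = values[key];
      return value === undefined ? match : String(value);
    }
  );
}
```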
Extract and optimize images in one step using Sharp or Imagemin:
import { extractPdfContent } from "pdf-plus";
const result = await extractPdfContent("document.pdf", {
extractImageFiles: true,
imageOutputDir: "./images",
// Enable optimization
optimizeImages: true,
imageOptimizer: "auto", // Automatically selects best available
imageQuality: 80,
imageProgressive: true,
// Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
convertJp2ToJpg: true,
verbose: true,
});
// Output:
// 🖼️ Extracting images from: document.pdf
// 📊 Processing 50 pages with PDF-lib engine
// 💾 Extracted real image: img_p1_1.jpg (245KB)
// 🔄 Converting 16 JP2 images to JPG...
// 🔄 Converted JP2 → JPG: img_p2_2.jpg (24026 → 18500 bytes)
// 🎨 Optimizing 54 images...
// ✅ img_p1_1.jpg: 251904 → 184320 bytes (-26.8%) [sharp]
// ✅ img_p2_2.jpg: 18500 → 15200 bytes (-17.8%) [sharp]

JP2 (JPEG 2000) files are not widely supported by browsers and image tools. The library automatically converts them to standard JPG format:
const result = await extractPdfContent("document.pdf", {
extractImageFiles: true,
convertJp2ToJpg: true, // Default: true
imageQuality: 100, // Default: 100 (maximum quality preservation)
});
// All JP2 images are now JPG files with better compatibility

Quality Preservation:
- Default quality: 100 - Preserves maximum quality from JP2
- Use lower values (80-90) if you want additional compression
- Original JP2 files are deleted after successful conversion
Benefits:
- ✅ Better browser compatibility
- ✅ Can be optimized by Sharp/Imagemin
- ✅ Maximum quality preserved (quality=100)
- ✅ Works everywhere
| Optimizer | Speed | Quality | Formats | Platform |
|---|---|---|---|---|
| sharp | Fast | Excellent | JPG, PNG, WebP | Native (requires compilation) |
| imagemin | Medium | Excellent | JPG, PNG, GIF, SVG | Cross-platform |
| auto | Variable | Excellent | All supported | Tries sharp first, falls back to imagemin |
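
The "auto" row describes a preference order: sharp first, then imagemin. A minimal sketch of that selection follows. `resolveOptimizer` is a hypothetical function, and returning `null` when nothing suitable is installed is an assumption; the table does not say what the library does in that case.

```typescript
type OptimizerName = "sharp" | "imagemin";
type OptimizerChoice = OptimizerName | "auto";

// Pick the optimizer to use given what is installed.
// "auto" prefers sharp and falls back to imagemin, as the table describes.
function resolveOptimizer(
  requested: OptimizerChoice,
  available: ReadonlySet<OptimizerName>
): OptimizerName | null {
  const preference: OptimizerName[] =
    requested === "auto" ? ["sharp", "imagemin"] : [requested];
  for (const candidate of preference) {
    if (available.has(candidate)) return candidate;
  }
  // Assumption: no suitable optimizer means the caller skips optimization.
  return null;
}
```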
// Maximum compression (slower, smaller files)
const smallest = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 70,
});
// Balanced (recommended)
const balanced = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 80, // Default
});
// Fast optimization with Sharp
const fast = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageOptimizer: "sharp",
  imageQuality: 85,
});

const text = await extractText("document.pdf");
// ~40% faster than combined mode

const images = await extractImages("document.pdf");
// ~20% faster than combined mode

const result = await extractPdfContent("document.pdf");
// Full extraction with text and image references

import { extractPdfContent } from "pdf-plus";
try {
const result = await extractPdfContent("document.pdf");
} catch (error) {
if (error.code === "VALIDATION_ERROR") {
console.error("Configuration error:", error.validationErrors);
} else if (error.code === "EXTRACTION_ERROR") {
console.error("Extraction failed:", error.message);
} else {
console.error("Unexpected error:", error);
}
}

# Install dependencies
pnpm install
# Build the library
pnpm run build
# Lint and format
pnpm run lint:fix
pnpm run format
# Type checking
pnpm run check

- Node.js >= 18.0.0
- TypeScript >= 5.0 (for development)
MIT
Contributions are welcome! Please read our contributing guidelines and submit pull requests to our repository.
Make sure you're using the correct import syntax for your environment:
// ESM (recommended)
import { extractPdfContent } from "pdf-plus";
// CommonJS
const { extractPdfContent } = require("pdf-plus");

For large documents, limit memory use and process pages in batches:
const result = await extractPdfContent("large-document.pdf", {
memoryLimit: "512MB",
batchSize: 5,
useCache: true,
});

Try different engines:
const result = await extractPdfContent("document.pdf", {
imageEngine: "poppler", // or 'pdf-lib', 'auto'
extractImageFiles: true,
});

Some PDFs may have encoding issues. Try:
const result = await extractPdfContent("document.pdf", {
extractText: true,
textEngine: "pdfjs", // Alternative engine
verbose: true, // See detailed logs
});

- Use specific extraction modes for better performance:

  // Text only (fastest)
  const text = await extractText("document.pdf");
  // Images only
  const images = await extractImages("document.pdf");

- Enable caching for repeated operations:

  const extractor = new PDFExtractor("./cache");

- Process pages in batches for large documents:

  const result = await extractPdfContent("large.pdf", {
    batchSize: 10,
    memoryLimit: "1GB",
  });
- Check the Issues page
- Review examples for common use cases
- Enable verbose logging for debugging:
{ verbose: true }
- OCR Support: Text extraction from image-based PDFs
- Advanced Text Analysis: Font detection, text classification
- Streaming API: Process large documents efficiently
- Cloud Integration: Direct integration with cloud storage
- CLI Tool: Command-line interface for batch processing
- Web Worker Support: Browser-based extraction
- Plugin System: Extensible architecture for custom extractors
- OCR integration with Tesseract.js
- Advanced image processing options
- Streaming extraction API
- Performance optimizations
- Browser compatibility layer
- CLI tool development
See CHANGELOG.md for detailed version history.