link-foundation/pdf-to-markdown

pd2md

pd2md is a JavaScript/TypeScript library for converting PDF documents to Markdown. It is designed to be extensible and configurable, and it supports the major JavaScript runtimes (Node.js, Bun, Deno).

Features

  • High-quality conversion: Preserves document structure, headings, lists, links, and formatting
  • Multiple input sources: Buffer, file path, URL, or readable stream
  • Configurable output: Customize heading detection, list handling, link extraction
  • Plugin architecture: Extend with custom transformers and post-processors
  • Multi-runtime support: Works with Node.js, Bun, and Deno
  • TypeScript-first: Full TypeScript definitions included
  • CLI tool: Command-line interface for batch processing
  • 100% test coverage: Comprehensive test suite
  • Zero configuration: Works out of the box with sensible defaults
  • Memory efficient: Stream-based processing for large files

Installation

# npm
npm install pd2md

# yarn
yarn add pd2md

# pnpm
pnpm add pd2md

# bun
bun add pd2md

Quick Start

import { pdfToMarkdown } from 'pd2md';
import fs from 'fs';

// From file path
const fromPath = await pdfToMarkdown('./document.pdf');
console.log(fromPath);

// From buffer
const buffer = fs.readFileSync('./document.pdf');
const fromBuffer = await pdfToMarkdown(buffer);

// With options
const withOptions = await pdfToMarkdown('./document.pdf', {
  preserveLinks: true,
  detectHeadings: true,
  pageBreaks: true,
});

API Reference

pdfToMarkdown(input, options?)

Converts a PDF to Markdown format.

Parameters:

  • input (string | Buffer | Uint8Array | URL): PDF source - file path, buffer, or URL
  • options (object, optional): Conversion options

Options:

Option           Type      Default  Description
preserveLinks    boolean   true     Extract and preserve hyperlinks
detectHeadings   boolean   true     Detect headings based on font size
pageBreaks       boolean   false    Add page break markers (---)
preserveLists    boolean   true     Detect and format lists
maxPages         number    0        Maximum pages to process (0 = all)
startPage        number    1        First page to process
endPage          number    0        Last page to process (0 = last)
verbose          boolean   false    Enable verbose logging
plugins          Plugin[]  []       Custom transformation plugins
headingFontSize  object    {}       Custom font size thresholds for heading levels
lineSpacing      number    1.5      Line spacing threshold for paragraph separation

Returns: Promise<string> - The converted Markdown content
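
The startPage, endPage, and maxPages options interact, so a small sketch may help show how they could combine. The helper below is hypothetical and not part of the pd2md API. It assumes that page numbers are 1-based, that endPage is inclusive, and that maxPages caps the count starting from startPage.

```javascript
// Hypothetical helper: not part of the documented pd2md API.
// Resolves startPage / endPage / maxPages into a concrete list of page numbers.
function resolvePageRange(totalPages, { startPage = 1, endPage = 0, maxPages = 0 } = {}) {
  // Clamp the first page into [1, totalPages].
  const first = Math.min(Math.max(startPage, 1), totalPages);
  // endPage = 0 means "through the last page".
  let last = endPage > 0 ? Math.min(endPage, totalPages) : totalPages;
  // maxPages = 0 means "no limit"; otherwise cap the number of pages.
  if (maxPages > 0) {
    last = Math.min(last, first + maxPages - 1);
  }
  // An empty document or an inverted range yields no pages.
  if (totalPages < 1 || last < first) return [];
  return Array.from({ length: last - first + 1 }, (_, i) => first + i);
}
```

With a 10-page document, `resolvePageRange(10, { startPage: 3, endPage: 5 })` yields pages 3 through 5, and `resolvePageRange(10, { startPage: 8, maxPages: 5 })` stops at page 10 because the document ends there.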

createConverter(options?)

Creates a reusable converter instance with preset options.

import { createConverter } from 'pd2md';

const converter = createConverter({
  preserveLinks: true,
  pageBreaks: true,
});

const md1 = await converter.convert('./doc1.pdf');
const md2 = await converter.convert('./doc2.pdf');

Plugins

Create custom plugins to extend the conversion process:

import { pdfToMarkdown } from 'pd2md';

const myPlugin = {
  name: 'custom-headers',
  // Transform text items before markdown generation
  transformItems: (items) => {
    return items.map((item) => {
      if (item.text.startsWith('Chapter')) {
        return { ...item, isHeading: true, level: 1 };
      }
      return item;
    });
  },
  // Post-process the generated markdown: promote any remaining
  // "Chapter N" lines that were not already rendered as headings
  postProcess: (markdown) => {
    return markdown.replace(/^Chapter (\d+)/gm, '# Chapter $1');
  },
};

const markdown = await pdfToMarkdown('./book.pdf', {
  plugins: [myPlugin],
});
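
To make the two hooks concrete, here is a minimal sketch of how a converter might apply them. The hook names transformItems and postProcess come from the plugin example above. The execution order, the runPlugins function, and the stand-in renderer are assumptions for illustration, not the library's actual internals.

```javascript
// Hypothetical pipeline runner; the order of hook execution is an assumption.
function runPlugins(items, render, plugins = []) {
  // Apply every transformItems hook in registration order.
  const transformed = plugins.reduce(
    (acc, plugin) => (plugin.transformItems ? plugin.transformItems(acc) : acc),
    items,
  );
  // Render the (possibly modified) items to Markdown.
  const markdown = render(transformed);
  // Apply every postProcess hook to the rendered text.
  return plugins.reduce(
    (acc, plugin) => (plugin.postProcess ? plugin.postProcess(acc) : acc),
    markdown,
  );
}

// Minimal stand-in renderer so the sketch runs without a PDF.
const renderItems = (items) =>
  items
    .map((item) => (item.isHeading ? `${'#'.repeat(item.level)} ${item.text}` : item.text))
    .join('\n\n');

const chapterPlugin = {
  name: 'chapter-headings',
  transformItems: (items) =>
    items.map((item) =>
      item.text.startsWith('Chapter') ? { ...item, isHeading: true, level: 1 } : item,
    ),
};
const trimPlugin = { name: 'trim', postProcess: (markdown) => markdown.trim() + '\n' };

const output = runPlugins(
  [{ text: 'Chapter 1' }, { text: 'It was a dark night.' }],
  renderItems,
  [chapterPlugin, trimPlugin],
);
```

Running item transforms before rendering and text transforms after it keeps each plugin focused on one representation, so a plugin only needs to define the hook it cares about.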

CLI Usage

# Convert single file
npx pd2md document.pdf -o output.md

# Convert directory
npx pd2md ./pdfs/ -o ./markdown/ --recursive

# With options
npx pd2md document.pdf --no-links --page-breaks

# Pipe from stdin
cat document.pdf | npx pd2md > output.md

CLI Options:

Usage: pd2md [input] [options]

Options:
  -o, --output <path>     Output file or directory
  -r, --recursive         Process directories recursively
  --no-links              Don't preserve hyperlinks
  --no-headings           Don't detect headings
  --page-breaks           Add page break markers
  --start-page <n>        First page to process
  --end-page <n>          Last page to process
  --verbose               Enable verbose output
  -h, --help              Show help
  -v, --version           Show version

Advanced Usage

Stream Processing

import { createReadStream } from 'fs';
import { pdfStreamToMarkdown } from 'pd2md';

const stream = createReadStream('./large-document.pdf');
const markdown = await pdfStreamToMarkdown(stream);

Progress Callbacks

import { pdfToMarkdown } from 'pd2md';

const markdown = await pdfToMarkdown('./document.pdf', {
  onProgress: ({ currentPage, totalPages }) => {
    console.log(`Processing page ${currentPage}/${totalPages}`);
  },
  onPageComplete: ({ page, content }) => {
    console.log(`Page ${page} converted: ${content.length} chars`);
  },
});
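
The callback payloads can be illustrated with a small per-page loop. This is a sketch under stated assumptions: convertPages and convertPage are hypothetical names, the loop is synchronous for brevity (the real conversion is asynchronous), and it assumes onProgress fires before a page is converted and onPageComplete fires after.

```javascript
// Hypothetical per-page loop; convertPage stands in for the real page converter.
function convertPages(pages, convertPage, { onProgress, onPageComplete } = {}) {
  const totalPages = pages.length;
  const parts = [];
  pages.forEach((pageData, index) => {
    const page = index + 1; // Callbacks receive 1-based page numbers.
    if (onProgress) onProgress({ currentPage: page, totalPages });
    const content = convertPage(pageData);
    parts.push(content);
    if (onPageComplete) onPageComplete({ page, content });
  });
  // Join pages with a blank line between them.
  return parts.join('\n\n');
}

const events = [];
const result = convertPages(['first page', 'second page'], (text) => text.toUpperCase(), {
  onProgress: ({ currentPage, totalPages }) => events.push(`start ${currentPage}/${totalPages}`),
  onPageComplete: ({ page, content }) => events.push(`done ${page}: ${content.length} chars`),
});
```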

Custom Font Size Mapping

import { pdfToMarkdown } from 'pd2md';

const markdown = await pdfToMarkdown('./document.pdf', {
  headingFontSize: {
    h1: 24, // Font size >= 24 becomes H1
    h2: 20, // Font size >= 20 becomes H2
    h3: 16, // Font size >= 16 becomes H3
    h4: 14, // Font size >= 14 becomes H4
  },
});
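
The comments above describe each threshold as an inclusive lower bound. A minimal sketch of that lookup, assuming the largest matching threshold wins and anything below the smallest threshold is body text, could look like this. The headingLevelFor function is hypothetical and only illustrates the rule.

```javascript
// Hypothetical helper, assuming each threshold is an inclusive lower bound.
function headingLevelFor(fontSize, thresholds) {
  // Sort levels from the largest threshold to the smallest so the first match wins.
  const levels = Object.entries(thresholds)
    .map(([key, minSize]) => ({ level: Number(key.slice(1)), minSize }))
    .sort((a, b) => b.minSize - a.minSize);
  const match = levels.find(({ minSize }) => fontSize >= minSize);
  return match ? match.level : 0; // 0 means body text, not a heading.
}

const thresholds = { h1: 24, h2: 20, h3: 16, h4: 14 };
const level22 = headingLevelFor(22, thresholds); // 22 is below 24 but at least 20
const level12 = headingLevelFor(12, thresholds); // below every threshold
```

Here a 22 pt run maps to level 2 and a 12 pt run maps to level 0, which a renderer would treat as ordinary paragraph text.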

Competitors & Alternatives

This library was designed to be the most feature-complete PDF to Markdown converter. Here's how it compares to alternatives:

Library                    Language    Stars  License    Plugins  CLI  TypeScript  Active
pd2md (this)               JavaScript  -      Unlicense  Yes      Yes  Yes         Yes
@opendocsg/pdf2md          JavaScript  463    MIT        No       Yes  No          Yes
jzillmann/pdf-to-markdown  JavaScript  1514   MIT        No       No   No          Yes
zyocum/pdf2md              Python      -      -          No       Yes  N/A         Yes
leoneversberg/pdf2md_llm   Python      -      -          No       Yes  N/A         Yes
FutureUnreal/mcp-pdf2md    Python      -      -          No       Yes  N/A         Yes
Diselorya/pdf2md           Python      -      -          No       Yes  N/A         Yes
VikParuchuri/marker        Python      -      GPL        No       Yes  N/A         Yes

Feature Comparison

Feature                 pd2md  @opendocsg/pdf2md  jzillmann/pdf-to-markdown
Heading detection       Yes    Yes                Yes
Link preservation       Yes    Yes                Yes
List detection          Yes    Partial            Partial
Table extraction        Yes    No                 No
Plugin system           Yes    No                 No
Streaming API           Yes    No                 No
Progress callbacks      Yes    No                 No
Custom font thresholds  Yes    No                 No
Page range selection    Yes    No                 No
Multi-runtime           Yes    Node.js only       Browser only

Alternative Package Names

In case pd2md is not available or you prefer a different name, here are the 29 alternative names considered for this project:

  1. pdfmark - PDF to Markdown
  2. pdftomd - PDF to MD
  3. pdf2mark - PDF to Markdown
  4. pdfconv - PDF Converter
  5. mdconv - Markdown Converter
  6. pdf-mark - PDF Mark
  7. pd2mk - PD to MK
  8. p2md - P to MD
  9. pdf2m - PDF to M
  10. p2m - P to M
  11. to-md - To Markdown
  12. d2md - Document to MD
  13. dmkd - Document Markdown
  14. depdf - De-PDF
  15. re-pdf - Re-PDF
  16. pdf-doc - PDF Document
  17. doc-mark - Document Mark
  18. pdftext - PDF Text
  19. pdfread - PDF Read
  20. readpdf - Read PDF
  21. textpdf - Text PDF
  22. makemark - Make Markdown
  23. doc-to-md - Document to MD
  24. pdftex - PDF Text
  25. pdfmarkdown - PDF Markdown
  26. pdf-parse-md - PDF Parse MD
  27. md-from-pdf - MD from PDF
  28. unmake-pdf - Unmake PDF
  29. pdf-md-converter - PDF MD Converter

Project Structure

pd2md/
├── src/
│   ├── index.js           # Main entry point
│   ├── index.d.ts         # TypeScript definitions
│   ├── converter.js       # Core converter logic
│   ├── parser.js          # PDF parsing utilities
│   ├── transformer.js     # Text transformation
│   ├── plugins/           # Built-in plugins
│   │   ├── headings.js    # Heading detection
│   │   ├── links.js       # Link extraction
│   │   ├── lists.js       # List detection
│   │   └── tables.js      # Table extraction
│   └── cli.js             # CLI implementation
├── tests/
│   ├── converter.test.js  # Converter tests
│   ├── parser.test.js     # Parser tests
│   ├── plugins.test.js    # Plugin tests
│   └── fixtures/          # Test PDF files
├── examples/
│   ├── basic.js           # Basic usage
│   ├── advanced.js        # Advanced features
│   └── plugins.js         # Custom plugins
└── docs/
    ├── api.md             # API documentation
    └── plugins.md         # Plugin development guide

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes
  4. Create a changeset: bun run changeset
  5. Commit your changes (pre-commit hooks will run automatically)
  6. Push and create a Pull Request

License

Unlicense - Public Domain

This is free and unencumbered software released into the public domain. You can copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial.
