docutext

Zero-dependency PDF text extraction built for RAG and AI pipelines. Parses PDFs from scratch -- no PDF.js, no WASM, no native addons. Works in Node.js and the browser.

Performance

Library	Small PDF (31 KB)	Large PDF (1.3 MB)	Bundle (gzip)	Dependencies
docutext	3 ms	40 ms	~24 KB	0
pdfjs-dist	5 ms	244 ms	~1.3 MB	0 (but large)
pdf-parse	5 ms	279 ms	~780 KB	1 (pdfjs)
unpdf	6 ms	232 ms	~320 KB	2 (pdfjs)

Median of 3 runs, Node.js, Apple Silicon.
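To reproduce a "median of N runs" figure on your own documents, a small timing helper along these lines may be enough. This is a sketch and not the project's benchmark harness: `medianRuntimeMs` and `median` are hypothetical names introduced here, and the commented usage assumes the `DocuText.load` call shown in Quick Start.

```typescript
// Hypothetical helpers for illustration; not part of docutext.

// Returns the median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new RangeError('median of an empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Runs `task` `runs` times and returns the median wall-clock time in ms.
async function medianRuntimeMs(
  task: () => Promise<unknown> | unknown,
  runs = 3,
): Promise<number> {
  if (runs < 1) throw new RangeError('runs must be at least 1');
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await task();
    samples.push(performance.now() - start);
  }
  return median(samples);
}

// Usage (assumes docutext is installed and document.pdf exists):
//   const ms = await medianRuntimeMs(async () => {
//     const doc = await DocuText.load('document.pdf');
//     return doc.text;
//   });
```

With only three runs the median simply discards the fastest and slowest sample, so a warm-up run before measuring can still matter.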

docutext is purpose-built for text extraction in RAG/AI workflows -- not a general PDF toolkit. It does one thing and does it fast.

Install

# Node.js (zero dependencies)
pnpm add docutext

# Browser / bundler (add fflate for decompression, ~3 KB gzip)
pnpm add docutext fflate

Requires Node.js 18+. Browser builds target ES2020+.

Quick Start

import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf');
console.log(doc.text);

Structured Markdown

import { DocuText } from 'docutext';
import { docToMarkdown, pageToMarkdown } from 'docutext/markdown';

const doc = await DocuText.load('document.pdf');
console.log(docToMarkdown(doc));       // headings, bold, links
console.log(pageToMarkdown(doc.pages[0]));

Browser

import { DocuText } from 'docutext';

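// Fetch the PDF and fail early on HTTP errors instead of parsing an error page.
const response = await fetch('/document.pdf');
if (!response.ok) throw new Error(`Failed to fetch PDF: ${response.status}`);
const bytes = new Uint8Array(await response.arrayBuffer());
const doc = DocuText.fromBuffer(bytes);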
console.log(doc.text);

Page-by-page

for (const page of doc) {
  console.log(`Page ${page.number}: ${page.text}`);
}
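For RAG pipelines, the page loop above is a natural place to split text into chunks that keep their page number as metadata. The sketch below is one way to do that. `chunkPages`, `PageLike`, and `Chunk` are hypothetical names, not docutext APIs, and it assumes paragraphs in `page.text` are separated by blank lines, which may not hold for every PDF.

```typescript
// Hypothetical chunker for illustration; not part of docutext.
// It only relies on the `number` and `text` fields used in the loop above.
interface PageLike {
  number: number;
  text: string;
}

interface Chunk {
  page: number;  // page the chunk came from
  index: number; // position of the chunk within that page
  text: string;
}

function chunkPages(pages: Iterable<PageLike>, maxChars = 1000): Chunk[] {
  if (maxChars < 1) throw new RangeError('maxChars must be at least 1');
  const chunks: Chunk[] = [];
  for (const page of pages) {
    // Assumption: blank lines mark paragraph boundaries.
    const paragraphs = page.text
      .split(/\n\s*\n/)
      .map((p) => p.trim())
      .filter((p) => p.length > 0);

    let buffer = '';
    let index = 0;
    const flush = () => {
      if (buffer) {
        chunks.push({ page: page.number, index: index++, text: buffer });
        buffer = '';
      }
    };

    for (const para of paragraphs) {
      // Hard-split paragraphs that are longer than the limit.
      for (let start = 0; start < para.length; start += maxChars) {
        const piece = para.slice(start, start + maxChars);
        // Start a new chunk if appending would exceed the limit.
        if (buffer && buffer.length + 2 + piece.length > maxChars) flush();
        buffer = buffer ? `${buffer}\n\n${piece}` : piece;
      }
    }
    flush(); // chunks never span pages
  }
  return chunks;
}
```

Because the function accepts any iterable of page-like objects, `chunkPages(doc)` should work with the iteration shown above, and each chunk's `page` field can be stored alongside its embedding for citation.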

Key Features

Zero dependencies in Node.js. Single optional peer dep (fflate) for browser.
~24 KB gzipped -- 50x smaller than pdfjs-dist.
Roughly 6-7x faster than the alternatives on the 1.3 MB benchmark PDF, and about 2x faster on the 31 KB one (see Performance).
Plain text + structured markdown output (headings inferred from font size, bold/italic, links).
Column-aware text flow -- side-by-side columns (e.g. signature blocks) are read column-first to keep related data together.
Lazy extraction -- accessing page.text only processes that page.
Full PDF parsing -- xref tables, stream filters, font encodings, ToUnicode CMaps, form XObjects.
Encrypted PDF support -- handles permission-encrypted PDFs (empty password, RC4/AES-128).
Runs everywhere -- Node.js and browser via conditional exports.
ESM only -- modern module system, tree-shakeable.
TypeScript first -- full type definitions included.
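The lazy extraction item above describes a familiar pattern: defer the expensive work until a property is first read, then cache the result. The sketch below illustrates that general pattern only. `LazyPage` and its `extract` callback are hypothetical and say nothing about how docutext implements it internally.

```typescript
// Illustrative only: a memoized getter, not docutext's implementation.
class LazyPage {
  private cached: string | undefined;

  constructor(
    readonly number: number,
    // Hypothetical callback that does the expensive per-page extraction.
    private readonly extract: () => string,
  ) {}

  // The first access runs the extractor; later accesses reuse the result.
  get text(): string {
    if (this.cached === undefined) {
      this.cached = this.extract();
    }
    return this.cached;
  }
}
```

With this shape, constructing many page objects is cheap, and only the pages whose `text` is actually read pay the extraction cost.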

Documentation

For the full API reference, architecture details, and a live playground, see the documentation site.

License

MIT
