A PDF text extraction library written in Zig.
- Memory-mapped file reading, zero-copy where possible
- Streaming text extraction with efficient arena allocation
- Multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength
- Font encoding support: WinAnsi, MacRoman, ToUnicode CMap
- XRef table and XRef stream parsing (XRef streams are a PDF 1.5+ feature)
- Configurable error handling (strict or permissive)
- Structure tree extraction for tagged PDFs (PDF/UA)
- Fast stream order fallback for non-tagged PDFs
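To make the decompression filters listed above concrete, here is a minimal Python sketch of what three of the simpler PDF stream filters do, plus FlateDecode via `zlib`. It illustrates the filter formats only; it is not zpdf's implementation, and the function names and the `apply_filters` helper are hypothetical.

```python
import base64
import binascii
import zlib


def decode_ascii_hex(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks end of data."""
    hex_digits = data.split(b">", 1)[0]
    hex_digits = b"".join(hex_digits.split())  # drop all whitespace
    if len(hex_digits) % 2:
        hex_digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(hex_digits)


def decode_ascii85(data: bytes) -> bytes:
    """ASCII85Decode: base-85 text terminated by '~>'."""
    body = data.split(b"~>", 1)[0]
    return base64.a85decode(body)


def decode_run_length(data: bytes) -> bytes:
    """RunLengthDecode: a length byte followed by a literal run or a repeated byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length+1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeat the next byte 257-length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


def decode_flate(data: bytes) -> bytes:
    """FlateDecode: zlib/deflate-compressed data."""
    return zlib.decompress(data)


# Hypothetical dispatch table keyed by the PDF filter name.
DECODERS = {
    "ASCIIHexDecode": decode_ascii_hex,
    "ASCII85Decode": decode_ascii85,
    "RunLengthDecode": decode_run_length,
    "FlateDecode": decode_flate,
}


def apply_filters(data: bytes, filters: list) -> bytes:
    """Apply a stream's filter chain in the order it is listed."""
    for name in filters:
        data = DECODERS[name](data)
    return data
```

A stream can chain several filters, so a stream declared with `ASCIIHexDecode` followed by `FlateDecode` is first hex-decoded and then inflated. LZW is omitted here because Python's standard library has no decoder for it.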
Text extraction performance on Apple M4 Pro (reading order):
| Document | Pages | zpdf | MuPDF | Speedup |
|---|---|---|---|---|
| Intel SDM | 5,252 | 582ms | 2,152ms | 3.7x |
| Pandas Docs | 3,743 | 640ms | 1,130ms | 1.8x |
| C++ Standard | 2,134 | 438ms | 1,007ms | 2.3x |
| PDF Reference 1.7 | 1,310 | 236ms | 1,481ms | 6.3x |
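The Speedup column is the MuPDF time divided by the zpdf time. The short Python check below recomputes it from the timings in the table; the `speedup` helper is only an illustration and is not part of zpdf.

```python
# (zpdf ms, MuPDF ms) per document, copied from the table above.
timings_ms = {
    "Intel SDM": (582, 2152),
    "Pandas Docs": (640, 1130),
    "C++ Standard": (438, 1007),
    "PDF Reference 1.7": (236, 1481),
}


def speedup(zpdf_ms: float, mupdf_ms: float) -> float:
    """How many times faster zpdf is than MuPDF for one document."""
    return mupdf_ms / zpdf_ms


for name, (zpdf_ms, mupdf_ms) in timings_ms.items():
    print(f"{name}: {speedup(zpdf_ms, mupdf_ms):.1f}x")
```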
Build with `zig build -Doptimize=ReleaseFast` for best performance.
Requirements:
- Zig 0.15.2 or later

Build:

```bash
zig build       # Build library and CLI
zig build test  # Run tests
```

Zig library usage:

```zig
const std = @import("std");
const zpdf = @import("zpdf");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const doc = try zpdf.Document.open(allocator, "file.pdf");
    defer doc.close();

    // Buffered stdout writer; flush before exit so no output is lost.
    var buf: [4096]u8 = undefined;
    var writer = std.fs.File.stdout().writer(&buf);
    defer writer.interface.flush() catch {};

    // Extract every page in order.
    for (0..doc.pages.items.len) |page_num| {
        try doc.extractText(page_num, &writer.interface);
    }
}
```

CLI usage:

```bash
zpdf extract document.pdf             # Extract all pages (uses structure tree for reading order)
zpdf extract -p 1-10 document.pdf     # Extract pages 1-10
zpdf extract -o out.txt document.pdf  # Output to file
zpdf info document.pdf                # Show document info
zpdf bench document.pdf               # Run benchmark
```

Python usage:

```python
import zpdf

with zpdf.Document("file.pdf") as doc:
    print(doc.page_count)

    # Single page
    text = doc.extract_page(0)

    # All pages (reading order by default)
    all_text = doc.extract_all()

    # Page info
    info = doc.get_page_info(0)
    print(f"{info.width}x{info.height}")
```

Build the shared library first:

```bash
zig build -Doptimize=ReleaseFast
PYTHONPATH=python python3 examples/basic.py
```

Project layout:

```
src/
├── root.zig         # Document API and core types
├── main.zig         # CLI entry point
├── capi.zig         # C ABI exports for FFI
├── wapi.zig         # WASM API exports
├── parser.zig       # PDF object parser
├── xref.zig         # XRef table/stream parsing
├── pagetree.zig     # Page tree resolution
├── decompress.zig   # Stream decompression filters
├── encoding.zig     # Font encoding and CMap parsing
├── agl.zig          # Adobe Glyph List mappings
├── cff.zig          # CFF/Type1 font parsing
├── interpreter.zig  # Content stream interpreter
├── structtree.zig   # Structure tree parser (PDF/UA)
├── layout.zig       # Text layout and bounding boxes
└── simd.zig         # SIMD-accelerated parsing

python/zpdf/         # Python bindings (cffi)
examples/            # Usage examples
```

zpdf extracts text in logical reading order using a two-tier approach:
- Structure Tree (preferred): Uses the PDF's semantic structure for tagged/accessible PDFs (PDF/UA). Correctly handles multi-column layouts, sidebars, tables, and captions.
- Stream Order (fallback): When no structure tree exists, extracts text in PDF content stream order. This is fast and works well for most single-column documents.
| Method | Pros | Cons |
|---|---|---|
| Structure tree | Correct semantic order, handles complex layouts | Only works on tagged PDFs |
| Stream order | Fast, works on any PDF | May not match visual order for complex layouts |
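The two-tier approach can be sketched in a few lines of Python. This is a toy model under stated assumptions, not zpdf's code: a page is modelled as a list of `(mcid, text)` chunks in content stream order, and the structure tree as nested dicts whose integer leaves are marked-content IDs. The names `extract_text` and `walk_structure` are hypothetical.

```python
def walk_structure(node: dict, chunks_by_mcid: dict, out: list) -> None:
    """Depth-first walk; integer kids are marked-content IDs, dict kids are child elements."""
    for kid in node.get("kids", []):
        if isinstance(kid, int):
            out.append(chunks_by_mcid[kid])
        else:
            walk_structure(kid, chunks_by_mcid, out)


def extract_text(chunks: list, struct_tree=None) -> tuple:
    """Return (text, method). Prefer the structure tree, fall back to stream order."""
    if struct_tree is not None:
        out = []
        walk_structure(struct_tree, dict(chunks), out)
        return "\n".join(out), "structure tree"
    return "\n".join(text for _, text in chunks), "stream order"


# The content stream draws the sidebar first, but the structure tree
# places it after the heading and body text.
chunks = [(0, "Sidebar note"), (1, "Heading"), (2, "Body paragraph")]
tree = {
    "type": "Document",
    "kids": [
        {"type": "H1", "kids": [1]},
        {"type": "P", "kids": [2]},
        {"type": "Aside", "kids": [0]},
    ],
}

tagged_text, tagged_method = extract_text(chunks, tree)
untagged_text, untagged_method = extract_text(chunks)
```

With the tree, the text comes out as heading, body, sidebar. Without it, the text follows the order in which the content stream happened to draw it, which is why stream order can diverge from the visual order on complex layouts.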
| Feature | zpdf | pdfium | MuPDF |
|---|---|---|---|
| Text Extraction | | | |
| Stream order | Yes | Yes | Yes |
| Tagged/structure tree | Yes | No | Yes |
| Visual reading order | No | No | Yes |
| Word bounding boxes | Yes | Yes | Yes |
| Font Support | | | |
| WinAnsi/MacRoman | Yes | Yes | Yes |
| ToUnicode CMap | Yes | Yes | Yes |
| CID fonts (Type0) | Partial* | Yes | Yes |
| Compression | | | |
| FlateDecode, LZW, ASCII85/Hex | Yes | Yes | Yes |
| JBIG2, JPEG2000 | No | Yes | Yes |
| Other | | | |
| Encrypted PDFs | No | Yes | Yes |
| Rendering | No | Yes | Yes |
*CID fonts: Works when CMap is embedded directly.
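For context on the ToUnicode row, the sketch below shows how an embedded ToUnicode CMap maps character codes to Unicode text. It is a simplified Python illustration that handles only `beginbfchar` sections (not `beginbfrange`), and it does not reflect how zpdf parses CMaps; `parse_bfchar` and `decode_codes` are hypothetical names.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap: bytes) -> dict:
    """Map each source character code to its Unicode string (destination is UTF-16BE hex)."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap):
        for src, dst in HEX_PAIR.findall(block):
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[int(src, 16)] = text
    return mapping


def decode_codes(codes: list, mapping: dict) -> str:
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# A made-up CMap fragment: code 0x0192 maps to the two characters "fi".
sample_cmap = b"""
3 beginbfchar
<0003> <0020>
<0024> <0041>
<0192> <00660069>
endbfchar
"""

mapping = parse_bfchar(sample_cmap)
decoded = decode_codes([0x24, 0x03, 0x192], mapping)
```

One code can map to several Unicode characters, as with the ligature entry above, which is why the destination is decoded as a string rather than as a single code point.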
- Use zpdf for batch processing, tagged PDFs (PDF/UA), simple text extraction, and Zig integration.
- Use pdfium for browser integration, full PDF support, and proven stability.
- Use MuPDF for complex visual layouts or when rendering is needed.
CC0 - Public Domain