3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.
Updated Feb 10, 2026 - Rust
The Complete PDF Toolkit: extract, create, and edit PDFs. Rust core with Python bindings.
Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.
Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.
Convert scans of handwritten notes to PDF.
High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
📄 Ingest documents into structured datasets for LLMs, ensuring numeric integrity and easy export across multiple frameworks with doc2dataset.
PDF- and image-to-text converter with XFA form support. It extracts embedded text and/or renders pages into upscaled images for OCR to handle complex layouts and scans. Single static binary that reads stdin and writes stdout. Built for n8n, Power Automate, and containerized workflows.
High-performance Rust API with AI, multi-format docs, Gemini integration, security, CLI.