A high-performance, asynchronous CLI tool that recursively crawls directories to find and convert document files using embedded Pandoc. Built with Rust for speed, safety, and reliability.
- 🚀 Blazing Fast: Asynchronous parallel conversion using Tokio
- 📦 Self-Contained: Embeds UPX-compressed Pandoc binary (Windows) - no external dependencies required
- 🔍 Smart Crawling: Recursively searches directories for files by extension
- 🛡️ Robust: Automatically fixes problematic filenames containing `$` or `~` characters
- 📊 Detailed Logging: Configurable verbosity levels (Error, Warn, Info, Debug, Trace)
- 🎯 Flexible Output: Convert in-place or to a custom output directory
- 📁 Media Extraction: Automatically extracts and organizes embedded media from documents
- ⚡ Efficient: Uses cranelift backend for faster compilation during development
- Rust 1.85+ (uses edition 2024, which was stabilized in Rust 1.85)
- Cargo
git clone https://github.com/MrDwarf7/document_conversion_crawler_rs.git
cd document_conversion_crawler_rs
cargo build --release

The compiled binary will be in target/release/document_conversion_crawler_rs
document_conversion_crawler_rs <INPUT_DIR> <INPUT_EXT> <OUTPUT_EXT> [OPTIONS]

- `<INPUT_DIR>` - Root directory to crawl for files
- `<INPUT_EXT>` - Input file extension to search for (e.g., `docx`, `.docx`)
- `<OUTPUT_EXT>` - Output format extension (e.g., `md`, `html`, `pdf`)

- `-o, --output <DIR>` - Custom output directory for converted files
- `-l, --level_verbosity <LEVEL>` - Logging verbosity (ERROR/0, WARN/1, INFO/2, DEBUG/3, TRACE/4)
  - Default: INFO
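The argument rules above (a level given by name or by number, an extension given with or without a leading dot) can be illustrated with a small standard-library sketch. The README says real parsing is done with clap, so `Level`, `parse_level`, and `normalize_ext` are hypothetical names, and the case-insensitive matching is an assumption rather than documented behavior.

```rust
/// Log levels in the order the CLI numbers them (ERROR = 0 ... TRACE = 4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Accepts either a level name ("DEBUG") or its numeric alias ("3").
/// Hypothetical helper; matching names case-insensitively is an assumption.
pub fn parse_level(raw: &str) -> Option<Level> {
    match raw.trim().to_ascii_uppercase().as_str() {
        "ERROR" | "0" => Some(Level::Error),
        "WARN" | "1" => Some(Level::Warn),
        "INFO" | "2" => Some(Level::Info),
        "DEBUG" | "3" => Some(Level::Debug),
        "TRACE" | "4" => Some(Level::Trace),
        _ => None,
    }
}

/// Strips one optional leading dot so that "docx" and ".docx" compare equal.
/// Lowercasing the result is an assumption, not documented behavior.
pub fn normalize_ext(raw: &str) -> String {
    let raw = raw.trim();
    raw.strip_prefix('.').unwrap_or(raw).to_ascii_lowercase()
}
```

With these helpers, `-l DEBUG` and `-l 3` would resolve to the same level, and both `docx` and `.docx` would select the same files.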
document_conversion_crawler_rs ./documents docx md
document_conversion_crawler_rs ./documents docx md -o ./converted
document_conversion_crawler_rs ./documents docx html -l DEBUG
document_conversion_crawler_rs ./documents docx pdf -l 3

- Initialization: The tool initializes the async runtime and logger
- Pandoc Setup: Extracts the embedded Pandoc binary to the system temp directory (Windows)
- Directory Crawling: Recursively walks the input directory tree
- Filename Sanitization: Fixes problematic filenames containing `$` or `~` characters
- File Discovery: Collects all files matching the input extension
- Parallel Conversion: Spawns async tasks to convert files concurrently
- Media Extraction: Creates `<filename>/media/` folders for extracted document media
- Output Organization: Places converted files in the output directory (if specified)
- Progress Reporting: Logs conversion progress and provides success statistics
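The crawling, sanitization, and discovery steps above can be sketched as follows. This is a minimal synchronous illustration using only `std::fs`, whereas the project itself lists walkdir for traversal. The names `sanitize_name` and `discover` are hypothetical, and replacing `$` and `~` with an underscore is an assumption, since the README does not say what the tool substitutes.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Replaces `$` and `~` in a file name with `_`.
/// The underscore is an assumption; the real replacement is not documented.
pub fn sanitize_name(name: &str) -> String {
    name.chars()
        .map(|c| if c == '$' || c == '~' { '_' } else { c })
        .collect()
}

/// Walks `root` recursively, renames files whose names contain `$` or `~`,
/// and returns the sorted paths whose extension equals `ext` (no leading dot).
pub fn discover(root: &Path, ext: &str) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let mut path = entry?.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            // Rename the file in place when its name needs cleaning.
            let name = path.file_name().and_then(|n| n.to_str()).map(str::to_owned);
            if let Some(name) = name {
                let clean = sanitize_name(&name);
                if clean != name {
                    let target = path.with_file_name(&clean);
                    fs::rename(&path, &target)?;
                    path = target;
                }
            }
            if path.extension().and_then(|e| e.to_str()) == Some(ext) {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Calling `discover(Path::new("./documents"), "docx")` would return every `.docx` file below `./documents`, with any problematic names already renamed on disk.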
src/
├── main.rs # Application entry point and orchestration
├── prelude.rs # Common imports, utilities, and pandoc embedding
├── error.rs # Custom error types using thiserror
├── cli.rs # Command-line argument parsing with clap
├── lazy_logger.rs # Buffered logger implementation
└── conversion/
    ├── mod.rs # Core conversion logic and file discovery
    └── pandoc.rs # Pandoc converter implementation
The Converter trait provides an abstraction for different conversion backends:
#[async_trait::async_trait]
pub trait Converter {
    async fn convert(&self, input: PathBuf, output: PathBuf) -> Result<()>;
    async fn check_installed(&self) -> Result<bool>;
    fn name(&self) -> impl AsRef<str>;
}

On Windows, the tool embeds a UPX-compressed Pandoc binary (~30MB → ~10MB) directly into the executable. On first run, it extracts the binary to:
<TEMP_DIR>/pandoc_upx.exe
This eliminates the need for users to install Pandoc separately.
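A rough sketch of the extract-on-first-run idea is shown below. It is not the project's code: `EMBEDDED_PANDOC` is a placeholder byte string standing in for what a real build would embed (for instance with `include_bytes!`), `extract_embedded` is a hypothetical helper, and using the file size to decide whether extraction can be skipped is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Stand-in for the embedded executable. A real build would embed the actual
/// binary at compile time; these bytes are only a placeholder.
pub static EMBEDDED_PANDOC: &[u8] = b"placeholder bytes, not a real binary";

/// Writes `bytes` to `dir/file_name` unless a file of the same size already
/// exists there, then returns the full path to the file.
pub fn extract_embedded(bytes: &[u8], dir: &Path, file_name: &str) -> io::Result<PathBuf> {
    let target = dir.join(file_name);
    // Assumption: a file with a matching size is treated as already extracted.
    let up_to_date = fs::metadata(&target)
        .map(|meta| meta.is_file() && meta.len() == bytes.len() as u64)
        .unwrap_or(false);
    if !up_to_date {
        fs::create_dir_all(dir)?;
        fs::write(&target, bytes)?;
    }
    Ok(target)
}
```

Under these assumptions, `extract_embedded(EMBEDDED_PANDOC, &std::env::temp_dir(), "pandoc_upx.exe")` would yield the path shown above, and later runs would reuse the extracted file instead of rewriting it.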
Each file conversion runs in a separate Tokio task, enabling parallel processing:
let tasks: Vec<_> = files
    .into_iter()
    .map(|file| tokio::task::spawn(async move {
        converter.convert(file, output).await
    }))
    .collect();

The tool supports any format that Pandoc supports, including:
Input Formats: docx, odt, epub, html, latex, markdown, rst, textile, org, and more
Output Formats: markdown, html, pdf, docx, epub, latex, rst, org, and more
See Pandoc's documentation for the complete list.
The tool provides detailed error messages for common issues:
- File Access: Permission denied, file not found
- Conversion Failures: Invalid input format, corrupted files
- Directory Issues: Cannot create output directories
- Pandoc Errors: Stderr output from Pandoc is captured and logged
Example error output:
ERROR: Failed to convert files due to: Pandoc conversion error, failed for: document.docx
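To show how an error like the one above might be modelled, here is a hand-written sketch. The project uses thiserror for its custom error types; this version writes the `Display` impl manually so it compiles without dependencies. `ConversionError`, its variants, and the wording of the directory message are hypothetical; only the Pandoc message text is taken from the example output.

```rust
use std::fmt;
use std::path::PathBuf;

/// Hypothetical error type covering two of the categories listed above.
#[derive(Debug)]
pub enum ConversionError {
    /// Pandoc exited unsuccessfully; `stderr` holds what it printed.
    Pandoc { file: PathBuf, stderr: String },
    /// The output directory could not be created.
    OutputDir { dir: PathBuf },
}

impl fmt::Display for ConversionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Pandoc { file, .. } => {
                write!(f, "Pandoc conversion error, failed for: {}", file.display())
            }
            Self::OutputDir { dir } => {
                write!(f, "Cannot create output directory: {}", dir.display())
            }
        }
    }
}

impl std::error::Error for ConversionError {}
```

Keeping the captured stderr as a field lets a logger print it at a more verbose level while the short message stays readable.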
- Concurrent Execution: Processes multiple files simultaneously
- Optimized Binary: Release builds use `opt-level = 3` and a single codegen unit
- Development Speed: Uses cranelift backend for faster compilation
- Efficient Dependencies: Minimal dependency tree focused on performance
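The concurrent execution mentioned above can be illustrated without an async runtime. The sketch below swaps Tokio tasks for scoped OS threads from the standard library, so it shows the fan-out and join pattern rather than the tool's actual scheduling. `convert_all` and its closure parameter are hypothetical names.

```rust
use std::path::{Path, PathBuf};
use std::thread;

/// Runs `convert` for every file on its own scoped thread and returns the
/// results in input order. Threads stand in for Tokio tasks in this sketch.
pub fn convert_all<F>(files: &[PathBuf], convert: F) -> Vec<Result<(), String>>
where
    F: Fn(&Path) -> Result<(), String> + Sync,
{
    // Share one reference to the closure between all worker threads.
    let convert = &convert;
    thread::scope(|scope| {
        let handles: Vec<_> = files
            .iter()
            .map(|file| scope.spawn(move || convert(file.as_path())))
            .collect();
        // Join in the same order the work was submitted.
        handles
            .into_iter()
            .map(|handle| {
                handle
                    .join()
                    .unwrap_or_else(|_| Err(String::from("conversion thread panicked")))
            })
            .collect()
    })
}
```

One thread per file is fine for a demonstration, but it scales poorly to thousands of files, which is one reason an async runtime or a bounded worker pool is the more practical choice.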
- Edition 2024: Uses the latest Rust edition
- Cranelift Backend: Fast compilation in development mode
- Optimized Dependencies: All dependencies compiled with `opt-level = 3`
The tool respects standard Rust environment variables:
- `RUST_LOG`: Override logging levels (e.g., `RUST_LOG=debug`)
- `RUST_BACKTRACE`: Enable backtraces on panic
The tool uses tracing and tracing-subscriber for structured logging:
- Line numbers and thread IDs included
- ANSI color support
- Configurable log levels per module
- Timestamp information available
Example output:
INFO: Found 42 files to convert
INFO: Running conversion for 42 files
DEBUG: Converting 'report.docx' to 'report.md'
INFO: Successfully converted all files
INFO: Processed a total of: 42 files
INFO: Successfully processed: 42 files
INFO: Success rate: 100.00%
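The closing statistics in this output can be reproduced with a few lines of arithmetic. This is a sketch only: `success_rate` and `summary_lines` are hypothetical names, and reporting `0.00%` when no files were processed is an assumption.

```rust
/// Percentage of successful conversions, formatted with two decimals.
/// Returning "0.00%" for an empty run is an assumption.
pub fn success_rate(succeeded: usize, total: usize) -> String {
    if total == 0 {
        return String::from("0.00%");
    }
    format!("{:.2}%", succeeded as f64 / total as f64 * 100.0)
}

/// Builds the three summary messages in the order they appear in the log.
pub fn summary_lines(succeeded: usize, total: usize) -> Vec<String> {
    vec![
        format!("Processed a total of: {total} files"),
        format!("Successfully processed: {succeeded} files"),
        format!("Success rate: {}", success_rate(succeeded, total)),
    ]
}
```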
- Embedded Pandoc binary is Windows-only (Linux/Mac users need Pandoc installed separately)
- Zipped output from conversion tasks may mismatch if there are fewer top-level folders than individual files
- File overwrites are skipped (warns if output exists)
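The skip-on-existing behavior in the last item might look roughly like the sketch below. `output_path` and `plan_conversion` are hypothetical helpers, the warning text is invented for illustration, and placing every converted file directly inside the output directory is an assumption about layout that the README does not confirm.

```rust
use std::path::{Path, PathBuf};

/// Builds the output path for `input` by swapping its extension and, when an
/// output directory is given, placing the file directly inside it.
pub fn output_path(input: &Path, out_dir: Option<&Path>, out_ext: &str) -> PathBuf {
    let renamed = input.with_extension(out_ext);
    if let (Some(dir), Some(name)) = (out_dir, renamed.file_name()) {
        return dir.join(name);
    }
    renamed
}

/// Returns `None` and prints a warning when the target already exists, so the
/// caller can skip the file instead of overwriting it.
pub fn plan_conversion(input: &Path, out_dir: Option<&Path>, out_ext: &str) -> Option<PathBuf> {
    let target = output_path(input, out_dir, out_ext);
    if target.exists() {
        eprintln!("WARN: skipping '{}': output already exists", target.display());
        return None;
    }
    Some(target)
}
```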
Contributions are welcome! Please feel free to submit a Pull Request.
# Clone the repository
git clone https://github.com/MrDwarf7/document_conversion_crawler_rs.git
cd document_conversion_crawler_rs
# Run with watch mode (requires cargo-watch)
cargo watch -q -c -w src/ -x run
# Run tests
cargo test
# Run with development optimizations
cargo build

Because the Pandoc binary is included directly, this project must be licensed under the GNU General Public License v2.0. All conditions of that license apply, as required by the Pandoc project.
- tokio: Async runtime
- clap: CLI argument parsing
- tracing: Structured logging
- walkdir: Directory traversal
- eyre: Error handling
- thiserror: Custom error types
- async-trait: Async trait support
For issues, feature requests, or questions, please open an issue on GitHub.