Pandoc Batch Processor

📚 Pandoc Batch Processor with Markdown Auto-Patching

Overview

The pandoc_batch.py script is a utility built into the write-book-template workflow for automating Pandoc conversions across all Markdown (.md) files in a book project.

It handles:

- Batch processing of all .md files (recursively from a root folder)
- Parallel conversions for faster builds
- Output path mirroring (maintains the same directory structure as the source)
- Optional auto-patching of Markdown before conversion to fix common Pandoc pitfalls
- Safe test mode (--test-only) for checking files without generating outputs

Why Auto-Patching?

Pandoc is strict about Markdown syntax. Certain patterns can cause build errors or unexpected rendering.
One frequent issue in book manuscripts is horizontal rules (---, ***, ___) immediately followed by text without a blank line:

---
*This text may be misinterpreted by Pandoc*

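This can break EPUB, PDF, and HTML builds because Pandoc may not treat the rule and the following text as separate blocks. For example, Pandoc can read a --- line that is not followed by a blank line as the start of a YAML metadata block.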

Solution: Always insert a blank line after thematic breaks:

---

*This is now correctly parsed as a new paragraph.*

The auto-patching system detects and fixes these issues automatically during the build.

Features

✅ 1. Recursive Batch Conversion

- Scans a root folder (default: manuscript/) for all .md files.
- Processes them in parallel (--jobs option) for speed.
- Supports output formats: epub, html, pdf, docx, odt, rtf.
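
The scan and the mirrored output layout described above could be sketched roughly as follows. This is an illustration only, not the script's actual code: the function names and the EXTENSIONS table are hypothetical, and the real tool may map formats to extensions differently.

```python
from pathlib import Path

# Hypothetical mapping from the --to format to an output file extension.
EXTENSIONS = {"epub": ".epub", "html": ".html", "pdf": ".pdf",
              "docx": ".docx", "odt": ".odt", "rtf": ".rtf"}


def collect_markdown(root: Path) -> list[Path]:
    """Return every .md file under root, sorted for a stable order."""
    return sorted(p for p in root.rglob("*.md") if p.is_file())


def mirrored_output(src: Path, root: Path, outdir: Path, to: str) -> Path:
    """Map root/a/b.md to outdir/a/b.<ext>, keeping the directory layout."""
    relative = src.relative_to(root)
    return (outdir / relative).with_suffix(EXTENSIONS[to])
```

Under these assumptions, manuscript/chapters/01.md with --to epub would map to output/chapters/01.epub.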

✅ 2. Configurable Defaults

- Reads default settings from [tool.pandoc_batch] in pyproject.toml.
- Allows running poetry run pandoc-batch with no extra flags.

Example pyproject.toml block:

[tool.pandoc_batch]
root = "manuscript"
outdir = "output"
to = "epub"
metadata_file = "config/metadata.yaml"
resource_path = ["assets"]
lang = "en"
jobs = 4
verbose = true
standalone = true
test_only = false
patch_md = true
fix_inplace = false

✅ 3. Auto-Patching System

By default (patch_md = true), the script will:

- Remove UTF-8 BOMs (Byte Order Marks) from the start of files.
- Normalize line endings to \n (Unix-style).
- Insert blank lines after thematic breaks if missing.
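
The first two steps are simple string operations. A minimal sketch, assuming the file has already been decoded as UTF-8 (the function names are hypothetical and not taken from the script):

```python
def strip_bom(text: str) -> str:
    """Drop a leading U+FEFF, which is how a UTF-8 BOM looks after decoding."""
    return text[1:] if text.startswith("\ufeff") else text


def normalize_newlines(text: str) -> str:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to LF."""
    # Replace CRLF first so a CRLF pair does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```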

Pattern detection:

Horizontal rules: ---, ***, ___

Regex used:

^(?:-{3}|(?:\*\s*){3}|(?:_\s*){3})\s*\n(?!\s*\n)
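
A pattern like this only works line by line when compiled with re.MULTILINE, so that ^ matches at the start of every line. The sketch below shows one way to apply it. It makes one deliberate change, which is an assumption and not necessarily what the script does: \s is replaced with [ \t]. In Python's re module \s also matches newlines, so the pattern exactly as written above appears to match a rule that is already followed by a blank line, and would then add a second one.

```python
import re

# Variant of the documented pattern: [ \t] replaces \s so the match stays on
# the rule's own line and cannot swallow a blank line that already exists.
HR_WITHOUT_BLANK = re.compile(
    r"^(?:-{3}|(?:\*[ \t]*){3}|(?:_[ \t]*){3})[ \t]*\n(?![ \t]*\n)",
    re.MULTILINE,
)


def add_blank_after_rules(text: str) -> str:
    """Insert one blank line after each thematic break that lacks one."""
    return HR_WITHOUT_BLANK.sub(lambda m: m.group(0) + "\n", text)
```

With this variant, running the function twice on the same text gives the same result as running it once, which matters if the patch is written back to the source files.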

Before/After Examples

| Before (will cause issues) | After (auto-patched) |
|---|---|
| `---`<br>`This book is not a conclusion.` | `---`<br><br>`This book is not a conclusion.` |
| `***`<br>`Chapter End` | `***`<br><br>`Chapter End` |
| `___`<br>`Text starts immediately` | `___`<br><br>`Text starts immediately` |

Workflow Diagram

 ┌──────────────────┐
 │  Markdown Files   │
 │  (manuscript/)    │
 └────────┬─────────┘
          │
          ▼
 ┌──────────────────┐
 │  Auto-Patch Step  │
 │  - Strip BOM      │
 │  - Normalize \n   │
 │  - Fix HR blocks  │
 └────────┬─────────┘
          │
          ▼
 ┌──────────────────┐
 │   Pandoc Convert  │
 │   - Format: EPUB  │
 │   - Resources     │
 │   - Metadata      │
 └────────┬─────────┘
          │
          ▼
 ┌──────────────────┐
 │   Output Files    │
 │  (output/...)     │
 └──────────────────┘

Usage

Basic Conversion

poetry run pandoc-batch

Uses settings from pyproject.toml.

Test Mode (No Output)

poetry run pandoc-batch --test-only

Checks all .md files for Pandoc parsing errors without writing output.

Change Output Format

poetry run pandoc-batch --to pdf --extra --pdf-engine xelatex

Run Without Auto-Patching

poetry run pandoc-batch --no-patch-md

Permanently Fix Markdown Files

poetry run pandoc-batch --fix-inplace

When used with --fix-inplace, the script writes patched Markdown back to the original files instead of only using a temporary patched copy for Pandoc.

What gets fixed:

- Adds a blank line after horizontal rules (---, ***, ___) if the next line is not blank.
- Removes any UTF-8 BOM at the start of the file.
- Normalizes all line endings to \n (Unix style).

Important notes:

- Only files where a change is actually detected are modified.
- If --- is not on a line by itself (for example, --- text), the line is left unchanged. This is intentional.
- To confirm that the patched Markdown builds cleanly before running --fix-inplace, run:
```
poetry run pandoc-batch --test-only
```
This tests the build against the in-memory patched version without altering your sources.
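
The "only modify files that actually change" behavior can be sketched as below. This is an illustration under stated assumptions, not the script's code: fix_inplace here is a hypothetical helper, and patch stands for whatever function applies the BOM, line-ending, and horizontal-rule fixes.

```python
from pathlib import Path
from typing import Callable


def fix_inplace(path: Path, patch: Callable[[str], str]) -> bool:
    """Rewrite path with patch(text) applied; return True only if it changed."""
    # newline="" disables newline translation, so \r\n survives the read
    # and the patch function can see and normalize it.
    with path.open("r", encoding="utf-8", newline="") as handle:
        original = handle.read()
    patched = patch(original)
    if patched == original:
        return False  # nothing detected, leave the file untouched
    with path.open("w", encoding="utf-8", newline="") as handle:
        handle.write(patched)
    return True
```

Comparing the patched text with the original before writing keeps modification times stable for files that were already clean, which is useful when the manuscript is under version control.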

Troubleshooting:
If --fix-inplace makes no changes when you expect it to, check that your horizontal rule is exactly on its own line with no extra text.

How It Works Internally

File Collection
- Finds all .md files under --root using Path.rglob().
Output Path Mapping
- Builds a mirrored path under --outdir with the correct file extension.
Auto-Patching (Optional)
- Reads the file in memory
- Runs the patching regex and BOM remover
- Writes the result to a temporary file (or in-place if --fix-inplace is set)
- Passes the patched file to Pandoc
Pandoc Invocation
- Uses flags from CLI or pyproject.toml
- Supports extra arguments via --extra
Parallel Execution
- Uses ThreadPoolExecutor to run multiple Pandoc processes simultaneously
- Job count is configurable with --jobs
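
The last two steps could look roughly like the sketch below. The names pandoc_command and run_parallel are hypothetical, and the runner is passed in as a parameter so the example stays independent of how the script actually launches Pandoc; only Pandoc's own --to and --output options are assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def pandoc_command(src: Path, dst: Path, to: str, extra: list[str]) -> list[str]:
    """Build an argument list using pandoc's --to and --output options."""
    return ["pandoc", str(src), "--to", to, "--output", str(dst), *extra]


def run_parallel(
    jobs: Iterable[list[str]],
    runner: Callable[[list[str]], int],
    max_workers: int = 4,
) -> list[int]:
    """Run each command through runner on a thread pool; keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order the jobs were submitted.
        return list(pool.map(runner, jobs))
```

In a real build the runner might be something like lambda cmd: subprocess.run(cmd).returncode. Threads are a reasonable fit here because each worker spends its time waiting on an external Pandoc process and not running Python code.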

Example Workflow

Prepare Manuscript
Place all .md files in manuscript/, with subfolders for front-matter/, chapters/, back-matter/.
Check for Pandoc Errors Without Output
```
poetry run pandoc-batch --test-only
```
Convert to EPUB (using defaults in pyproject.toml)
```
poetry run pandoc-batch
```

Convert to PDF (override defaults)

poetry run pandoc-batch --to pdf --extra --pdf-engine xelatex

Troubleshooting

Pandoc “withBinaryFile: does not exist” error
→ Usually caused by missing output directories. The script now creates them automatically.
Strange formatting after ---
→ Caused by missing blank lines; the auto-patch fixes this automatically.
Encoding errors
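→ Auto-patching removes UTF-8 BOMs and normalizes line endings. Pandoc expects UTF-8 input, so files saved in another encoding must still be converted to UTF-8.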
Want to disable patching for performance?
→ Run with --no-patch-md or set patch_md = false in pyproject.toml.

Summary

The enhanced pandoc_batch.py is designed for robust, automated, and error-resistant Pandoc builds in book projects.
Its auto-patching ensures consistent formatting and eliminates one of the most common causes of EPUB/PDF generation failures.

Tip: Keep patch_md enabled for all production builds. It is a safety net with almost no performance cost.
