Pandoc Batch Processor

Asterios Raptis edited this page Aug 9, 2025 · 4 revisions

📚 Pandoc Batch Processor with Markdown Auto-Patching

Overview

The pandoc_batch.py script is a utility built into the write-book-template workflow for automating Pandoc conversions across all Markdown (.md) files in a book project.

It handles:

  • Batch processing of all .md files (recursively from a root folder)
  • Parallel conversions for faster builds
  • Output path mirroring (maintains the same directory structure as the source)
  • Optional auto-patching of Markdown before conversion to fix common Pandoc pitfalls
  • Safe test mode (--test-only) for checking files without generating outputs

Why Auto-Patching?

Pandoc is strict about Markdown syntax. Certain patterns can cause build errors or unexpected rendering.
One frequent issue in book manuscripts is horizontal rules (---, ***, ___) immediately followed by text without a blank line:

---
*This text may be misinterpreted by Pandoc*

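This can break EPUB, PDF, and HTML builds because Pandoc may not parse the --- line as a standalone horizontal rule. For example, a --- line followed directly by text can be read as the opening of a YAML metadata block.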

Solution: Always insert a blank line after thematic breaks:

---

*This is now correctly parsed as a new paragraph.*

The auto-patching system detects and fixes these issues automatically during the build.


Features

✅ 1. Recursive Batch Conversion

  • Scans a root folder (default: manuscript/) for all .md files.

  • Processes them in parallel (--jobs option) for speed.

  • Supports output formats: epub, html, pdf, docx, odt, rtf.

✅ 2. Configurable Defaults

  • Reads default settings from [tool.pandoc_batch] in pyproject.toml.

  • Allows running poetry run pandoc-batch with no extra flags.

Example pyproject.toml block:

[tool.pandoc_batch]
root = "manuscript"
outdir = "output"
to = "epub"
metadata_file = "config/metadata.yaml"
resource_path = ["assets"]
lang = "en"
jobs = 4
verbose = true
standalone = true
test_only = false
patch_md = true
fix_inplace = false

✅ 3. Auto-Patching System

By default (patch_md = true), the script will:

  1. Remove UTF-8 BOMs (Byte Order Marks) from the start of files.

  2. Normalize line endings to \n (Unix-style).

  3. Insert blank lines after thematic breaks if missing.

Pattern detection:

  • Horizontal rules: ---, ***, ___

  • Regex used:

    ^(?:-{3}|(?:\*\s*){3}|(?:_\s*){3})\s*\n(?!\s*\n)
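
The three patch steps can be sketched as follows, assuming Python's re module in MULTILINE mode. The function name patch_markdown is hypothetical and this is not the script's actual code. The pattern is a slightly stricter variant of the regex quoted above: it uses [ \t]* in place of \s* and a positive lookahead for content on the next line. In Python, \s also matches newlines, so the quoted pattern combined with a plain append-a-newline substitution would also match rules that already have a blank line after them.

```python
import re

# Variant of the quoted regex (an assumption, not the script's exact pattern):
# a rule on its own line, followed directly by a line that has content.
HR_WITHOUT_BLANK = re.compile(
    r"^(?:-{3}|(?:\*[ \t]*){3}|(?:_[ \t]*){3})[ \t]*\n(?=[ \t]*\S)",
    re.MULTILINE,  # let ^ match at the start of every line
)


def patch_markdown(text: str) -> str:
    """Apply the three documented fixes to Markdown held in memory."""
    text = text.removeprefix("\ufeff")                      # 1. strip a leading BOM
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # 2. normalize line endings
    # 3. re-emit each matched rule line followed by one extra newline
    return HR_WITHOUT_BLANK.sub(lambda m: m.group(0) + "\n", text)


patched = patch_markdown("\ufeff---\r\n*Chapter End*\r\n")
print(repr(patched))
```

Applying the function twice gives the same result as applying it once, which is what lets an in-place fix skip files that are already clean. The sketch does not special-case YAML front matter, whose opening --- is also followed directly by text.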

Before/After Examples

Example 1

  Before (will cause issues):

    ---
    *This book is not a conclusion.*

  After (auto-patched):

    ---

    *This book is not a conclusion.*

Example 2

  Before (will cause issues):

    ***
    **Chapter End**

  After (auto-patched):

    ***

    **Chapter End**

Example 3

  Before (will cause issues):

    ___
    Text starts immediately

  After (auto-patched):

    ___

    Text starts immediately

Workflow Diagram

 ┌───────────────────┐
 │  Markdown Files   │
 │  (manuscript/)    │
 └────────┬──────────┘
          │
          ▼
 ┌───────────────────┐
 │  Auto-Patch Step  │
 │  - Strip BOM      │
 │  - Normalize \n   │
 │  - Fix HR blocks  │
 └────────┬──────────┘
          │
          ▼
 ┌───────────────────┐
 │   Pandoc Convert  │
 │   - Format: EPUB  │
 │   - Resources     │
 │   - Metadata      │
 └────────┬──────────┘
          │
          ▼
 ┌───────────────────┐
 │   Output Files    │
 │  (output/...)     │
 └───────────────────┘

Usage

Basic Conversion

poetry run pandoc-batch

Uses settings from pyproject.toml.

Test Mode (No Output)

poetry run pandoc-batch --test-only

Checks all .md files for Pandoc parsing errors without writing output.

Change Output Format

poetry run pandoc-batch --to pdf --extra --pdf-engine xelatex

Run Without Auto-Patching

poetry run pandoc-batch --no-patch-md

Permanently Fix Markdown Files

poetry run pandoc-batch --fix-inplace

With --fix-inplace, the script writes patched Markdown back to the original files instead of passing only a temporary patched copy to Pandoc.

What gets fixed:

  • Adds a blank line after horizontal rules (---, ***, ___) if the next line is not blank.

  • Removes any UTF-8 BOM at the start of the file.

  • Normalizes all line endings to \n (Unix style).

Important notes:

  • Only files where a change is actually detected will be modified.

  • If your file has --- but it's not on a line by itself (e.g., --- text), it will not be changed; that's intentional.

  • If you want to confirm that the patched sources build cleanly before running --fix-inplace, you can run:

    poetry run pandoc-batch --test-only

    This tests the build against the in-memory patched version without altering your sources.
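
The "only modify files that actually change" rule above can be sketched as a compare-then-write step. The names fix_inplace and normalize_newlines are hypothetical, and the stand-in patch function only converts line endings; the real script applies its full patch set.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fix_inplace(path: Path, patch: Callable[[str], str]) -> bool:
    """Rewrite path only when patch() changes its text; return True if written."""
    # Bytes I/O avoids Python's newline translation, which would hide CRLF endings.
    original = path.read_bytes().decode("utf-8")
    patched = patch(original)
    if patched == original:
        return False  # untouched files keep their content and modification time
    path.write_bytes(patched.encode("utf-8"))
    return True


def normalize_newlines(text: str) -> str:
    """Stand-in patch step for the demo: converts CRLF to LF only."""
    return text.replace("\r\n", "\n")


with tempfile.TemporaryDirectory() as tmp:
    dirty = Path(tmp) / "dirty.md"
    clean = Path(tmp) / "clean.md"
    dirty.write_bytes(b"line one\r\nline two\r\n")
    clean.write_bytes(b"already fine\n")
    results = {p.name: fix_inplace(p, normalize_newlines) for p in (dirty, clean)}
    dirty_after = dirty.read_bytes()
    print(results)  # {'dirty.md': True, 'clean.md': False}
```

Returning a boolean per file makes it straightforward to report which sources were rewritten.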

Troubleshooting:
If --fix-inplace makes no changes when you expect it to, check that your horizontal rule is exactly on its own line with no extra text.


How It Works Internally

  1. File Collection
    Finds all .md files under --root using Path.rglob().

  2. Output Path Mapping
    Builds a mirrored path under --outdir with the correct file extension.

  3. Auto-Patching (Optional)

    • Reads the file into memory

    • Runs the patching regex and BOM remover

    • Writes the result to a temporary file (or in-place if --fix-inplace is set)

    • Passes the patched file to Pandoc

  4. Pandoc Invocation

    • Uses flags from CLI or pyproject.toml

    • Supports extra arguments via --extra

  5. Parallel Execution

    • Uses ThreadPoolExecutor to run multiple Pandoc processes simultaneously

    • Job count is configurable with --jobs
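
Steps 1, 2, 4, and 5 above can be sketched together as shown below; the patching step is left out to keep the example short. All function names are hypothetical, the flag selection in build_command is an assumption about how the script might call Pandoc, and the output format is left for Pandoc to infer from the output file's extension. The demo injects a stub runner in place of subprocess.run so that it executes without Pandoc installed.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def collect_sources(root: Path) -> list[Path]:
    """Step 1: every .md file under root, in a stable order."""
    return sorted(root.rglob("*.md"))


def map_output(src: Path, root: Path, outdir: Path, to: str) -> Path:
    """Step 2: mirror src's path relative to root under outdir, swapping the suffix."""
    return (outdir / src.relative_to(root)).with_suffix(f".{to}")


def build_command(src: Path, dst: Path, metadata_file=None,
                  resource_path=(), standalone=True, extra=()) -> list[str]:
    """Step 4: assemble a pandoc argument list (hypothetical flag selection)."""
    cmd = ["pandoc", str(src), "--from", "markdown", "--output", str(dst)]
    if standalone:
        cmd.append("--standalone")
    if metadata_file:
        cmd += ["--metadata-file", str(metadata_file)]
    if resource_path:
        # Pandoc separates search paths with the platform's path separator.
        cmd += ["--resource-path", os.pathsep.join(resource_path)]
    return cmd + list(extra)


def convert_all(root: Path, outdir: Path, to: str, jobs: int = 4,
                run=subprocess.run) -> list[tuple[Path, int]]:
    """Step 5: run one conversion per file on a thread pool."""
    def convert(src: Path) -> tuple[Path, int]:
        dst = map_output(src, root, outdir, to)
        dst.parent.mkdir(parents=True, exist_ok=True)  # avoid missing-directory errors
        result = run(build_command(src, dst), capture_output=True, text=True)
        return src, result.returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(convert, collect_sources(root)))


def stub_run(cmd, **kwargs):
    """Demo stand-in for subprocess.run that always reports success."""
    return subprocess.CompletedProcess(cmd, 0)


with tempfile.TemporaryDirectory() as tmp:
    root, outdir = Path(tmp, "manuscript"), Path(tmp, "output")
    (root / "chapters").mkdir(parents=True)
    (root / "chapters" / "01-intro.md").write_text("# Intro\n", encoding="utf-8")
    results = convert_all(root, outdir, "epub", jobs=2, run=stub_run)
    mirrored = (outdir / "chapters").is_dir()
    print([(src.name, code) for src, code in results], mirrored)
```

Threads are sufficient here because each worker spends its time waiting on an external Pandoc process rather than running Python code.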


Example Workflow

  1. Prepare Manuscript
    Place all .md files in manuscript/, with subfolders such as front-matter/, chapters/, and back-matter/.

  2. Check for Pandoc Errors Without Output

    poetry run pandoc-batch --test-only

  3. Convert to EPUB (using defaults in pyproject.toml)

    poetry run pandoc-batch

  4. Convert to PDF (override defaults)

    poetry run pandoc-batch --to pdf --extra --pdf-engine xelatex

Troubleshooting

  • Pandoc “withBinaryFile: does not exist” error
    → Usually caused by missing output directories. The script now creates them automatically.

  • Strange formatting after ---
    → Caused by missing blank lines; the auto-patch fixes this automatically.

  • Encoding errors
    → Auto-patching removes UTF-8 BOMs. Files saved in another encoding still need converting to UTF-8, which is what Pandoc expects as input.

  • Want to disable patching for performance?
    → Run with --no-patch-md or set patch_md = false in pyproject.toml.


Summary

The enhanced pandoc_batch.py is designed for robust, automated, and error-resistant Pandoc builds in book projects.
Its auto-patching ensures consistent formatting and eliminates one of the most common causes of EPUB/PDF generation failures.

Tip: Keep patch_md enabled for all production builds; it's a safety net that costs almost no performance.