mdbrowser

Convert any webpage to clean markdown, with Chrome cookie auth support.

Install

git clone git@github.com:fiale-plus/mdbrowser.git
cd mdbrowser && npm install

Usage

mdbrowser <url> [options]

Options

--render             Use headless Puppeteer for JS-rendered pages
--chrome             Use Chrome cookies for authentication
--profile <name>     Chrome profile name (default: "Default")
-o, --output <file>  Write to file instead of stdout
--raw                Full body markdown (skip article extraction)
--no-readability     Same as --raw (backward compat)
--meta               Prepend YAML frontmatter with page metadata
--json               Output structured JSON
--selector <css>     Extract elements matching a CSS selector (implies --render)
--batch              Process multiple URLs from args or stdin
--no-auto-render     Disable automatic render fallback on thin content
--no-cache           Bypass URL cache
--cache-ttl <sec>    Cache TTL in seconds (default: 900)
--timeout <ms>       Fetch/render timeout (default: 30000)
-h, --help           Show help
-v, --version        Show version

Examples

# Basic — fetch and convert to markdown
mdbrowser https://example.com

# Save to file
mdbrowser https://example.com -o page.md

# Authenticated page via Chrome cookies (macOS)
mdbrowser https://github.com/settings/profile --chrome

# JS-rendered SPA
mdbrowser https://some-spa.com --render

# YAML frontmatter with metadata
mdbrowser https://example.com --meta

# Structured JSON output
mdbrowser https://example.com --json

# Extract specific element
mdbrowser https://some-spa.com --selector "main"

# Batch mode — multiple URLs
echo "https://a.com
https://b.com" | mdbrowser --batch

# Raw body (skip article extraction)
mdbrowser https://example.com --raw

# Pipe to other tools
mdbrowser https://example.com | head -20

Interactive Mode

mdbrowser open <url> [options]      # Start browser session
mdbrowser click <ref|"text">        # Click element
mdbrowser type <ref|"text"> <value> # Type into input
mdbrowser read                      # Re-read page state
mdbrowser close                     # End session

How It Works

Mode              Flag          How it works
Basic             (default)     fetch() → Defuddle → Markdown
Chrome cookies    --chrome      Extracts cookies from Chrome's local DB and injects them into fetch
Headless render   --render      Puppeteer renders JS, then runs the same pipeline
Auto-render       (automatic)   If extracted content is under 50 words, retries with --render
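
The auto-render rule can be sketched as a simple word-count check. This is a minimal illustration, not mdbrowser's actual code: `countWords`, `needsRender`, and `THIN_CONTENT_WORDS` are hypothetical names, and counting whitespace-separated tokens is an assumed definition of "words". Only the 50-word threshold comes from this README.

```typescript
// Hypothetical constant name; the value 50 is the threshold stated in this README.
const THIN_CONTENT_WORDS = 50;

// Assumption: a "word" is any run of non-whitespace characters.
function countWords(markdown: string): number {
  const trimmed = markdown.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when a plain fetch produced thin content and a headless retry is worthwhile.
// `autoRender` models the --no-auto-render switch; `alreadyRendered` prevents retry loops.
function needsRender(
  markdown: string,
  autoRender: boolean = true,
  alreadyRendered: boolean = false
): boolean {
  return autoRender && !alreadyRendered && countWords(markdown) < THIN_CONTENT_WORDS;
}
```

A caller would run the basic pipeline first, then repeat it with rendering enabled only when `needsRender` returns true.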

Features

  • Content extraction via Defuddle: better article detection, metadata, and Markdown output
  • Metadata extraction: title, author, published date, domain, word count
  • YAML frontmatter (--meta): prepends metadata as YAML
  • JSON output (--json): structured output with all fields
  • Token estimation: an estimated token count is always printed to stderr
  • URL caching: file cache with a 15-minute TTL for repeated fetches (8x speedup)
  • Batch mode (--batch): process multiple URLs from args or stdin
  • CSS selector (--selector): extract specific page elements via Puppeteer
  • Auto-render fallback: thin content is automatically retried with headless rendering
  • Interactive sessions: open, click, type, and read pages with Puppeteer
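
As a rough sketch of how a TTL cache like the one listed above might decide between reusing and refetching a page: the README only states that the cache is file-based with a 900-second default TTL, so everything else here is an assumption. `CacheEntry`, `isFresh`, and `lookup` are hypothetical names, and an in-memory `Map` stands in for the on-disk store.

```typescript
// Hypothetical entry shape; the real on-disk format is not documented here.
interface CacheEntry {
  url: string;
  markdown: string;
  storedAtMs: number; // epoch milliseconds when the entry was written
}

const DEFAULT_TTL_SEC = 900; // 15 minutes, the documented --cache-ttl default

// An entry is usable while its age is non-negative and below the TTL.
function isFresh(entry: CacheEntry, nowMs: number, ttlSec: number = DEFAULT_TTL_SEC): boolean {
  const ageMs = nowMs - entry.storedAtMs;
  return ageMs >= 0 && ageMs < ttlSec * 1000;
}

// Returns cached markdown when fresh, otherwise undefined so the caller refetches.
function lookup(
  cache: Map<string, CacheEntry>,
  url: string,
  nowMs: number,
  ttlSec?: number
): string | undefined {
  const entry = cache.get(url);
  return entry !== undefined && isFresh(entry, nowMs, ttlSec) ? entry.markdown : undefined;
}
```

Passing the current time in as `nowMs` keeps the freshness check deterministic; a real caller would supply `Date.now()`, and --no-cache would simply skip `lookup`.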

Optional Dependencies

Core install is lightweight (defuddle, jsdom). Two features need extra packages:

# For --chrome (Chrome cookie extraction)
npm install better-sqlite3

# For --render (headless browser rendering)
npm install puppeteer

If you use --chrome or --render without the corresponding package, you'll get a clear error with the install command.
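
One way such a check could work is sketched below. This is an assumption about the approach, not the project's code: `OPTIONAL_PACKAGES` and `loadOptional` are hypothetical names, the error wording is illustrative, and the loader is injected so the sketch runs without either package installed.

```typescript
// Flag-to-package mapping taken from the install commands above.
const OPTIONAL_PACKAGES: Record<string, string> = {
  "--chrome": "better-sqlite3",
  "--render": "puppeteer",
};

// `load` is injected to keep the sketch testable; in Node it could wrap require().
function loadOptional(flag: string, load: (pkg: string) => unknown): unknown {
  const pkg = OPTIONAL_PACKAGES[flag];
  if (pkg === undefined) {
    throw new Error(`No optional package is associated with ${flag}`);
  }
  try {
    return load(pkg);
  } catch {
    // Replace the low-level module error with an actionable message.
    throw new Error(`${flag} requires "${pkg}". Install it with: npm install ${pkg}`);
  }
}
```

Deferring the load until the flag is actually used is what keeps the core install small.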

Chrome Cookie Auth (macOS)

The --chrome flag reads cookies from Chrome's local SQLite database and injects them as a Cookie header. This lets you fetch authenticated pages (GitHub, Google Docs, internal tools) without configuring API tokens.

  • Requires macOS (Keychain access for the encryption key)
  • Copies the DB to a temp file to avoid locking conflicts with Chrome
  • Handles the Chrome 130+ cookie format (DB version ≥ 24), in which decrypted values carry a domain hash prefix
  • First run triggers a Keychain access prompt — allow it once
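
The final injection step, turning already-decrypted cookies into a Cookie header, might look like the sketch below. It deliberately leaves out the SQLite read, Keychain access, and decryption. `Cookie`, `domainMatches`, and `buildCookieHeader` are hypothetical names, and the matching rule (a leading dot on the stored domain means subdomains are included) is an assumption about how the stored domain should be interpreted. Path, Secure, and expiry checks are omitted.

```typescript
// Hypothetical shape for a cookie after decryption.
interface Cookie {
  name: string;
  value: string;
  domain: string; // e.g. ".github.com" (domain cookie) or "github.com" (host-only)
}

// Assumption: a leading dot means the cookie also applies to subdomains;
// without it, the cookie is host-only and must match the host exactly.
function domainMatches(host: string, cookieDomain: string): boolean {
  if (cookieDomain.startsWith(".")) {
    const bare = cookieDomain.slice(1);
    return host === bare || host.endsWith(cookieDomain);
  }
  return host === cookieDomain;
}

// Joins matching cookies into a single Cookie header value: "a=1; b=2".
function buildCookieHeader(host: string, cookies: Cookie[]): string {
  return cookies
    .filter((c) => domainMatches(host, c.domain))
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
}
```

The resulting string would be sent as the `Cookie` request header on the fetch for that host. An empty string means no stored cookie applies.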

License

MIT
