Ultra-fast web scraper with Playwright that converts HTML pages to clean Markdown format. Designed for documentation archiving and RAG (Retrieval-Augmented Generation) workflows.
Works seamlessly in:
- Interactive terminals (bash, zsh, fish, etc.) with TUI menu
- Claude Code / AI coding assistants (auto-detects non-interactive mode)
- CI/CD pipelines and automated workflows
- High Performance: Averages 0.8-1.2 seconds per URL using Playwright
- Clean Output: Removes navigation menus, duplicated tables of contents, and unnecessary elements
- Smart Filenames: Generates short, readable filenames from URL paths
- Hybrid Mode: Automatically detects interactive terminal vs non-interactive environments
- Resource Blocking: Blocks 10+ resource patterns (ads, analytics, tracking) for faster scraping
- Markdown Optimization: Single table of contents preserved for RAG applications
- Batch Processing: Process multiple URLs sequentially with progress tracking
- Node.js 18+ - Download from nodejs.org or install via package manager:
```shell
# Ubuntu/Debian
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

# macOS (with Homebrew)
brew install node

# Or use nvm (recommended)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
nvm install 18
```
- npm or yarn (included with Node.js)
```shell
# Clone the repository
git clone https://github.com/isSpicyCode/scrappe-tout.git
cd scrappe-tout

# Install dependencies
npm install

# Install Playwright browsers
npx playwright install chromium
```

- Add the URLs to scrape in the `urls.txt` file (one URL per line):

```
https://example.com/docs/getting-started
https://example.com/docs/installation
https://example.com/docs/architecture
```
- Run the scraper:

```shell
npm start
```

Progress display:

```
[1/30] [100%] [████████████████████████████] https://example.com/docs/getting-started (1s)
[2/30] [100%] [████████████████████████████] https://example.com/docs/installation (1s)
[3/30] [100%] [████████████████████████████] https://example.com/docs/architecture (1s)
```
Each URL shows its individual progress bar reaching 100%, then moves to the next line.
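A progress line like the ones above can be rendered with a small helper. This is a hypothetical sketch, not the project's actual `src/utils/display.js`:

```javascript
// Sketch: build one progress line with a fixed-width bar (28 cells, as shown above).
function renderProgressLine(index, total, percent, url, seconds) {
  const width = 28;
  const filled = Math.round((percent / 100) * width);
  const bar = "█".repeat(filled) + " ".repeat(width - filled);
  return `[${index}/${total}] [${percent}%] [${bar}] ${url} (${seconds}s)`;
}
```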
When running in an interactive terminal, the behavior depends on whether a `scrap-folder-name.txt` file exists:

If `scrap-folder-name.txt` exists:
- The folder name from the file is used automatically
- The menu is bypassed

If no `scrap-folder-name.txt` file exists:
- A TUI menu appears for folder selection:
```
============================================================
FOLDER MENU - Select an option
============================================================
Existing folders:
1. my-files-docs
2. scrap [default]
Options:
N - Create new folder (default: "scrap")
D - Use default folder
X - Delete existing folder
Q - Quit
============================================================
Your choice:
```
Options:
- N: Create a new folder with custom name
- D: Use default "scrap" folder
- 1, 2, ...: Use existing folder
- X: Delete an existing folder
- Q: Quit
In non-interactive environments (Claude Code, CI/CD), the folder name is determined automatically.

Priority order:

- `scrap-folder-name.txt` file (if it exists):

  ```shell
  # Create folder name file
  echo "my-documentation" > scrap-folder-name.txt

  # Run scraper
  npm start
  ```

- Auto-generated timestamp (if no file exists):
  - Automatically creates a timestamped folder (e.g., `scrap-2026-02-13T10-49-30`)
  - Ensures no conflicts between runs
The scraper determines the folder name in this order:

- 🥇 `--name` flag (if provided) - Highest priority
- 🥈 `scrap-folder-name.txt` file (if it exists) - Works in both interactive and non-interactive modes
- 🥉 Interactive menu (if terminal is interactive AND no file exists)
- ⏰ Timestamp (fallback if nothing else is available)
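The priority chain can be sketched as a small resolver. The function and option names below are illustrative assumptions, not the project's actual API:

```javascript
// Sketch of the documented folder-name priority chain; names are illustrative.
function resolveFolderName({ nameFlag, fileContent, menuChoice, now = new Date() } = {}) {
  if (nameFlag) return nameFlag;              // 🥇 --name flag wins
  if (fileContent) return fileContent.trim(); // 🥈 scrap-folder-name.txt
  if (menuChoice) return menuChoice;          // 🥉 interactive menu selection
  // ⏰ fallback: timestamped folder, e.g. scrap-2026-02-13T10-49-30
  return `scrap-${now.toISOString().replace(/[:.]/g, "-").slice(0, 19)}`;
}
```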
```shell
# Specify custom output directory
npm start -- --output-dir /path/to/output

# Show help
npm start -- --help
```

Output structure:

```
captures/
├── my-files-docs/
│   ├── inspector.md
│   ├── memory.md
│   ├── performance.md
│   └── ...
└── scrap-2026-02-13T10-49-30/
    ├── ui.md
    ├── components.md
    └── ...
```
Each URL generates a single Markdown file with:
- Clean content (no navigation, ads, or clutter)
- Preserved table of contents for RAG applications
- Short filename based on the last URL path segment
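The "last URL path segment" rule can be sketched with the WHATWG URL API. This is illustrative only; the real naming logic lives in `src/core/writer.js`, and the `"index"` fallback is an assumption:

```javascript
// Sketch: take the last path segment, strip any extension, append .md.
function filenameFromUrl(rawUrl) {
  const { pathname } = new URL(rawUrl);
  const segments = pathname.split("/").filter(Boolean);
  const last = segments.pop() || "index"; // assumption: root URLs fall back to "index"
  return `${last.replace(/\.[a-z0-9]+$/i, "")}.md`;
}
```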
The scraper automatically blocks these resource patterns for faster loading:
- Analytics and tracking scripts
- Advertisement networks
- Social media widgets
- Fonts and stylesheets from CDNs
- Images (can be enabled in config)
Edit `src/core/scraper.js` to modify the `RESOURCE_PATTERNS` array.
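Pattern-based blocking in Playwright typically goes through `page.route()`. A minimal sketch follows; the pattern list here is illustrative, not the project's real `RESOURCE_PATTERNS`:

```javascript
// Illustrative pattern list; the project keeps its real list in src/core/scraper.js.
const BLOCKED_PATTERNS = ["google-analytics", "doubleclick", "facebook.net", "hotjar"];

function shouldBlock(url) {
  return BLOCKED_PATTERNS.some((pattern) => url.includes(pattern));
}

// Abort matching requests before they hit the network; let everything else through.
async function applyBlocking(page) {
  await page.route("**/*", (route) =>
    shouldBlock(route.request().url()) ? route.abort() : route.continue()
  );
}
```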
| Metric | Value |
|---|---|
| Average per URL | 0.8-1.2 seconds |
| HTML compression | 80-95% size reduction |
| Conversion speed | 10-20x faster than single-file scripts |
| Processing mode | Sequential (prevents rate limiting) |
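Sequential processing means each URL is awaited to completion before the next one starts. An illustrative loop, not the project's actual `pipeline.js`:

```javascript
// Sketch: await each scrape before starting the next, so only one request is in flight.
async function processSequentially(urls, scrapeOne) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await scrapeOne(urls[i], i + 1, urls.length));
  }
  return results;
}
```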
scrappe-tout/
├── src/
│ ├── core/
│ │ ├── scraper.js # Playwright scraping logic
│ │ ├── converter.js # HTML to Markdown conversion
│ │ ├── writer.js # File writing with smart naming
│ │ ├── postprocessor.js # Content cleaning
│ │ ├── logger.js # Logging utilities
│ │ └── navigation-cleaner.js # Navigation patterns and removal
│ ├── services/
│ │ ├── config.js # Configuration management
│ │ ├── retry.js # Retry logic with exponential backoff
│ │ ├── error.js # Error handling
│ │ ├── urls.js # URL reading and validation
│ │ ├── pipeline.js # Scraping pipeline orchestration
│ │ └── path.js # Output directory management
│ ├── utils/
│ │ ├── cli.js # Command-line argument parsing
│ │ ├── menu.js # Interactive TUI menu
│ │ ├── display.js # Duration formatting and progress bar
│ │ ├── stats.js # Statistics generation
│ │ ├── timestamp.js # Timestamp utilities
│ │ └── constants.js # Application constants
│ └── index.js # Main entry point (orchestration only)
├── tests/
│ ├── e2e/
│ │ └── scraping.test.js # End-to-end workflow tests
│ ├── unit/
│ │ ├── cli.test.js # CLI parser tests
│ │ ├── display.test.js # Display utilities tests
│ │ └── stats.test.js # Statistics tests
│ └── fixtures/
│ └── test-urls.txt # Sample URLs for testing
├── urls.txt # URLs to scrape (one per line)
├── scrap-folder-name.txt # Custom output folder name
└── package.json
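The retry service listed above (`src/services/retry.js`) uses exponential backoff; a generic sketch of the pattern follows. Names, attempt counts, and delays are assumptions, not the project's actual values:

```javascript
// Sketch: retry an async operation, doubling the delay after each failure.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === attempts) throw err; // out of attempts: surface the error
      const delay = baseDelayMs * 2 ** (attempt - 1); // 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```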
Built with:
- Playwright - Fast and reliable web automation
- mdream - HTML to Markdown conversion
- Biome - Linting and formatting
```shell
npx playwright install chromium

# Install required dependencies
sudo npx playwright install-deps chromium
```

- Check that URLs in `urls.txt` are accessible
- Some sites may require authentication or block automated access
- Try adding delays or reducing concurrency for rate-limited sites

```shell
# Reinstall dependencies
rm -rf node_modules package-lock.json
npm install
```

MIT License - see LICENSE file for details.
Copyright (c) 2026 Spicycode - Scrappe-Tout Contributors