Stealth Crawler

An asynchronous, headless-Chrome web crawler that discovers same-host links and optionally saves HTML, Markdown, PDF snapshots, or PNG screenshots. Built for scripting, inspection, and automation; use it as a Python library or via the stealth-crawler CLI.



Features

  • Asynchronous, headless Chrome browsing via pydoll
  • Discovers internal links starting from a root URL
  • Optional content saving:
    • HTML
    • Markdown (via html2text)
    • PDF snapshots
    • PNG screenshots
  • Rich progress bars with rich
  • Configurable URL filtering (base, exclude)
  • Pure-Python API and CLI

Installation

Install the latest stable release:

pip install stealth-crawler

Or install it in an isolated environment with pipx:

pipx install stealth-crawler

Or via other tools:

  • uv

    uv venv .venv
    source .venv/bin/activate
    uv pip install stealth-crawler
  • Poetry

    poetry add stealth-crawler
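
After any of the above installs, you can confirm that the console script is on your PATH (the --help flag is described under Quickstart below):

stealth-crawler --help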

Quickstart

Command-Line

# Discover URLs only
stealth-crawler crawl https://example.com --urls-only

# Crawl and save HTML + Markdown
stealth-crawler crawl https://example.com \
  --save-html --save-md \
  --output-dir ./output

# Exclude specific paths
stealth-crawler crawl https://example.com \
  --exclude /private,/logout
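
A sketch combining --base and --urls-only (both listed under Configuration below) to limit discovery to a single subtree:

stealth-crawler crawl https://example.com/docs \
  --base https://example.com/docs \
  --urls-only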

Run stealth-crawler --help for full options.

Python API

import asyncio
from stealthcrawler import StealthCrawler

crawler = StealthCrawler(
    base="https://example.com",
    exclude=["/admin"],
    save_html=True,
    save_md=True,
    output_dir="export"
)
urls = asyncio.run(crawler.crawl("https://example.com"))
print(urls)
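
If you only need the discovered URLs, the urls_only parameter from the Configuration table can be passed instead of the save flags (a minimal sketch; it assumes, as in the example above, that crawl returns an iterable of URLs):

import asyncio
from stealthcrawler import StealthCrawler

# Discover URLs only; nothing is written to disk
crawler = StealthCrawler(base="https://example.com", urls_only=True)
urls = asyncio.run(crawler.crawl("https://example.com"))
for url in sorted(urls):
    print(url)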

Configuration

Option           CLI flag        API param     Default
Base URL(s)      --base          base          start URL
Exclude paths    --exclude       exclude       none
Save HTML        --save-html     save_html     False
Save Markdown    --save-md       save_md       False
URLs only        --urls-only     urls_only     False
Output folder    --output-dir    output_dir    ./output
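
The CLI flags and API parameters mirror each other; as a sketch (an assumption based on the Quickstart examples), comma-separated CLI values correspond to Python lists:

from stealthcrawler import StealthCrawler

# Roughly equivalent to: stealth-crawler crawl https://example.com --exclude /private,/logout
crawler = StealthCrawler(base="https://example.com", exclude=["/private", "/logout"])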

Testing & Quality

  • Run tests:

    pytest
  • Check formatting & linting:

    black src tests
    ruff check src tests

Contributing

  1. Fork the repository and create a feature branch.

  2. Set up your development environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"

    Or with uv:

    uv venv .venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
  3. Implement your changes, add tests, and run:

    black src tests
    ruff check src tests
    pytest
  4. Open a pull request against main.


License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). You are free to use, modify, and redistribute under the terms of the GPL. See LICENSE for full details.
