Markdown Scraper 🚀

A Python web scraper that converts web pages into clean GitHub Flavored Markdown (GFM). It is suited to archival, research, LLM context gathering, and technical documentation.

✨ Key Features

- Intelligent Extraction: Uses heuristics to identify the main content area (article, main, etc.) while stripping away navigation, ads, and footers.
- High-Fidelity GFM: Produces clean Markdown including tables, code blocks (with language detection), and links.
- Static & Dynamic Support:
  - Fast static scraping using BeautifulSoup4 & requests.
  - Robust dynamic scraping for JS-heavy SPAs using Playwright.
- Metadata Extraction: Automatically captures title, author, date, and descriptions from JSON-LD, OpenGraph, and standard meta tags.
- Multiple Interfaces:
  - CLI: Powerful command-line tool for terminal workflows.
  - Web UI: Lightweight Flask-based interface for browser-based scraping.
  - Python Library: Clean API for integration into your own Python projects.
- Highly Configurable: Toggle image stripping, link removal, and more.
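
The metadata extraction listed above can be pictured as a fallback chain across sources. The following is a minimal sketch using only the standard library; it is not the project's actual implementation. The names `MetaCollector` and `extract_title`, and the precedence order (JSON-LD `headline`, then `og:title`, then `<title>`), are assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects JSON-LD blocks, OpenGraph tags, and the <title> text."""

    def __init__(self):
        super().__init__()
        self.json_ld = []     # parsed JSON-LD objects
        self.open_graph = {}  # e.g. {"og:title": "..."}
        self.title = ""
        self._in_json_ld = False
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.open_graph[attrs["property"]] = attrs.get("content") or ""

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)
        elif self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed JSON-LD rather than failing the scrape
        elif tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the best available title, or None if no source provides one."""
    collector = MetaCollector()
    collector.feed(html)
    # Assumed precedence: structured data first, plain <title> last.
    for block in collector.json_ld:
        if isinstance(block, dict) and block.get("headline"):
            return block["headline"]
    return collector.open_graph.get("og:title") or collector.title.strip() or None
```

The same pattern would extend to author, date, and description by checking each source in turn and keeping the first non-empty value.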

🛠️ Tech Stack

Backend: Python 3.12+
Scraping: BeautifulSoup4, Playwright, requests, lxml
Conversion: markdownify
CLI: Click
Web UI: Flask + Pico.css
Testing: pytest, pytest-cov, pytest-flask

🚀 Getting Started

Prerequisites

Python 3.12+ and Poetry installed on your system.

Installation

git clone https://github.com/yourusername/md-scraper.git
cd md-scraper
poetry install

# For global CLI access (optional but recommended):
pip install --editable .

Note for Termux users: If developing on Android/Termux, install Chromium via pkg install chromium and set the CHROMIUM_PATH environment variable. Playwright installation via pip is not supported on Termux; use the --server option (see below) or the Web UI for dynamic scraping.

Usage

Command Line Interface (CLI)

The CLI is available via the scraper command (if installed globally) or via poetry run scraper:

# Scrape a static page to stdout
scraper scrape https://example.com/article

# Scrape and save to a file
scraper scrape https://example.com/article -o my_article.md

# Remote Scraping (Recommended for Termux/Dynamic sites)
# Offloads the scraping (including Playwright/JS rendering) to your deployed server
scraper scrape https://react-site.com --server https://scraper-751660269987.us-central1.run.app

# Dynamic scraping (Local - requires Playwright installed)
scraper scrape https://react-site.com --dynamic

# Strip specific tags (e.g., remove all images and links)
scraper scrape https://example.com -s img -s a
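
When driving the CLI from another Python program, it can help to assemble the argument list in one place. The sketch below uses only the flags shown in the commands above (`-o`, `--server`, `--dynamic`, `-s`); the helper name `build_scrape_command` and its parameters are hypothetical, and whether every flag combination is accepted by the CLI is not confirmed here.

```python
import shlex


def build_scrape_command(url, output=None, dynamic=False, server=None, strip_tags=()):
    """Build an argv list for `scraper scrape` from the documented flags."""
    cmd = ["scraper", "scrape", url]
    if output:
        cmd += ["-o", output]
    if server:
        cmd += ["--server", server]
    if dynamic:
        cmd.append("--dynamic")
    for tag in strip_tags:
        cmd += ["-s", tag]  # -s is repeated once per tag, as in the example above
    return cmd


if __name__ == "__main__":
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(build_scrape_command("https://example.com", strip_tags=["img", "a"])))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with URLs that contain `&` or `?`.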

Interactive Batch Script (`scraper-go`)

An interactive Bash script is included for easy batch processing and file management.

./scraper-go.sh

Features:

Interactive Input: Accepts a single URL, multiple URLs (space-separated), or a .txt batch file.
Auto-Naming: Automatically generates filenames based on the page title (H1/H2).
Batch Organization: Options to save all outputs to a specific directory.
Remote Enabled: Configured to use your deployed Cloud Run instance by default.
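
The auto-naming step can be illustrated in Python, although the script itself is written in Bash. This is a rough sketch of one way to turn a page title into a safe filename; the function name `filename_from_title`, the `untitled` fallback, and the 80-character limit are assumptions, not the script's actual behavior.

```python
import re
import unicodedata


def filename_from_title(title, fallback="untitled", max_length=80):
    """Turn a page title into a lowercase, hyphen-separated .md filename."""
    # Decompose accented characters, then drop anything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    slug = slug[:max_length].rstrip("-")
    return f"{slug or fallback}.md"
```

Falling back to a fixed name when the title is empty keeps a batch run from producing a file called `.md`.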

Web Interface

Start the local development server:

poetry run python src/md_scraper/web/app.py

Open your browser and navigate to http://127.0.0.1:5000.

Cloud Deployment

The application is live on Google Cloud Run: https://scraper-751660269987.us-central1.run.app

It exposes a REST API at /api/scrape for remote CLI usage.
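
The request schema for /api/scrape is not documented in this README, so the following is only a sketch of what a client call might look like. The POST method and the JSON fields `url` and `dynamic` are assumptions; check the Flask route in src/md_scraper/web/ for the real contract before relying on it. The helper name `build_api_request` is hypothetical.

```python
import json
import urllib.request


def build_api_request(server, url, dynamic=False):
    """Build (but do not send) a request to the server's /api/scrape endpoint."""
    endpoint = server.rstrip("/") + "/api/scrape"
    # Assumed payload shape: the real API may use different field names.
    payload = json.dumps({"url": url, "dynamic": dynamic}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request)` and decoding the response body, whose format is likewise not specified here.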

As a Python Library

from md_scraper.scraper import Scraper

scraper = Scraper()
result = scraper.scrape("https://example.com/blog-post", dynamic=False)

print(result['metadata']['title'])
print(result['markdown'])
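
A common next step is to save the result with its metadata attached. The sketch below assumes only what the example above shows, namely that the result is a dict with `metadata` and `markdown` keys; the helper name `render_document` and the front-matter layout are illustrative choices, not part of the library.

```python
import json


def render_document(result):
    """Prefix the Markdown body with a simple front-matter block."""
    metadata = result.get("metadata", {})
    lines = ["---"]
    for key in sorted(metadata):
        value = metadata[key]
        if value:  # skip fields the scraper could not fill
            # json.dumps gives a double-quoted, escaped string.
            lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result["markdown"].strip() + "\n"


if __name__ == "__main__":
    sample = {"metadata": {"title": "Hello"}, "markdown": "# Hello\n\nBody\n"}
    print(render_document(sample))
```

Iterating over whatever keys are present avoids hard-coding metadata field names that this README does not specify.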

🧪 Testing

The test suite covers unit and integration tests and runs under pytest, optionally with a coverage report.

# Run all tests
poetry run pytest

# Run with coverage report
poetry run pytest --cov=md_scraper

🏗️ Architecture

src/md_scraper/scraper.py: Core Scraper class logic.
src/md_scraper/cli.py: CLI entry point and command definitions.
src/md_scraper/web/: Flask application and Jinja2 templates.
tests/: Unit and integration tests.

🤝 Contributing

Contributions are welcome! Before submitting, please:

Follow the existing code style (Google-style docstrings).
Add tests for any new features or bug fixes.
Confirm that the full test suite passes.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
