Supports any Medium publication with intelligent post discovery and modern visual interface.

BehindTheStack/medium-scrap

Universal Medium Scraper — Complete Edition

This repository provides a CLI tool to scrape posts from Medium publications and custom domains (for example, engineering blogs). It follows Clean Architecture and provides:

  • Intelligent discovery (auto-discovery, known IDs, fallbacks)
  • Support for custom domains and usernames
  • A rich CLI (progress indicators and formatted output via Rich)
  • YAML-based configuration for sources and bulk collections

This README is written to be fully reproducible: installation, configuration, commands, testing and troubleshooting are covered below.

Table of contents

  • Overview
  • Quick start
  • Configuration (medium_sources.yaml)
  • CLI usage and examples
  • How it works (high-level)
  • Tests
  • Troubleshooting
  • Project layout
  • Contributing and license

Overview

  • Entry point: main.py (calls src.presentation.cli.cli())
  • Config file: medium_sources.yaml (expected in repo root)
  • Default output folder: outputs/
  • Python: 3.9+ (see pyproject.toml)

This tool resolves publication definitions, discovers post IDs (auto-discovery or fallbacks), fetches post details via adapters, and presents results in table/JSON/IDs formats.

Quick start

  1. Create and activate a virtual environment (Linux/macOS):
python -m venv .venv
source .venv/bin/activate
  2. Install editable package (recommended):
pip install --upgrade pip
pip install -e .
  3. Verify CLI is available and view help:
python main.py --help

Notes:

  • Main dependencies are declared in pyproject.toml (httpx, click, rich, pyyaml, pytest, requests, etc.).
  • Installing in editable mode lets you make code changes without reinstalling.

Configuration (medium_sources.yaml)

The file medium_sources.yaml configures named sources and bulk collections. The included example in the repo contains many predefined keys (netflix, nytimes, airbnb, etc.).

Minimal example structure:

sources:
  netflix:
    type: publication
    name: netflix
    description: "Netflix Technology Blog"
    auto_discover: true
    custom_domain: false

defaults:
  limit: 50
  skip_session: true
  format: json
  output_dir: "outputs"

bulk_collections:
  tech_giants:
    description: "Major tech blogs"
    sources: [netflix, airbnb]

Key notes:

  • type: publication or username.
  • name: publication name, domain, or @username. If type is username and name lacks @, the code will add it.
  • custom_domain: set to true for domains like open.nytimes.com.
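
The key notes above can be sketched as a small normalization step. This is a minimal illustration, not the code in source_manager.py: the `SourceEntry` dataclass, the `normalize_source` function, and the default values are all assumptions made for the example.

```python
from dataclasses import dataclass

VALID_TYPES = {"publication", "username"}


@dataclass(frozen=True)
class SourceEntry:
    """Hypothetical in-memory view of one entry under `sources:`."""
    type: str
    name: str
    description: str = ""
    auto_discover: bool = True
    custom_domain: bool = False


def normalize_source(raw: dict) -> SourceEntry:
    """Validate the type and add a leading '@' to usernames that lack one."""
    source_type = raw.get("type", "publication")  # assumed default
    if source_type not in VALID_TYPES:
        raise ValueError(f"unknown source type: {source_type!r}")

    name = str(raw["name"]).strip()
    if source_type == "username" and not name.startswith("@"):
        name = "@" + name

    return SourceEntry(
        type=source_type,
        name=name,
        description=str(raw.get("description", "")),
        auto_discover=bool(raw.get("auto_discover", True)),   # assumed default
        custom_domain=bool(raw.get("custom_domain", False)),  # assumed default
    )
```

Passing `{"type": "username", "name": "alice"}` yields an entry named `@alice`, while publication names are left untouched.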

Use python main.py --list-sources to list keys and descriptions from the YAML file.

CLI usage and examples

Run from project root or after installing into your venv.

  • List configured sources:
python main.py --list-sources
  • Scrape a configured source and save JSON:
python main.py --source nytimes --limit 20 --format json --output outputs/nytimes_posts.json
  • Scrape a publication directly:
python main.py --publication netflix --limit 5 --format table
  • Bulk collection (group from YAML):
python main.py --bulk tech_giants --limit 10 --format json
  • Auto-discovery and skip session (production-ready):
python main.py --publication pinterest --auto-discover --skip-session --format json --output results.json
  • Custom post IDs (comma-separated). Each must be exactly 12 alphanumeric characters:
python main.py --publication netflix --custom-ids "ac15cada49ef,64c786c2a3ac" --format json
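
The 12-character rule for custom IDs can be illustrated with a short parser. This is a sketch under assumptions: `parse_custom_ids` is a hypothetical helper, and it treats "alphanumeric" as ASCII letters and digits; the project's own PostId validation may differ in detail.

```python
import re

# Exactly 12 ASCII letters or digits (assumed reading of "alphanumeric").
POST_ID_PATTERN = re.compile(r"[0-9A-Za-z]{12}")


def parse_custom_ids(raw: str) -> list[str]:
    """Split a comma-separated string and validate each post ID."""
    ids = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue  # tolerate trailing commas and stray spaces
        if POST_ID_PATTERN.fullmatch(candidate) is None:
            raise ValueError(f"invalid post ID: {candidate!r}")
        ids.append(candidate)
    return ids
```

Validating before any network call means a mistyped ID fails fast instead of producing an empty result later.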

Flags summary:

  • -p, --publication TEXT
  • -s, --source TEXT
  • -b, --bulk TEXT
  • --list-sources
  • -o, --output TEXT
  • -f, --format [table|json|ids]
  • --custom-ids TEXT (comma-separated)
  • --skip-session
  • --limit INTEGER
  • --all (collect all posts)
  • -m, --mode [ids|metadata|full|technical] # presets that control enrichment and output
  • --index # create/update a simple inverted index (JSON) for search

Managing sources via CLI

You can add or update sources directly from the CLI using the add-source subcommand. This writes to medium_sources.yaml and is useful when you want to quickly register a publication without editing the YAML manually.

Example — add Pinterest:

python main.py add-source \
	--key pinterest \
	--type publication \
	--name pinterest \
	--description "Pinterest Engineering" \
	--auto-discover

Notes:

  • add-source persists the change to medium_sources.yaml in the repo root.
  • The command is implemented to avoid loading optional network adapters, so it can run even if dependencies like httpx are not installed.
  • To see the result, run python main.py --list-sources.

Interactive behavior and safety

  • If the source key you pass already exists in medium_sources.yaml, the CLI will ask for confirmation before overwriting. This prevents accidental data loss when updating an existing source.

    Example (interactive prompt shown):

     $ python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering"
     Source 'pinterest' already exists. Overwrite? [y/N]: y
     ✅ Source 'pinterest' added/updated in medium_sources.yaml
    
  • To skip the interactive prompt and assume confirmation, use --yes (or -y):

     python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering" --yes
  • The CLI subcommand writes a normalized YAML entry (ensures booleans and required keys). It creates the sources block if it does not exist.

  • After adding/updating a source you can:

    • run python main.py --list-sources to see the configured keys and descriptions; or
    • open medium_sources.yaml to inspect the persisted entry.
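
The overwrite check and normalization described above might look roughly like the following. It is a sketch that works on an already-loaded config dict and leaves YAML reading and writing out; the function signature and the `confirm` callback are assumptions and do not reproduce SourceConfigManager.add_or_update_source.

```python
from typing import Callable


def add_or_update_source(
    config: dict,
    key: str,
    entry: dict,
    *,
    assume_yes: bool = False,
    confirm: Callable[[str], bool] = lambda prompt: False,
) -> bool:
    """Insert or replace `key` under config['sources']; return True if written."""
    # Create the sources block if it does not exist yet.
    sources = config.setdefault("sources", {})

    # Ask before overwriting unless the caller passed the equivalent of --yes.
    if key in sources and not assume_yes:
        if not confirm(f"Source '{key}' already exists. Overwrite?"):
            return False

    # Write a normalized entry: required keys present, booleans coerced.
    sources[key] = {
        "type": entry["type"],
        "name": entry["name"],
        "description": entry.get("description", ""),
        "auto_discover": bool(entry.get("auto_discover", False)),
        "custom_domain": bool(entry.get("custom_domain", False)),
    }
    return True
```

In this sketch the default `confirm` declines, so an existing entry is only replaced when the caller opts in explicitly.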

Implementation notes (for maintainers)

  • The add-source subcommand avoids importing network adapters (e.g. httpx) when invoked so it can be used on systems where optional runtime dependencies are not installed.
  • The command is implemented in src/presentation/cli.py and uses SourceConfigManager.add_or_update_source (in src/infrastructure/config/source_manager.py) to persist changes.

How it works (high-level)

  1. The CLI bootstraps concrete adapters and repositories (e.g. MediumApiAdapter, InMemoryPublicationRepository, MediumSessionRepository).
  2. It creates domain services: PostDiscoveryService, PublicationConfigService.
  3. ScrapePostsUseCase orchestrates the flow: resolve config, initialize session (unless skipped), handle custom IDs or auto-discovery, collect posts.
  4. The use case returns ScrapePostsResponse containing Post entities. The CLI formats and optionally saves the response.
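
The four steps above can be sketched with plain callables standing in for the adapters. The class names mirror the ones mentioned in this README, but every field, constructor argument, and method signature here is an assumption made for illustration; see src/application/use_cases/scrape_posts.py for the real implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Post:
    id: str
    title: str


@dataclass
class ScrapePostsRequest:
    publication: str
    limit: int = 50
    skip_session: bool = True
    custom_ids: Optional[list[str]] = None


@dataclass
class ScrapePostsResponse:
    posts: list[Post] = field(default_factory=list)
    success: bool = True


class ScrapePostsUseCase:
    """Hypothetical orchestration: session, ID resolution, then fetching."""

    def __init__(
        self,
        discover_ids: Callable[[str, int], list[str]],
        fetch_post: Callable[[str], Post],
        init_session: Callable[[], None],
    ) -> None:
        self._discover_ids = discover_ids
        self._fetch_post = fetch_post
        self._init_session = init_session

    def execute(self, request: ScrapePostsRequest) -> ScrapePostsResponse:
        try:
            if not request.skip_session:
                self._init_session()
            # Custom IDs take precedence over auto-discovery.
            post_ids = request.custom_ids or self._discover_ids(
                request.publication, request.limit
            )
            posts = [self._fetch_post(pid) for pid in post_ids[: request.limit]]
            return ScrapePostsResponse(posts=posts)
        except Exception:
            # Errors are swallowed and surfaced as an empty response.
            return ScrapePostsResponse(posts=[], success=False)
```

Injecting the collaborators as callables keeps the sketch testable without network access, which is the main benefit of the layered design described above.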

Tests

Run all tests:

python -m pytest tests/ -v

Run only unit or integration tests:

python -m pytest tests/unit/ -v
python -m pytest tests/integration/ -v

Coverage reports are configured in pyproject.toml and generate htmlcov/.

Test coverage

Latest test run (local): TOTAL coverage 84%.

  • HTML report: htmlcov/index.html (generated by pytest-cov)
  • XML report: coverage.xml

Notes:

  • Coverage is computed with pytest-cov. The HTML report lives in htmlcov/ after running pytest --cov=src --cov-report=html.
  • Some adapter branches remain partially covered; see src/infrastructure/adapters/medium_api_adapter.py for areas to target next.
  • If you want a live badge that updates automatically, add CI with coverage upload (Codecov or Coveralls). See the Add CI with GitHub Actions task in the project TODO.

Troubleshooting & important notes

  • Missing medium_sources.yaml: SourceConfigManager raises FileNotFoundError when you run with --source or --list-sources and the file is not in the repo root.
  • Custom IDs validation: PostId requires exactly 12 alphanumeric characters; invalid IDs will raise a validation error.
  • Empty result / errors: the use case catches exceptions and returns an empty response; the CLI prints helpful troubleshooting tips. Use logging or run in a development environment to debug further.
  • Output directory: default outputs/ (CLI will create it if missing).

Files to inspect when extending or debugging

  • main.py — entry point
  • src/presentation/cli.py — CLI orchestration, formatting and progress UI
  • src/application/use_cases/scrape_posts.py — main use case
  • src/domain/entities/publication.py — domain entities
  • src/infrastructure/config/source_manager.py — YAML loader
  • src/infrastructure/adapters/medium_api_adapter.py — adapter for external API logic
  • src/infrastructure/content_extractor.py — HTML → Markdown conversion, code extraction and heuristics-based classifier
  • src/infrastructure/persistence.py — persisting Markdown, JSON metadata and assets
  • src/infrastructure/indexer.py — simple inverted index writer used by persistence

Contributing

The repository already includes CONTRIBUTING.md with development workflow, testing and coding standards. Please follow it when contributing.

License

MIT — see LICENSE for details.


Technical extraction (HTML → Markdown & artifacts)

This project now includes a technical extraction pipeline that converts full post HTML into:

  • A Markdown rendering of the post content
  • Extracted assets (images and other linked files), downloaded locally
  • A JSON metadata file per post that includes extracted code blocks and a lightweight "technical" classifier
  • An optional inverted index (index.json) for simple token-based search

These artifacts are created when the CLI runs with --mode full or --mode technical. Markdown files are written when the chosen output --format includes md, and the index is updated when --index is passed.

Typical outputs (when using --format md and --index):

  • outputs/<source_key>/<post_id>.md — Markdown content
  • outputs/<source_key>/<post_id>.json — Metadata (title, authors, date, code_blocks, classifier, original URL)
  • outputs/<source_key>/assets/<post_id>/<asset_filename> — downloaded assets referenced from the post
  • outputs/index.json — simple inverted index mapping tokens to posts (updated when --index is requested)
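
The layout above can be expressed as a small path helper. This is only a sketch of the directory convention listed in this section; `artifact_paths` is a hypothetical function and is not taken from src/infrastructure/persistence.py.

```python
from pathlib import Path


def artifact_paths(output_dir, source_key, post_id, asset_filenames=()):
    """Return where each artifact for one post would be written."""
    base = Path(output_dir)
    source_dir = base / source_key
    return {
        "markdown": source_dir / f"{post_id}.md",
        "metadata": source_dir / f"{post_id}.json",
        "assets": [
            source_dir / "assets" / post_id / name for name in asset_filenames
        ],
        # The index is shared by all sources under the output root.
        "index": base / "index.json",
    }
```

Computing all paths in one place makes it easier to change the layout later without touching the code that writes each file.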

Example usage:

python main.py --source pinterest --mode technical --format md --output outputs/pinterest --index

Notes for maintainers and contributors

  • The HTML→Markdown conversion lives in src/infrastructure/content_extractor.py. It uses BeautifulSoup and markdownify when available and falls back to conservative HTML cleaning heuristics otherwise.
  • Code block extraction and language detection are implemented with heuristics and a Pygments-based fallback; results appear under the code_blocks key in the per-post JSON metadata.
  • The lightweight classifier is intentionally heuristic for now (presence/volume of code blocks, presence of technical keywords). A machine-learning classifier can be plugged in later — the extractor and persistence functions accept optional hooks for that.
  • Persistence (writing .md, .json, downloading assets and updating the index) is implemented in src/infrastructure/persistence.py and uses src/infrastructure/indexer.py to maintain index.json.
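
To make the heuristic classifier mentioned above concrete, here is a minimal sketch that scores a post by its code-block count and by technical keywords found in the text. The keyword list, the scoring rule, and the `min_score` threshold are illustrative assumptions; the actual heuristics in content_extractor.py may differ.

```python
import re

# Illustrative keyword list, not the project's real vocabulary.
TECHNICAL_KEYWORDS = frozenset(
    {"api", "latency", "kubernetes", "database", "algorithm", "deployment"}
)


def classify_post(text: str, code_blocks: list[str], *, min_score: int = 2) -> dict:
    """Score = number of code blocks + number of distinct keyword hits."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    keyword_hits = sorted(tokens & TECHNICAL_KEYWORDS)
    score = len(code_blocks) + len(keyword_hits)
    return {
        "is_technical": score >= min_score,
        "score": score,
        "keyword_hits": keyword_hits,
        "code_block_count": len(code_blocks),
    }
```

Returning the intermediate features alongside the verdict keeps the result easy to inspect and leaves room for the optional ML hook described above.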

Security and safety

  • Filenames for assets are sanitized before writing to disk. The persistence layer will not overwrite existing files unless explicitly allowed.
  • The index is intentionally simple and file-based; for larger collections consider migrating to a proper search engine (Elasticsearch, Meilisearch, SQLite FTS, etc.).
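
For readers unfamiliar with the structure, a file-based inverted index of the kind described above can be sketched in a few lines. The tokenizer, the function names, and the JSON shape (token mapped to a sorted list of post IDs) are assumptions for this example and may not match src/infrastructure/indexer.py.

```python
import json
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def update_index(index: dict, post_id: str, text: str) -> dict:
    """Add post_id to the postings list of every token in text (idempotent)."""
    for token in set(tokenize(text)):
        postings = index.setdefault(token, [])
        if post_id not in postings:
            postings.append(post_id)
            postings.sort()
    return index


def search(index: dict, query: str) -> list[str]:
    """Return IDs of posts containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


def dump_index(index: dict) -> str:
    """Serialize the index in a stable, diff-friendly form."""
    return json.dumps(index, indent=2, sort_keys=True)
```

Every update rewrites the whole mapping, which is fine for small collections and is the reason a dedicated search engine becomes attractive as the corpus grows.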

Where to start when improving extraction

  1. Improve Markdown fidelity: tweak content_extractor.html_to_markdown() and add post-processing steps for link rewriting and image src normalization.
  2. Harden asset handling: dedupe asset downloads, support remote storage backends, and canonicalize filenames.
  3. Replace the heuristic classifier with an ML model: extractor already records code_blocks and features to make training easier.
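
As a starting point for item 2, asset filename canonicalization and download deduplication could look like the sketch below. Both helpers are hypothetical; the character whitelist, the hash-based fallback name, and the length limit are assumptions, not the rules currently used by the persistence layer.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

# Anything outside this whitelist is collapsed into a single underscore.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_asset_filename(url: str, max_length: int = 100) -> str:
    """Derive a safe local filename from an asset URL."""
    # Keep only the last path segment so '../' sequences cannot escape.
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    name = _UNSAFE.sub("_", name).strip("._")
    if not name:
        # No usable segment: fall back to a short hash of the URL.
        name = "asset_" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    # Naive truncation; a fuller version would preserve the extension.
    return name[:max_length]


def dedupe_urls(urls):
    """Drop repeated URLs while preserving first-seen order."""
    return list(dict.fromkeys(urls))
```

Two different URLs can still map to the same sanitized name, so a production version would likely add a hash suffix or check for collisions before writing.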
