This repository provides a CLI tool to scrape posts from Medium publications and custom domains (for example, engineering blogs). It follows Clean Architecture and offers:
- Intelligent discovery (auto-discovery, known IDs, fallbacks)
 - Support for custom domains and usernames
 - A rich CLI (progress indicators and formatted output via Rich)
 - YAML-based configuration for sources and bulk collections
 
This README is written to be fully reproducible: installation, configuration, commands, testing and troubleshooting are covered below.
Table of contents
- Overview
 - Quick start
 - Configuration (medium_sources.yaml)
 - CLI usage and examples
 - How it works (high-level)
 - Tests
 - Troubleshooting
 - Project layout
 - Contributing and license
 
- Entry point: `main.py` (calls `src.presentation.cli.cli()`)
- Config file: `medium_sources.yaml` (expected in the repo root)
- Default output folder: `outputs/`
- Python: 3.9+ (see `pyproject.toml`)
This tool resolves publication definitions, discovers post IDs (auto-discovery or fallbacks), fetches post details via adapters, and presents results in table/JSON/IDs formats.
- Create and activate a virtual environment (Linux/macOS):

```bash
python -m venv .venv
source .venv/bin/activate
```

- Install the package in editable mode (recommended):

```bash
pip install --upgrade pip
pip install -e .
```

- Verify the CLI is available and view help:

```bash
python main.py --help
```

Notes:
- Main dependencies are declared in `pyproject.toml` (httpx, click, rich, pyyaml, pytest, requests, etc.).
- Installing in editable mode lets you make code changes without reinstalling.
 
The file medium_sources.yaml configures named sources and bulk collections. The included example in the repo contains many predefined keys (netflix, nytimes, airbnb, etc.).
Minimal example structure:
```yaml
sources:
  netflix:
    type: publication
    name: netflix
    description: "Netflix Technology Blog"
    auto_discover: true
    custom_domain: false

defaults:
  limit: 50
  skip_session: true
  format: json
  output_dir: "outputs"

bulk_collections:
  tech_giants:
    description: "Major tech blogs"
    sources: [netflix, airbnb]
```

Key notes (illustrated by the sketch after this list):
- `type`: either `publication` or `username`.
- `name`: publication name, domain, or `@username`. If `type` is `username` and `name` lacks the `@`, the code will add it.
- `custom_domain`: set to `true` for domains like `open.nytimes.com`.
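To show how these fields combine, here is a minimal, self-contained sketch of resolving a source entry to a base URL. It is not the project's actual code; the helper name `resolve_source_url` and the exact URL shapes are assumptions for illustration.

```python
# Illustrative only: the helper name and URL shapes are assumptions,
# not the repository's actual implementation.
from typing import TypedDict


class SourceEntry(TypedDict):
    type: str            # "publication" or "username"
    name: str            # publication slug, domain, or @username
    custom_domain: bool


def resolve_source_url(entry: SourceEntry) -> str:
    """Turn a medium_sources.yaml entry into a base URL to scrape."""
    name = entry["name"]
    if entry["type"] == "username":
        # The README notes the code adds a missing '@' for usernames.
        if not name.startswith("@"):
            name = f"@{name}"
        return f"https://medium.com/{name}"
    if entry.get("custom_domain"):
        # e.g. open.nytimes.com is served on its own domain.
        return f"https://{name}"
    return f"https://medium.com/{name}"


print(resolve_source_url({"type": "publication", "name": "netflix", "custom_domain": False}))
# -> https://medium.com/netflix
```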
Use `python main.py --list-sources` to list the keys and descriptions from the YAML file.
Run it from the project root, or after installing the package into your venv.
- List configured sources:

```bash
python main.py --list-sources
```

- Scrape a configured source and save JSON:

```bash
python main.py --source nytimes --limit 20 --format json --output outputs/nytimes_posts.json
```

- Scrape a publication directly:

```bash
python main.py --publication netflix --limit 5 --format table
```

- Bulk collection (group from YAML):

```bash
python main.py --bulk tech_giants --limit 10 --format json
```

- Auto-discovery and skip session (production-ready):

```bash
python main.py --publication pinterest --auto-discover --skip-session --format json --output results.json
```

- Custom post IDs (comma-separated). Each must be exactly 12 alphanumeric characters:

```bash
python main.py --publication netflix --custom-ids "ac15cada49ef,64c786c2a3ac" --format json
```

Flags summary:
- `-p, --publication TEXT`
- `-s, --source TEXT`
- `-b, --bulk TEXT`
- `--list-sources`
- `-o, --output TEXT`
- `-f, --format [table|json|ids]`
- `--custom-ids TEXT` (comma-separated)
- `--skip-session`
- `--limit INTEGER`
- `--all` (collect all posts)
- `-m, --mode [ids|metadata|full|technical]` (presets that control enrichment and output)
- `--index` (create/update a simple inverted index, stored as JSON, for search)
You can add or update sources directly from the CLI using the add-source subcommand. This writes to medium_sources.yaml and is useful when you want to quickly register a publication without editing the YAML manually.
Example — add Pinterest:
```bash
python main.py add-source \
  --key pinterest \
  --type publication \
  --name pinterest \
  --description "Pinterest Engineering" \
  --auto-discover
```

Notes:
- `add-source` persists the change to `medium_sources.yaml` in the repo root.
- The command is implemented to avoid loading optional network adapters, so it can run even if dependencies like `httpx` are not installed.
- To see the result, run `python main.py --list-sources`.
Interactive behavior and safety
- If the source key you pass already exists in `medium_sources.yaml`, the CLI asks for confirmation before overwriting. This prevents accidental data loss when updating an existing source (a minimal sketch of this pattern follows this list). Example (interactive prompt shown):

```text
$ python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering"
Source 'pinterest' already exists. Overwrite? [y/N]: y
✅ Source 'pinterest' added/updated in medium_sources.yaml
```

- To skip the interactive prompt and assume confirmation, use `--yes` (or `-y`):

```bash
python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering" --yes
```

- The subcommand writes a normalized YAML entry (it ensures booleans and required keys) and creates the `sources` block if it does not exist.
- After adding or updating a source, run `python main.py --list-sources` to see the configured keys and descriptions, or open `medium_sources.yaml` to inspect the persisted entry.
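The confirm-or-overwrite behavior can be built with Click's prompt helpers. The snippet below is a minimal sketch of that pattern, not the repository's actual code; only the `--key` and `--yes` option names come from this README.

```python
# Minimal sketch of the confirm-before-overwrite pattern using Click.
# Illustrative only; not the repository's actual implementation.
import click


@click.command(name="add-source")
@click.option("--key", required=True)
@click.option("--yes", "-y", is_flag=True, help="Skip the confirmation prompt.")
def add_source(key: str, yes: bool) -> None:
    existing_keys = {"pinterest"}  # stand-in for keys loaded from medium_sources.yaml
    if key in existing_keys and not yes:
        # click.confirm exits with a non-zero code when the user answers no.
        click.confirm(f"Source '{key}' already exists. Overwrite?", abort=True)
    click.echo(f"Source '{key}' added/updated in medium_sources.yaml")


if __name__ == "__main__":
    add_source()
```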
Implementation notes (for maintainers)
- The `add-source` subcommand avoids importing network adapters (e.g. `httpx`) when invoked, so it can be used on systems where optional runtime dependencies are not installed.
- The command is implemented in `src/presentation/cli.py` and uses `SourceConfigManager.add_or_update_source` (in `src/infrastructure/config/source_manager.py`) to persist changes.
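The persistence step amounts to a read-modify-write of the YAML file. Below is a minimal sketch of what a method like `add_or_update_source` could look like, assuming only `pyyaml`; the field names mirror the schema shown earlier, but the body is illustrative rather than the actual implementation.

```python
# Illustrative sketch of persisting a source entry to medium_sources.yaml.
# Field names follow the schema shown earlier; implementation details are assumed.
from pathlib import Path

import yaml


def add_or_update_source(config_path: Path, key: str, entry: dict) -> None:
    """Insert or replace sources[key] in the YAML config file."""
    data = {}
    if config_path.exists():
        data = yaml.safe_load(config_path.read_text(encoding="utf-8")) or {}
    # Create the `sources` block if it does not exist yet.
    data.setdefault("sources", {})[key] = {
        "type": entry.get("type", "publication"),
        "name": entry["name"],
        "description": entry.get("description", ""),
        "auto_discover": bool(entry.get("auto_discover", False)),
        "custom_domain": bool(entry.get("custom_domain", False)),
    }
    config_path.write_text(yaml.safe_dump(data, sort_keys=False), encoding="utf-8")
```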
- The CLI bootstraps concrete adapters and repositories (e.g. `MediumApiAdapter`, `InMemoryPublicationRepository`, `MediumSessionRepository`).
- It creates the domain services `PostDiscoveryService` and `PublicationConfigService`.
- `ScrapePostsUseCase` orchestrates the flow: resolve the config, initialize a session (unless skipped), handle custom IDs or auto-discovery, and collect posts (see the sketch after this list).
- The use case returns a `ScrapePostsResponse` containing `Post` entities. The CLI formats the response and optionally saves it.
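The orchestration can be pictured as a small pipeline. The sketch below follows the bullets above; the class names come from this README, but the method names and signatures are assumptions, not the project's actual API.

```python
# Pseudocode-style sketch of the ScrapePostsUseCase flow described above.
# Class names come from this README; method names and signatures are assumed.
from dataclasses import dataclass, field


@dataclass
class ScrapePostsResponse:
    posts: list = field(default_factory=list)


class ScrapePostsUseCase:
    def __init__(self, config_service, discovery_service, api_adapter, session_repo):
        self.config_service = config_service
        self.discovery_service = discovery_service
        self.api_adapter = api_adapter
        self.session_repo = session_repo

    def execute(self, source_key, custom_ids=None, skip_session=False, limit=50):
        publication = self.config_service.resolve(source_key)          # 1. resolve config
        if not skip_session:
            self.session_repo.initialize()                              # 2. session (optional)
        post_ids = custom_ids or self.discovery_service.discover(       # 3. custom IDs or auto-discovery
            publication, limit=limit
        )
        posts = [self.api_adapter.fetch_post(pid) for pid in post_ids]  # 4. collect posts
        return ScrapePostsResponse(posts=posts)
```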
Run all tests:

```bash
python -m pytest tests/ -v
```

Run only unit or integration tests:

```bash
python -m pytest tests/unit/ -v
python -m pytest tests/integration/ -v
```

Coverage reports are configured in `pyproject.toml` and generate `htmlcov/`.
Latest test run (local): TOTAL coverage 84%.
- HTML report: `htmlcov/index.html` (generated by pytest-cov)
- XML report: `coverage.xml`

Notes:
- Coverage is computed with pytest-cov. The HTML report lives in `htmlcov/` after running `pytest --cov=src --cov-report=html`.
- Some adapter branches remain partially covered; see `src/infrastructure/adapters/medium_api_adapter.py` for areas to target next.
- If you want a live badge that updates automatically, add CI with coverage upload (Codecov or Coveralls). See the "Add CI with GitHub Actions" task in the project TODO.
- Missing `medium_sources.yaml`: `SourceConfigManager` raises `FileNotFoundError` when calling `--source` or `--list-sources`.
- Custom IDs validation: `PostId` requires exactly 12 alphanumeric characters; invalid IDs raise a validation error (see the sketch after this list).
- Empty result / errors: the use case catches exceptions and returns an empty response; the CLI prints helpful troubleshooting tips. Use logging or run in a development environment to debug further.
- Output directory: defaults to `outputs/` (the CLI creates it if missing).
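For reference, the 12-character constraint can be expressed as a simple regex check. This is a minimal sketch of such a value object, assuming a plain `ValueError` on invalid input; it is not the project's actual `PostId` implementation.

```python
# Minimal sketch of a 12-character alphanumeric post ID check.
# Not the project's actual PostId class; the error type is assumed.
import re
from dataclasses import dataclass

POST_ID_PATTERN = re.compile(r"^[A-Za-z0-9]{12}$")


@dataclass(frozen=True)
class PostId:
    value: str

    def __post_init__(self) -> None:
        if not POST_ID_PATTERN.fullmatch(self.value):
            raise ValueError(
                f"Invalid post ID: {self.value!r} (expected exactly 12 alphanumeric characters)"
            )


PostId("ac15cada49ef")    # OK
# PostId("not-valid!")    # raises ValueError
```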
- `main.py` — entry point
- `src/presentation/cli.py` — CLI orchestration, formatting and progress UI
- `src/application/use_cases/scrape_posts.py` — main use case
- `src/domain/entities/publication.py` — domain entities
- `src/infrastructure/config/source_manager.py` — YAML loader
- `src/infrastructure/adapters/medium_api_adapter.py` — adapter for external API logic
- `src/infrastructure/content_extractor.py` — HTML → Markdown conversion, code extraction and heuristics-based classifier
- `src/infrastructure/persistence.py` — persisting Markdown, JSON metadata and assets
- `src/infrastructure/indexer.py` — simple inverted index writer used by persistence
The repository already includes CONTRIBUTING.md with development workflow, testing and coding standards. Please follow it when contributing.
MIT — see LICENSE for details.
This project now includes a technical extraction pipeline that converts full post HTML into:
- A Markdown rendering of the post content
 - Extracted assets (images and other linked files), downloaded locally
 - A JSON metadata file per post that includes extracted code blocks and a lightweight "technical" classifier
 - An optional inverted index (`index.json`) for simple token-based search
These artifacts are created when the CLI runs with `--mode full` or `--mode technical`, and when the chosen `--format` includes `md` or `--index` is passed.
Typical outputs (when using `--format md` and `--index`):
- `outputs/<source_key>/<post_id>.md` — Markdown content
- `outputs/<source_key>/<post_id>.json` — metadata (title, authors, date, code_blocks, classifier, original URL)
- `outputs/<source_key>/assets/<post_id>/<asset_filename>` — downloaded assets referenced from the post
- `outputs/index.json` — simple inverted index mapping tokens to posts (updated when `--index` is requested)
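For orientation, a per-post metadata file might look roughly like the following, written here as a Python literal. The top-level keys follow the list above; the nested structure of `code_blocks` and `classifier` is an assumption for illustration, not the exact on-disk schema.

```python
# Illustrative shape of outputs/<source_key>/<post_id>.json, as a Python literal.
# Top-level keys follow this README; nested details are assumptions.
example_metadata = {
    "title": "Scaling Our Data Pipeline",
    "authors": ["Jane Doe"],
    "date": "2024-01-15",
    "original_url": "https://medium.com/netflix/example-post-ac15cada49ef",
    "code_blocks": [
        {"language": "python", "code": "print('hello')"},
    ],
    "classifier": {"technical": True, "score": 0.82},
}
```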
Example usage:

```bash
python main.py --source pinterest --mode technical --format md --output outputs/pinterest --index
```

Notes for maintainers and contributors
- The HTML → Markdown conversion lives in `src/infrastructure/content_extractor.py`. It uses BeautifulSoup and `markdownify` when available, and falls back to conservative HTML-cleaning heuristics otherwise.
- Code block extraction and language detection are implemented with heuristics and a Pygments-based fallback; results appear under the `code_blocks` key in the per-post JSON metadata.
- The lightweight classifier is intentionally heuristic for now (presence/volume of code blocks, presence of technical keywords). A machine-learning classifier can be plugged in later; the extractor and persistence functions accept optional hooks for that.
- Persistence (writing `.md` and `.json` files, downloading assets and updating the index) is implemented in `src/infrastructure/persistence.py` and uses `src/infrastructure/indexer.py` to maintain `index.json` (a minimal index sketch follows below).
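To make the index format concrete, here is a minimal sketch of a file-based inverted index of the kind described above: it maps lowercase tokens to the post IDs that contain them. This is illustrative only; the actual `indexer.py` layout may differ.

```python
# Minimal sketch of a token -> post-ID inverted index stored as JSON.
# Illustrative only; the real index.json layout may differ.
import json
import re
from pathlib import Path


def update_index(index_path: Path, post_id: str, text: str) -> None:
    """Add a post's tokens to a simple file-based inverted index."""
    index = {}
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    for token in tokens:
        postings = index.setdefault(token, [])
        if post_id not in postings:
            postings.append(post_id)
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")


def search(index_path: Path, query: str) -> set:
    """Return post IDs whose indexed text contains every query token."""
    index = json.loads(index_path.read_text(encoding="utf-8"))
    results = None
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        postings = set(index.get(token, []))
        results = postings if results is None else results & postings
    return results or set()
```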
Security and safety
- Filenames for assets are sanitized before writing to disk. The persistence layer will not overwrite existing files unless explicitly allowed.
 - The index is intentionally simple and file-based; for larger collections consider migrating to a proper search engine (Elasticsearch, Meilisearch, SQLite FTS, etc.).
 
Where to start when improving extraction
- Improve Markdown fidelity: tweak `content_extractor.html_to_markdown()` and add post-processing steps for link rewriting and image `src` normalization (see the sketch after this list).
- Harden asset handling: dedupe asset downloads, support remote storage backends, and canonicalize filenames.
- Replace the heuristic classifier with an ML model: the extractor already records `code_blocks` and features to make training easier.
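As a starting point, the BeautifulSoup-plus-markdownify conversion with a conservative fallback can look like the sketch below. It is not the project's `html_to_markdown()` implementation, just a minimal illustration of the pattern, assuming `beautifulsoup4` is installed and `markdownify` is optional.

```python
# Minimal sketch of HTML -> Markdown conversion with a conservative fallback.
# Not the project's html_to_markdown(); shown only to illustrate the pattern.
from bs4 import BeautifulSoup

try:
    from markdownify import markdownify as to_markdown
except ImportError:  # optional dependency
    to_markdown = None


def html_to_markdown_sketch(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never belong in the Markdown output.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    if to_markdown is not None:
        return to_markdown(str(soup))
    # Conservative fallback: plain text with paragraph breaks preserved.
    return soup.get_text(separator="\n\n").strip()
```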
Possible next steps
- Run the test suite in CI and publish the results automatically.
- Add a sample scrape (a short JSON example) to the documentation.
- Add a `--verbose` flag to the CLI to improve debug output.