A discipline-specific bibliometric index for Rhetoric & Composition. Live at pinakes.xyz.
The index covers 44+ journals and 50,000+ articles, drawn from CrossRef, OpenAlex, RSS feeds, and a handful of custom scrapers (each scraper is rate-limited and respects each site's robots.txt; scraper.py carries inline ethics annotations per source). Records are stored in SQLite, served by Flask, and visualised with D3.js.
A single Flask process serves HTML and a JSON API. Daily fetches and nightly backups are triggered externally by a GitHub Actions cron workflow that POSTs to the app, so there is no separate scheduler process. All records live in one SQLite file on a Fly.io persistent volume. Citation networks, co-citation graphs, and other analytics are computed on demand by db.py using NetworkX. Authentication on mutating endpoints uses a single shared bearer token; rate limiting uses Flask-Limiter with in-memory storage.
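To make the co-citation idea concrete: two works are co-cited whenever a third article cites both of them, and the number of such articles becomes the weight of the edge between them. The following is a minimal standard-library sketch of that counting step only. The function name and the input shape are hypothetical, and the project's actual queries in db.py use NetworkX and may be organised quite differently.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(references):
    """Count how often each pair of works is cited together.

    `references` maps a citing article ID to the IDs it cites
    (a hypothetical input shape, not the project's schema).
    """
    counts = Counter()
    for cited in references.values():
        # Sort and de-duplicate so (a, b) and (b, a) collapse to one pair.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


refs = {
    "A": ["X", "Y", "Z"],  # article A cites X, Y and Z
    "B": ["X", "Y"],       # article B cites X and Y
    "C": ["Y"],            # a single reference produces no pairs
}
counts = cocitation_counts(refs)
print(counts[("X", "Y")])  # X and Y are cited together by A and B
```

Each `(a, b)` pair and its count could then be loaded into a NetworkX graph as a weighted edge for layout or centrality measures.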
Client ── HTTPS ──▶ Fly.io edge ──▶ gunicorn (1 worker) ──▶ Flask (app.py)
                                                              │
                                              SQLite (WAL) ◀──┤
                                              /data/articles.db
                                                              │
GitHub Actions cron (03:00 UTC) ──▶ POST /fetch ──▶ fetcher / rss / scraper
                                ──▶ POST /api/admin/run-backup ──▶ B2
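The diagram marks the database as running in WAL (write-ahead logging) mode, which lets readers keep reading while a write is in progress. As a rough illustration of what enabling it looks like with Python's built-in sqlite3 module, here is a sketch against a throwaway file. The helper name, the table, and the 30-second timeout are assumptions for the example, not the project's db.py.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_wal(path):
    """Open a SQLite connection and switch the database to WAL mode."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30 s on a locked DB
    # The PRAGMA returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "articles.db"  # throwaway file, not /data/articles.db
    writer, mode = open_wal(db_path)
    writer.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    writer.execute("INSERT INTO articles (title) VALUES ('example')")
    writer.commit()

    # A second connection sees the committed row while the first stays open.
    reader = sqlite3.connect(db_path)
    row_count = reader.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    reader.close()
    writer.close()

print(mode, row_count)
```

WAL mode is stored in the database file itself, so it only needs to be set once and persists across connections.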
git clone https://github.com/justalewis/Rhet-Comp-Index.git
cd Rhet-Comp-Index
python -m venv .venv && source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install -r requirements-dev.txt # adds pytest, responses, freezegun, coverage
python app.py # http://localhost:5000

The first run creates an empty articles.db. To populate it:
python fetcher.py # CrossRef-indexed journals (the bulk of the corpus)
python rss_fetcher.py # RSS / OAI / WordPress feed journals
python scraper.py # custom HTML scrapers

These can take a long time on a cold corpus. In production, .github/workflows/cron.yml runs them automatically each night.
pytest # full suite (fast)
pytest -m "not slow" # explicit fast suite
pytest --cov=. --cov-report=term-missing # with coverage

The harness uses an isolated SQLite file per test, stubs all HTTP via responses and feedparser mocks, and never touches the developer's real articles.db. CI runs the suite on every push and pull request; the Fly deploy is gated on test passage.
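The per-test isolation can be pictured with a small standard-library sketch: every test gets its own database file in a fresh temporary directory, so nothing leaks between tests or into a real articles.db. The context manager and the two-column schema below are hypothetical stand-ins; the project's actual fixtures live in conftest.py and may work differently.

```python
import sqlite3
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_db():
    """Yield a connection to a throwaway SQLite file, then delete it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test-articles.db"
        conn = sqlite3.connect(path)
        try:
            # Hypothetical minimal schema; the real one is defined by db.py.
            conn.execute("CREATE TABLE articles (doi TEXT PRIMARY KEY, title TEXT)")
            yield conn, path
        finally:
            conn.close()


with isolated_db() as (conn, path):
    conn.execute("INSERT INTO articles VALUES ('10.0000/example', 'Example')")
    first_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]

with isolated_db() as (conn, path):
    # A second "test" starts from an empty table again.
    second_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]

print(first_count, second_count)
```

Under pytest the same idea is usually expressed as a fixture built on the built-in `tmp_path` fixture, which hands each test its own temporary directory.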
Pinakes deploys to Fly.io as a single-machine application: one gunicorn worker, one SQLite file on a persistent volume. Daily fetches and nightly backups are triggered externally by .github/workflows/cron.yml at 03:00 UTC, which POSTs to /fetch and /api/admin/run-backup respectively.
(An earlier design ran the scheduler as a second Fly process group with its own machine. That didn't work because Fly volumes are single-attach: the scheduler had no way to share /data with the app machine. See docs/refactor-notes/13-scheduler-architecture-fix.md.)
# Required: generate and store the admin token (used by /fetch, /api/admin/run-backup, and /health/deep)
flyctl secrets set PINAKES_ADMIN_TOKEN=$(python -c "import secrets; print(secrets.token_urlsafe(32))")
# Optional: hook up Sentry error monitoring (free tier is fine)
flyctl secrets set SENTRY_DSN='https://...@...ingest.sentry.io/...'

The Sentry DSN is optional. Without it, monitoring.py is a no-op and errors only surface in flyctl logs. With it, the web process reports errors with component=web; ingestion errors are additionally tagged with source=crossref|rss|scrape|openalex|citations and (when known) journal=<name>.
Pinakes triggers its daily fetch and nightly backup via .github/workflows/cron.yml, which hits POST /fetch and POST /api/admin/run-backup at 03:00 UTC. Both endpoints require PINAKES_ADMIN_TOKEN.
To enable, add PINAKES_ADMIN_TOKEN as a GitHub Actions repository secret (Settings → Secrets and variables → Actions → New repository secret). Use the same value you set as the Fly secret.
The workflow can also be triggered manually from the Actions tab (workflow_dispatch).
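The token is presented as an `Authorization: Bearer ...` header, as in the curl examples in this README. The check behind such a header can be sketched in a framework-agnostic way as follows. This is not the code in auth.py: the function name is hypothetical, and the constant-time comparison is a common precaution rather than something the source confirms the project does.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True when the header is 'Bearer <token>' with the right token."""
    if not authorization_header or not expected_token:
        return False  # missing header or unconfigured token: always reject
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


token = "example-token"  # placeholder; the real value is PINAKES_ADMIN_TOKEN
accepted = is_authorized("Bearer example-token", token)
rejected = is_authorized("Bearer wrong-token", token)
print(accepted, rejected)
```

Rejecting requests when no token is configured keeps a misconfigured deployment from silently exposing the mutating endpoints.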
Backups run inside the cron workflow above: POST /api/admin/run-backup snapshots the DB, compresses with zstd, encrypts with age, uploads to an S3-compatible bucket, and prunes per the 30-daily / 26-weekly / 12-monthly retention policy. Each successful run writes /data/scheduler.heartbeat so /health/deep reports scheduler_healthy: true. See docs/runbooks/disaster-recovery.md for restoration.
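One common reading of a 30-daily / 26-weekly / 12-monthly policy is: keep the newest backup from each of the last 30 days, the last 26 ISO weeks, and the last 12 calendar months, and prune everything else. The sketch below implements that reading over plain dates. It is an illustration of the policy's shape under that assumption, not the project's pruning code, and the function name is hypothetical.

```python
from datetime import date, timedelta


def day_bucket(d):
    return d


def week_bucket(d):
    return tuple(d.isocalendar()[:2])  # (ISO year, ISO week number)


def month_bucket(d):
    return (d.year, d.month)


def backups_to_keep(dates, daily=30, weekly=26, monthly=12):
    """Return the set of backup dates retained under the assumed policy."""
    keep = set()
    newest_first = sorted(set(dates), reverse=True)
    for limit, bucket_of in ((daily, day_bucket),
                             (weekly, week_bucket),
                             (monthly, month_bucket)):
        seen = []
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket in seen:
                continue          # an older backup in an already-covered bucket
            if len(seen) == limit:
                break             # this tier already has its N buckets
            seen.append(bucket)
            keep.add(d)           # newest backup in a new bucket
    return keep


# One backup per day for 400 days, ending on 2026-01-31.
today = date(2026, 1, 31)
dates = [today - timedelta(days=i) for i in range(400)]
keep = backups_to_keep(dates)
print(len(keep), "of", len(dates), "backups retained")
```

Because the tiers overlap (the newest daily backup is also the newest weekly and monthly one), the retained set is always smaller than 30 + 26 + 12.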
# 1. Generate an age key pair locally; KEEP the private key off Fly.
age-keygen -o ~/.pinakes/age.key
PUB=$(grep '^# public key:' ~/.pinakes/age.key | cut -d' ' -f4)
# 2. Create a Backblaze B2 bucket "pinakes-backup" and an application key
# scoped to that bucket. Note its keyID and applicationKey.
# 3. Set six Fly secrets:
flyctl secrets set \
PINAKES_BACKUP_BUCKET=pinakes-backup \
PINAKES_BACKUP_ENDPOINT=https://s3.us-west-002.backblazeb2.com \
PINAKES_BACKUP_REGION=us-west-002 \
PINAKES_BACKUP_ACCESS_KEY_ID=<keyID> \
PINAKES_BACKUP_SECRET_KEY=<applicationKey> \
PINAKES_BACKUP_AGE_PUBLIC_KEY=$PUB

The age private key is the most important secret in this project. Store it in a password manager AND a paper backup. If you lose it, every backup becomes unrecoverable.
You can verify backups by triggering one manually and inspecting the resulting B2 object:
curl -X POST -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" \
https://pinakes.xyz/api/admin/run-backup | jq

To restore manually:
python restore.py --list
python restore.py --latest --out ./restored.db --age-key ~/.pinakes/age.key

| Endpoint | Auth | Purpose |
|---|---|---|
| GET /health | none | Liveness: process is up. <50ms, no DB. |
| GET /health/ready | none | Readiness: DB is reachable. Used by Fly's check loop. |
| GET /health/deep | admin token | Full diagnostic: counts, last-fetch, disk, scheduler heartbeat, integrity check. |
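The scheduler heartbeat reported by the deep check is backed by the /data/scheduler.heartbeat file that each successful backup run writes. One plausible way to turn such a file into a healthy/unhealthy flag is to compare its modification time against a maximum age, as sketched below. The function name and the 36-hour window are assumptions chosen for a job that runs once a day; the real logic in health.py may differ.

```python
import os
import tempfile
import time
from pathlib import Path


def heartbeat_is_fresh(path, max_age_seconds=36 * 3600, now=None):
    """Return True if the heartbeat file exists and was modified recently.

    The 36-hour default is an assumed threshold, not the project's setting.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable file counts as unhealthy
    return age <= max_age_seconds


with tempfile.TemporaryDirectory() as tmp:
    heartbeat = Path(tmp) / "scheduler.heartbeat"  # stand-in for /data/scheduler.heartbeat
    missing = heartbeat_is_fresh(heartbeat)
    heartbeat.touch()                              # simulate a successful run
    fresh = heartbeat_is_fresh(heartbeat)
    # Pretend 48 hours have passed since the last successful run.
    stale = heartbeat_is_fresh(heartbeat, now=time.time() + 48 * 3600)

print(missing, fresh, stale)
```

A window somewhat longer than the cron interval tolerates a late or slow run without flagging a false alarm.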
curl -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" https://pinakes.xyz/health/deep | jq

The cron workflow runs daily. To force a fetch early:
curl -X POST -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" https://pinakes.xyz/fetch

Rhet-Comp-Index/
├── app.py Flask web server, all routes
├── auth.py Admin-token decorator
├── rate_limit.py Flask-Limiter configuration
├── health.py /health, /health/ready, /health/deep
├── db.py SQLite layer + analytics queries
├── journals.py Journal definitions (CrossRef, RSS, scrape, manual)
├── tagger.py Controlled-vocabulary auto-tagging
├── (scheduler.py removed in sched-fix; cron lives in .github/workflows/cron.yml)
│
├── fetcher.py CrossRef API ingester
├── rss_fetcher.py RSS / OAI-PMH / WordPress ingester
├── scraper.py Per-journal HTML scrapers
│
├── enrich.py Wrapper coordinating enrichment passes
├── enrich_openalex.py OpenAlex affiliation + abstract enrichment
├── openalex_citations.py OpenAlex citation backfill
├── cite_fetcher.py CrossRef references → citation edges
├── backfill_abstracts.py Back-fill missing abstracts via OpenAlex
├── book_fetcher.py CrossRef book + chapter ingester
├── fetch_institutions.py Institution affiliation enrichment
│
├── coverage_report.py Coverage snapshot generator (used by /coverage)
├── crossref_book_probe.py Probe for which publishers index in CrossRef
├── cull_upc.py One-off: prune unwanted UP-only chapters
├── ingest_peer_review_1_1.py One-off: load Peer Review 1.1 references
├── probe_new_publishers.py Exploratory: find new publishers to add
├── retag.py Re-run the auto-tagger on existing rows
├── scrape_ccdp.py One-off: scrape CCDP catalogue
├── scrape_lics_refs.py One-off: scrape LiCS reference lists
├── seed_usu_rhet_comp.py One-off: seed USU Press records
├── weekly_maintenance.py Wrapper invoked weekly on Fly
│
├── fetch_parlor.py ┐
├── fetch_pitt.py ├─ Per-press book scrapers, run on demand
├── fetch_routledge.py │
├── fetch_siup.py ┘
│
├── templates/ Jinja2 templates (base.html → base-core.html)
├── static/ style.css + theme variants + explore.js (D3)
├── data/seeds/ Hand-curated ingestion inputs (see README there)
├── docs/ Architecture notes, methodology, refactor notes
├── tests/ Pytest harness (~ 320 tests)
│
├── conftest.py Test fixtures
├── pytest.ini Test config
├── requirements.txt Runtime deps
├── requirements-dev.txt Dev / test deps
├── Dockerfile Multi-stage build for Fly
├── fly.toml Fly deployment config (single-machine)
└── articles.db SQLite database (gitignored; on /data in prod)
| Document | For |
|---|---|
| Architecture | How Pinakes is built — system overview, data flow, ingestion, deployment |
| Methodology | What each Explore tool measures, how it's computed, and the scholarly references |
| Journal coverage | Every venue indexed by Pinakes with its ingestion path |
| Disaster recovery runbook | Restoring from off-machine backups |
| Refactor notes | Audit trail for the structural improvements (sessions A1 through G1) |
The project is maintained by one person and external contributions are rare, but bug reports and small PRs are welcome. See CONTRIBUTING.md for the testing requirements and the scraping-ethics review that applies to any new scraper.
Released under the GNU GPL 3.0.
A research note describing the index is in preparation for the Journal of Writing Analytics. Until that lands, please cite the repository:
@misc{lewis_pinakes_2026,
author = {Lewis, Justin},
title = {Pinakes: A Bibliometric Index for Rhetoric and Composition},
year = {2026},
url = {https://github.com/justalewis/Rhet-Comp-Index},
}