justalewis/Rhet-Comp-Index

Pinakes

Python 3.12 · License: GPL-3.0

A discipline-specific bibliometric index for Rhetoric & Composition. Live at pinakes.xyz.

The index covers 44+ journals and 50,000+ articles, drawn from CrossRef, OpenAlex, RSS feeds, and a handful of custom scrapers (each scraper is rate-limited and respects each site's robots.txt; scraper.py carries inline ethics annotations per source). Records are stored in SQLite, served by Flask, and visualised with D3.js.
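As an illustration of what "rate-limited and respects robots.txt" can mean in practice, here is a minimal sketch using only the standard library. It is not the code in scraper.py: the `PoliteFetcher` class, the user-agent string, and the two-second delay are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "pinakes-example-bot"  # hypothetical user-agent string


class PoliteFetcher:
    """Gate requests on robots.txt rules and a minimum delay between calls."""

    def __init__(self, robots_lines, min_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # rules already downloaded, as text lines
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(USER_AGENT, url)

    def wait_turn(self):
        """Sleep until at least min_delay has passed since the previous call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Hypothetical robots.txt content for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

fetcher = PoliteFetcher(robots_txt, min_delay=2.0)
print(fetcher.allowed("https://journal.example/issues/1"))
print(fetcher.allowed("https://journal.example/private/x"))
```

A scraper built this way would call `allowed(url)` before each request and `wait_turn()` immediately before sending it.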

Architecture

A single Flask process serves HTML and a JSON API. Daily fetches and backups are triggered by a GitHub Actions cron workflow that POSTs to admin endpoints on that same process, so only one machine ever touches the SQLite file, which lives on a Fly.io persistent volume. Citation networks, co-citation graphs, and other analytics are computed on demand by db.py using NetworkX. Authentication on mutating endpoints uses a single shared bearer token; rate limiting uses Flask-Limiter with in-memory storage.

Client ── HTTPS ──▶ Fly.io edge ──▶ gunicorn (1 worker) ──▶ Flask (app.py)
                                                              │
                                              SQLite (WAL) ◀──┤
                                              /data/articles.db
                                                              │
GitHub Actions cron (03:00 UTC) ──▶ POST /fetch ──▶ fetcher / rss / scraper
                                ──▶ POST /api/admin/run-backup ──▶ B2
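
To make the "co-citation graph" idea concrete, the sketch below counts how often two works appear together in the same reference list. It is a standard-library illustration, not the implementation in db.py (which uses NetworkX); the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of works is cited together.

    `citations` maps a citing article to the set of works it references.
    Two works are co-cited once for every article that cites both.
    """
    pairs = Counter()
    for cited in citations.values():
        # Sort so each unordered pair always gets the same (a, b) key.
        for a, b in combinations(sorted(set(cited)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy data: three articles and the works they cite.
citations = {
    "article-1": {"work-a", "work-b", "work-c"},
    "article-2": {"work-a", "work-b"},
    "article-3": {"work-a", "work-c"},
}

for (a, b), weight in cocitation_counts(citations).most_common():
    print(f"{a} <-> {b}: {weight}")
```

Each counted pair corresponds to one weighted edge; loading the pairs into a graph library then gives access to centrality and clustering measures.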

Local development

git clone https://github.com/justalewis/Rhet-Comp-Index.git
cd Rhet-Comp-Index

python -m venv .venv && source .venv/bin/activate    # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install -r requirements-dev.txt                  # adds pytest, responses, freezegun, coverage

python app.py            # http://localhost:5000

The first run creates an empty articles.db. To populate it:

python fetcher.py        # CrossRef-indexed journals (the bulk of the corpus)
python rss_fetcher.py    # RSS / OAI / WordPress feed journals
python scraper.py        # custom HTML scrapers

These can take a long time on a cold corpus. In production, .github/workflows/cron.yml runs them automatically each night.

Running tests

pytest                                       # full suite (fast)
pytest -m "not slow"                         # explicit fast suite
pytest --cov=. --cov-report=term-missing     # with coverage

The harness uses an isolated SQLite file per test, stubs all HTTP via responses and feedparser mocks, and never touches the developer's real articles.db. CI runs the suite on every push and pull request; the Fly deploy is gated on test passage.
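
The per-test isolation described above can be pictured roughly as follows. This is a sketch under assumptions, not the fixture in conftest.py: the `isolated_db` helper and its one-table schema are invented for illustration, and a pytest fixture would more likely build on the built-in `tmp_path` fixture.

```python
import sqlite3
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_db():
    """Yield a connection to a throwaway SQLite file, deleted afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "articles.db"
        conn = sqlite3.connect(path)
        try:
            # Hypothetical minimal schema; the real one lives in db.py.
            conn.execute(
                "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)"
            )
            yield conn, path
        finally:
            # Close before the temporary directory is removed.
            conn.close()


with isolated_db() as (conn, path):
    conn.execute("INSERT INTO articles (title) VALUES (?)", ("Example",))
    count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    print(count, path.name)
```

Because every call creates a fresh file in a fresh directory, rows written by one test cannot leak into another, and the developer's own articles.db is never opened.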

Deployment

Pinakes deploys to Fly.io as a single-machine application: one gunicorn worker, one SQLite file on a persistent volume. Daily fetches and nightly backups are triggered externally by .github/workflows/cron.yml at 03:00 UTC, which POSTs to /fetch and /api/admin/run-backup respectively.

(An earlier design ran the scheduler as a second Fly process group with its own machine. That didn't work because Fly volumes are single-attach: the scheduler had no way to share /data with the app machine. See docs/refactor-notes/13-scheduler-architecture-fix.md.)

One-time setup

# Required: generate and store the admin token (used by /fetch, /api/admin/run-backup, and /health/deep)
flyctl secrets set PINAKES_ADMIN_TOKEN=$(python -c "import secrets; print(secrets.token_urlsafe(32))")

# Optional: hook up Sentry error monitoring (free tier is fine)
flyctl secrets set SENTRY_DSN='https://...@...ingest.sentry.io/...'

The Sentry DSN is optional. Without it, monitoring.py is a no-op and errors only surface in flyctl logs. With it, the web process reports errors with component=web; ingestion errors are additionally tagged with source=crossref|rss|scrape|openalex|citations and (when known) journal=<name>.

Daily fetch + backup (cron-driven)

Pinakes triggers its daily fetch and nightly backup via .github/workflows/cron.yml, which hits POST /fetch and POST /api/admin/run-backup at 03:00 UTC. Both endpoints require PINAKES_ADMIN_TOKEN.

To enable, add PINAKES_ADMIN_TOKEN as a GitHub Actions repository secret (Settings → Secrets and variables → Actions → New repository secret). Use the same value you set as the Fly secret.

The workflow can also be triggered manually from the Actions tab (workflow_dispatch).
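
For readers curious how a shared bearer token is typically checked on the server side, here is a minimal sketch. It is not the decorator in auth.py; the `is_authorized` function is hypothetical, and the example token is a placeholder for the value of PINAKES_ADMIN_TOKEN.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True only for a well-formed `Bearer <token>` header that matches."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, supplied = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        supplied.strip().encode(), expected_token.encode()
    )


token = "example-token"  # placeholder; production would read PINAKES_ADMIN_TOKEN
print(is_authorized("Bearer example-token", token))
print(is_authorized("Bearer wrong", token))
print(is_authorized(None, token))
```

Rejecting the request when the expected token is empty means a missing secret fails closed instead of leaving the endpoints open.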

Backups (recommended)

Backups run inside the cron workflow above: POST /api/admin/run-backup snapshots the DB, compresses with zstd, encrypts with age, uploads to an S3-compatible bucket, and prunes per the 30-daily / 26-weekly / 12-monthly retention policy. Each successful run writes /data/scheduler.heartbeat so /health/deep reports scheduler_healthy: true. See docs/runbooks/disaster-recovery.md for restoration.

# 1. Generate an age key pair locally; KEEP the private key off Fly.
mkdir -p ~/.pinakes                 # age-keygen does not create the directory
age-keygen -o ~/.pinakes/age.key
PUB=$(grep '^# public key:' ~/.pinakes/age.key | cut -d' ' -f4)

# 2. Create a Backblaze B2 bucket "pinakes-backup" and an application key
#    scoped to that bucket. Note its keyID and applicationKey.

# 3. Set six Fly secrets:
flyctl secrets set \
  PINAKES_BACKUP_BUCKET=pinakes-backup \
  PINAKES_BACKUP_ENDPOINT=https://s3.us-west-002.backblazeb2.com \
  PINAKES_BACKUP_REGION=us-west-002 \
  PINAKES_BACKUP_ACCESS_KEY_ID=<keyID> \
  PINAKES_BACKUP_SECRET_KEY=<applicationKey> \
  PINAKES_BACKUP_AGE_PUBLIC_KEY=$PUB

The age private key is the most important secret in this project. Store it in a password manager AND a paper backup. If you lose it, every backup becomes unrecoverable.

You can verify backups by triggering one manually and inspecting the resulting B2 object:

curl -X POST -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" \
  https://pinakes.xyz/api/admin/run-backup | jq

To restore manually:

python restore.py --list
python restore.py --latest --out ./restored.db --age-key ~/.pinakes/age.key
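
The 30-daily / 26-weekly / 12-monthly retention policy mentioned in this section can be read in more than one way. The sketch below assumes the common interpretation (keep the newest backup in each of the most recent 30 days, 26 ISO weeks, and 12 calendar months); the project's actual pruning code may differ, and `backups_to_keep` is a hypothetical name.

```python
from datetime import date, timedelta


def backups_to_keep(dates, daily=30, weekly=26, monthly=12):
    """Return the set of backup dates retained under the policy.

    For each rule, walk backups newest-first and keep the newest backup in
    each distinct period (day, ISO week, month) until the quota is filled.
    """
    rules = [
        (daily, lambda d: d),                     # one bucket per day
        (weekly, lambda d: d.isocalendar()[:2]),  # (ISO year, ISO week)
        (monthly, lambda d: (d.year, d.month)),   # (year, month)
    ]
    keep = set()
    ordered = sorted(set(dates), reverse=True)
    for quota, bucket_of in rules:
        seen = set()
        for d in ordered:
            bucket = bucket_of(d)
            if bucket in seen:
                continue  # a newer backup already represents this period
            if len(seen) >= quota:
                break     # quota reached; everything older is out of range
            seen.add(bucket)
            keep.add(d)
    return keep


# Two years of hypothetical nightly backups ending on a fixed date.
today = date(2026, 1, 31)
all_backups = [today - timedelta(days=n) for n in range(730)]
kept = backups_to_keep(all_backups)
pruned = set(all_backups) - kept
print(len(kept), len(pruned))
```

A backup survives if any one of the three rules selects it, so the kept set is the union of the three selections and never exceeds 68 objects.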

Health endpoints

Endpoint            Auth          Purpose
GET /health         none          Liveness: process is up. <50ms, no DB.
GET /health/ready   none          Readiness: DB is reachable. Used by Fly's check loop.
GET /health/deep    admin token   Full diagnostic: counts, last-fetch, disk, scheduler heartbeat, integrity check.

curl -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" https://pinakes.xyz/health/deep | jq
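
Two of the checks behind these endpoints are easy to picture in code. The sketch below is not the code in health.py: the function names and the 36-hour freshness threshold are assumptions, chosen so that a single missed nightly run would not immediately count as unhealthy.

```python
import sqlite3
import time
from pathlib import Path


def db_ready(db_path):
    """Readiness probe: True if a trivial query succeeds against the database."""
    try:
        conn = sqlite3.connect(db_path, timeout=1.0)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def heartbeat_fresh(path, max_age_seconds=36 * 3600, now=None):
    """True if the heartbeat file exists and was modified recently enough."""
    try:
        mtime = Path(path).stat().st_mtime
    except OSError:
        return False  # a missing or unreadable heartbeat counts as unhealthy
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_seconds


print(db_ready(":memory:"))
print(heartbeat_fresh("/data/scheduler.heartbeat"))
```

The readiness probe stays cheap enough for a frequent check loop, while the heartbeat test only needs a file's modification time and so works even when the process that wrote it has long since exited.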

Triggering a fetch manually

The cron workflow runs daily. To force a fetch early:

curl -X POST -H "Authorization: Bearer $PINAKES_ADMIN_TOKEN" https://pinakes.xyz/fetch

Project structure

Rhet-Comp-Index/
├── app.py                       Flask web server, all routes
├── auth.py                      Admin-token decorator
├── rate_limit.py                Flask-Limiter configuration
├── health.py                    /health, /health/ready, /health/deep
├── db.py                        SQLite layer + analytics queries
├── journals.py                  Journal definitions (CrossRef, RSS, scrape, manual)
├── tagger.py                    Controlled-vocabulary auto-tagging
├── (scheduler.py removed in sched-fix; cron lives in .github/workflows/cron.yml)
│
├── fetcher.py                   CrossRef API ingester
├── rss_fetcher.py               RSS / OAI-PMH / WordPress ingester
├── scraper.py                   Per-journal HTML scrapers
│
├── enrich.py                    Wrapper coordinating enrichment passes
├── enrich_openalex.py           OpenAlex affiliation + abstract enrichment
├── openalex_citations.py        OpenAlex citation backfill
├── cite_fetcher.py              CrossRef references → citation edges
├── backfill_abstracts.py        Back-fill missing abstracts via OpenAlex
├── book_fetcher.py              CrossRef book + chapter ingester
├── fetch_institutions.py        Institution affiliation enrichment
│
├── coverage_report.py           Coverage snapshot generator (used by /coverage)
├── crossref_book_probe.py       Probe for which publishers index in CrossRef
├── cull_upc.py                  One-off: prune unwanted UP-only chapters
├── ingest_peer_review_1_1.py    One-off: load Peer Review 1.1 references
├── probe_new_publishers.py      Exploratory: find new publishers to add
├── retag.py                     Re-run the auto-tagger on existing rows
├── scrape_ccdp.py               One-off: scrape CCDP catalogue
├── scrape_lics_refs.py          One-off: scrape LiCS reference lists
├── seed_usu_rhet_comp.py        One-off: seed USU Press records
├── weekly_maintenance.py        Wrapper invoked weekly on Fly
│
├── fetch_parlor.py              ┐
├── fetch_pitt.py                ├─ Per-press book scrapers, run on demand
├── fetch_routledge.py           │
├── fetch_siup.py                ┘
│
├── templates/                   Jinja2 templates (base.html → base-core.html)
├── static/                      style.css + theme variants + explore.js (D3)
├── data/seeds/                  Hand-curated ingestion inputs (see README there)
├── docs/                        Architecture notes, methodology, refactor notes
├── tests/                       Pytest harness (~320 tests)
│
├── conftest.py                  Test fixtures
├── pytest.ini                   Test config
├── requirements.txt             Runtime deps
├── requirements-dev.txt         Dev / test deps
├── Dockerfile                   Multi-stage build for Fly
├── fly.toml                     Fly deployment config (single-machine)
└── articles.db                  SQLite database (gitignored; on /data in prod)

Documentation

Document                    For
Architecture                How Pinakes is built: system overview, data flow, ingestion, deployment
Methodology                 What each Explore tool measures, how it's computed, and the scholarly references
Journal coverage            Every venue indexed by Pinakes with its ingestion path
Disaster recovery runbook   Restoring from off-machine backups
Refactor notes              Audit trail for the structural improvements (sessions A1 through G1)

Contributing

The project is maintained by one person and external contributions are rare, but bug reports and small PRs are welcome. See CONTRIBUTING.md for the testing requirements and the scraping-ethics review that applies to any new scraper.

License

Released under the GNU GPL 3.0.

Citation

A research note describing the index is in preparation for the Journal of Writing Analytics. Until that lands, please cite the repository:

@misc{lewis_pinakes_2026,
  author = {Lewis, Justin},
  title  = {Pinakes: A Bibliometric Index for Rhetoric and Composition},
  year   = {2026},
  url    = {https://github.com/justalewis/Rhet-Comp-Index},
}
