This is the code that generates scc.frontseat.org, a static site that contains summaries of upcoming meetings, agenda items, and legislative documents for the Seattle City Council.
The SCC publishes documents for upcoming meetings at seattle.legistar.com. These documents are often long and complex, and it can be difficult to understand what's going on. The Engage-o-tron™ is an attempt to make this information more accessible.
This repository contains code to generate a static website that is deployed to GitHub Pages.
We use GitHub Actions to regularly re-crawl the SCC Legistar website, extract text from PDFs, generate summaries of upcoming meetings, and deploy the static site.
In order to make this all work, a record of previously crawled data, extracted text, and summarizations is stored in a SQLite database that is checked into this repository. This follows Simon Willison's "baked data" pattern and keeps our devops both simple and zero cost. The only expense we have, at the moment, is invoking ChatGPT to generate summaries; we're currently well under the GitHub Actions monthly minutes quota for the free tier.
NOTE This was written as a pretty quick hack. It's messy. Maybe one day I'll go back and clean it all up.
The primary GitHub Actions workflow definition is `crawl-summarize-deploy.yml`, which runs a couple of times a day. It invokes our crawler; extracts text from newly discovered PDFs and Word files; generates summaries; updates the SQLite database; and generates + deploys the static site.
The code is implemented as a Django project. We use Django's ORM to interact with the SQLite database. We use Django's view and template system to generate the HTML content. We use Django Distill to generate a static site from our Django views. (It's nice to have a flexible web framework!)
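As a rough illustration of how that last piece works, django-distill's `distill_path` helper wraps a normal Django URL pattern and asks for the set of parameters to render at build time. The sketch below uses illustrative route, view, and model names, not the actual ones from `server/legistar/urls.py`:

```python
# Illustrative urls.py sketch; see server/legistar/urls.py for the real routes.
from django_distill import distill_path
from .models import Meeting          # assumed model location for this sketch
from . import views

def get_meeting_ids():
    # Yield one set of URL kwargs per page to render at build time.
    for pk in Meeting.objects.values_list("id", flat=True):
        yield {"meeting_id": pk}

urlpatterns = [
    distill_path(
        "meetings/<int:meeting_id>/",
        views.meeting_detail,        # hypothetical view name
        name="meeting-detail",
        distill_func=get_meeting_ids,
    ),
]
```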
We have a collection of command-line tools implemented as Django management commands:
- `manage.py legistar crawl-calendar` crawls the Legistar calendar page, follows links for meeting and legislation sub-pages, grabs all document attachments, and updates the database.
- `manage.py documents ...` provides tools to extract and summarize text from one or multiple documents.
- `manage.py legistar summarize ...` provides tools to summarize one or multiple legislative actions or meetings.
- `manage.py distill-local ...` builds the static site into the `dist` directory.
For every `Document`, `Legislation` (aka legislative agenda item in a meeting), and `Meeting`, we generate a summary record that includes:

- A short newspaper-style `headline`
- A longer `body` summary
- A small amount of bookkeeping/debugging data so we can better evaluate our summaries

The base model for all summarizations is found in `server/lib/summary_model.py`.
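In spirit, that base model looks something like the sketch below; the field names here are illustrative, not the exact definitions in `server/lib/summary_model.py`:

```python
from django.db import models

class SummaryBase(models.Model):
    """Illustrative abstract base for document/legislation/meeting summaries."""

    headline = models.CharField(max_length=255)   # short newspaper-style headline
    body = models.TextField()                     # longer summary text
    # Bookkeeping/debugging data so we can evaluate summaries later:
    style = models.CharField(max_length=64)       # which summarization style produced this
    extra = models.JSONField(default=dict)        # e.g. chunk counts, token usage
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        abstract = True
```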
Most of SCC's documents are large PDFs or Word files. They very rarely fit into the token context window of the LLMs that we use to generate summaries. As a result, we segment documents into smaller chunks and summarize each chunk individually. We then concatenate the summaries together to form a single summary for the entire document. (If the first round of summaries itself does not fit into the token context window, we repeat the process.)
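Stripped of the LangChain plumbing, that loop looks roughly like this; the three callables are stand-ins for the real chunker, context-window check, and LLM call:

```python
from typing import Callable

def summarize_recursively(
    text: str,
    summarize: Callable[[str], str],    # one LLM call: text -> short summary
    split: Callable[[str], list[str]],  # chunker: text -> pieces that fit in context
    fits: Callable[[str], bool],        # does this text fit in the context window?
) -> str:
    """Summarize text that may not fit in the model's context window."""
    if fits(text):
        return summarize(text)
    # Map: summarize each chunk independently...
    chunk_summaries = [summarize(chunk) for chunk in split(text)]
    # ...then reduce: join the partial summaries and repeat if still too long.
    return summarize_recursively("\n\n".join(chunk_summaries), summarize, split, fits)
```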
Right now, text extraction from PDFs uses pdfplumber. This works acceptably well for a first cut, but it fails to extract text from bitmaps found in documents. (Thankfully, most but not all SCC PDFs contain actual text, not scans.) Something like Google Cloud Vision API would likely provide vastly better extraction results.
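The extraction itself is not much more than the following sketch (not the exact code in `extract.py`); note that `extract_text()` comes back empty for pages that are pure bitmaps, which is exactly where this approach falls down:

```python
import pdfplumber

def extract_pdf_text(path: str) -> str:
    """Pull embedded text out of a PDF, page by page."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for scanned/bitmap-only pages; no OCR happens here.
            pages.append(page.extract_text() or "")
    return "\n\n".join(pages)
```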
Currently we're pretty stupid about how we split our documents into chunks. We look for obvious boundaries; if those don't provide small enough chunks, we look for newlines and sentence ends. We should be much smarter here: most SCC documents have obvious hierarchical structure, and we probably should split them into chunks based on that structure in the future.
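The current approach is roughly in the spirit of this sketch (illustrative; the real splitter works on token counts rather than characters):

```python
def split_naively(text: str, max_chars: int) -> list[str]:
    """Split text on progressively weaker boundaries until every piece fits."""
    if len(text) <= max_chars:
        return [text]
    # Try an obvious boundary first, then newlines, then sentence ends.
    for separator in ("\n\n", "\n", ". "):
        pieces = [p for p in text.split(separator) if p.strip()]
        if pieces and all(len(p) <= max_chars for p in pieces):
            return pieces
    # Last resort: hard split at the character limit.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```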
The system has a notion of a `style` for a summary. A "style" encompasses a number of things:
- The prompts used to generate summaries.
- The specific LLM used to generate summaries, and its parameters.
When I first built the site, I had close to a dozen styles, experimenting with wildly different prompts, and with different LLMs, including ChatGPT-3.5-Turbo, Vicuna 13B, and RedPajama 3B. After a lot of experimentation, I boiled things down to just a single style, called `concise`, which uses ChatGPT-3.5-Turbo and attempts to generate a neutral-voice summary that is as short as possible while still being informative.
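To make that concrete, a style can be thought of as a small bundle of configuration, roughly like the sketch below (the `SummaryStyle` class, its fields, and the parameter values are illustrative, not the repo's actual representation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SummaryStyle:
    """Illustrative bundle of everything a summarization 'style' controls."""

    name: str            # e.g. "concise"
    model: str           # which LLM to invoke, e.g. "gpt-3.5-turbo"
    temperature: float   # sampling parameter passed to the LLM
    chunk_prompt: str    # prompt used to summarize each chunk
    combine_prompt: str  # prompt used to merge chunk summaries

CONCISE = SummaryStyle(
    name="concise",
    model="gpt-3.5-turbo",
    temperature=0.4,
    chunk_prompt="Summarize the following excerpt in a neutral voice:\n\n{text}",
    combine_prompt="Combine these partial summaries into one short, neutral summary:\n\n{text}",
)
```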
One last point about summarization: I decided to use LangChain since I'd never used it before. Alas, I don't think I'll use it again. At its heart, LangChain contains a set of tools for flexibly manipulating LLM prompts. It's a fine idea, but the actual implementation leaves much to be desired. LangChain's primitives feel like they should be composable but often aren't; the library suffers from type erasure, making it difficult to know what parameters are available for a given chain; the documentation is poor; and there are key missing features — like counting actual tokens used — that limit its usefulness. I'm glad I tried it, though!
This is a Python 3.11 project, using Django 4.2, LangChain 0.0.1xx, and Django Distill 3.
We use Poetry to manage dependencies.
To get started, create a virtual environment and install the dependencies:
poetry install
Then, copy the `.env-sample` file to a `.env` file and make any changes you'd like. (You can leave it as-is for now.)
Now you have a choice. You can either use whatever data is already present in the checked-in database (`data/db.sqlite3`), or you can start with a fresh one. If you'd like to start fresh, delete the `data/db.sqlite3` file and then run:
poetry run python manage.py migrate
Great; you should have a `data/db.sqlite3` file. You're ready to go.
To crawl the Seattle Legistar website, run:
poetry run python manage.py legistar crawl-calendar --start today
You can run `crawl-calendar` multiple times in a row safely; if it encounters a document already in the database, it merely moves on. Otherwise, it adds a record of the document to the database.
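One simple way to get that idempotent behavior is Django's `get_or_create`; the sketch below uses illustrative names and is not the actual crawler code:

```python
from server.documents.models import Document  # assumed model location for this sketch

def record_document(url: str, title: str) -> None:
    """Record a crawled document, skipping any we've already seen."""
    document, created = Document.objects.get_or_create(
        url=url,                      # assumes the document URL is the unique key
        defaults={"title": title},
    )
    if not created:
        return  # already in the database; move on
    # Otherwise, fetch the attachment and queue it for text extraction/summarization.
```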
Next, extract and summarize the crawled documents, legislative agenda items, and meetings:
poetry run python manage.py documents extract all
poetry run python manage.py documents summarize all
poetry run python manage.py legistar summarize all-legislation
poetry run python manage.py legistar summarize all-meetings
From here, you can run the Django development server to see what you've got:
poetry run python manage.py runserver
Or you can build the static site into `./dist`:
poetry run python manage.py distill-local --force --collectstatic
All of the above commands are run by the GitHub Actions workflow. For local development, there's also a convenient `./scripts/update-all.sh` script that runs all of the above commands in sequence.
This is a typical Django app, albeit with an atypical build-to-static-site deployment step.
Of interest:
- `server.lib.*` contains utility code used throughout.
- `server.documents.*` contains code for storing our `Document` and `DocumentSummary` state. See `server/documents/models.py` for the database model definitions, `summarize.py` for the LangChain + OpenAI code to chunk and summarize individual documents, `extract.py` for our current (poor) PDF text extraction code, and `management/commands/documents.py` for the Django management commands that run the extraction and summarization.
- `server.legistar.*` contains code for storing meetings (`Meeting`) and legislative agenda items (`Legislation`), along with their summaries (`MeetingSummary`, `LegislationSummary`), code for crawling `seattle.legistar.com` (see `server/legistar/lib/crawl.py`), code for summarizing meetings and legislation (see `server/legistar/summarize/*.py`), and code to generate the static site (see `server/legistar/urls.py` and `server/legistar/views.py`).