Description
User Story
As a research data analyst,
I want to maintain data about GTFS URLs in only one place
so that I don't need to update two databases and the pipelines always stay in sync with the source of truth
Acceptance Criteria
Given the decided-upon place to maintain GTFS URLs (which should probably be Airtable):
- When the data pipeline decides which feeds to download and stores data about which feeds are associated with which agencies and services, then the data pipeline should use data from that decided-upon place.
- When the GTFS Realtime archiver decides which feeds to download, then it should collect the GTFS Realtime URLs it downloads from that same place (a sketch of what this could look like follows below).
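To make the criteria concrete, here is a minimal Python sketch of both pipelines reading their feed list from a single Airtable table via the standard Airtable REST API. The base ID, table name, and field names ("uri", "data") are assumptions for illustration, not the actual Airtable schema.

```python
"""Sketch: both pipelines source GTFS URLs from one Airtable table.

Assumptions (not confirmed by this issue): the base ID, table name, and
field names below are hypothetical placeholders.
"""
import os
import requests

AIRTABLE_API = "https://api.airtable.com/v0"
BASE_ID = "appXXXXXXXXXXXXXX"   # hypothetical base holding GTFS feed records
TABLE_NAME = "GTFS Datasets"    # hypothetical table name


def fetch_feed_records() -> list[dict]:
    """Page through the Airtable table and return every feed record."""
    headers = {"Authorization": f"Bearer {os.environ['AIRTABLE_API_KEY']}"}
    records, offset = [], None
    while True:
        params = {"offset": offset} if offset else {}
        resp = requests.get(
            f"{AIRTABLE_API}/{BASE_ID}/{requests.utils.quote(TABLE_NAME)}",
            headers=headers,
            params=params,
        )
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["records"])
        offset = payload.get("offset")
        if not offset:
            return records


def schedule_urls(records: list[dict]) -> list[str]:
    """URLs the (schedule) data pipeline should download."""
    return [r["fields"]["uri"] for r in records
            if r["fields"].get("data") == "GTFS Schedule"]


def realtime_urls(records: list[dict]) -> list[str]:
    """URLs the GTFS Realtime archiver should poll."""
    return [r["fields"]["uri"] for r in records
            if r["fields"].get("data", "").startswith("GTFS Realtime")]
```

Because both functions read from the same table, a URL change made in Airtable propagates to the schedule pipeline and the Realtime archiver without touching a second database.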
Not recommended alternative: keep agencies.yml and link the existing ITP ID + URL number to the Airtable data
There is currently a bunch of data in Airtable about GTFS feeds that should serve as the ultimate source of truth for our list of active feeds. However, the Airtable data is identified by a gtfs_dataset_id, whereas the current structure of data in the warehouse is identified by calitp_id and url_number. We need to work out how to link this data. Two possible solutions are as follows:
- Add the gtfs_dataset_id to each feed entry within agencies.yml (a sketch of this option follows below).
- Refactor most of the data warehouse and perform data migrations such that gtfs_dataset_id retires the calitp_id and url_number keys.
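To make the first option (the "not recommended" alternative named above) concrete, here is a minimal sketch of carrying gtfs_dataset_id alongside each feed URL in agencies.yml and building a lookup from the warehouse keys (itp_id + url_number) to the Airtable record ID. The YAML layout is an approximation of agencies.yml, not its exact current schema, and the agency entry is hypothetical.

```python
"""Sketch of option 1: add gtfs_dataset_id to each agencies.yml feed entry,
then join warehouse keys (itp_id, url_number) to Airtable records.
The YAML below approximates agencies.yml; it is not the real schema."""
import yaml  # pip install pyyaml

AGENCIES_YML = """
my-example-agency:            # hypothetical agency entry
  itp_id: 999
  feeds:
    - gtfs_schedule_url: https://example.com/gtfs.zip
      gtfs_dataset_id: recXXXXXXXXXXXXXX   # new field: Airtable record id
"""


def feed_key_to_dataset_id(agencies_yaml: str) -> dict[tuple[int, int], str]:
    """Map (itp_id, url_number) -> gtfs_dataset_id.

    url_number is taken as the feed's position within an agency's feed list,
    mirroring how the warehouse numbers multiple URLs per agency."""
    agencies = yaml.safe_load(agencies_yaml)
    mapping = {}
    for agency in agencies.values():
        for url_number, feed in enumerate(agency.get("feeds", [])):
            dataset_id = feed.get("gtfs_dataset_id")
            if dataset_id:
                mapping[(agency["itp_id"], url_number)] = dataset_id
    return mapping


if __name__ == "__main__":
    print(feed_key_to_dataset_id(AGENCIES_YML))
    # {(999, 0): 'recXXXXXXXXXXXXXX'}
```

This keeps agencies.yml and the existing warehouse keys intact, at the cost of maintaining the gtfs_dataset_id link in two places, which is part of why the second option (retiring calitp_id and url_number entirely) may be preferable despite the larger migration.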