This project periodically checks various links on CMS's Provider Data Catalog (PDC) to ensure that all links are operational and all data is accessible.
THIS IS NOT AN OFFICIAL GOVERNMENT CODEBASE.
Check the Archives.md file to see the status summary and detailed reports of each data topic archive.
To see the datasets, go to Datasets README.
-
Link Validation: We check CMS PDC links to make sure they're not ghosting 👻 us.
-
Categorized Reports: We neatly categorize this info in `Archives.md` and associated dataset markdown files. I'm really proud of how this turned out.
-
Dataset Analysis: We test archive files and all datasets on the PDC. The datasets are analyzed, and basic checks are performed on them.
-
Public Data Check: Uses the PDC API to find the archive and dataset files and then checks to make sure they exist.
-
Summarized Status: A quick glance at the top tells you if things are going smoothly or if there's trouble.
-
Detailed Dataset Reports: To view the dataset results, go to `datasets/README.md`, where you can find links to various data topics.
-
GitHub Actions: Automatically keeps things in check every time you push to the `main` branch. It also runs every three hours and sends notifications if there is a ❌ in the results file.
-
Error and Performance Monitoring: Uses Sentry for error and performance monitoring. If you don't want to use it, just leave `SENTRY_DSN` unset.
-
Logging Levels: Select your log level by setting `LOG_LEVEL`. Options are `error`, `warn`, `info`, `debug`, `trace`.
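As a rough illustration of the two environment variables above, here is a minimal std-only sketch of how `LOG_LEVEL` and `SENTRY_DSN` could be read. The `LogLevel` enum, both function names, and the fallback to `info` for missing or unrecognized values are assumptions made for this example, not the project's actual code.

```rust
use std::env;

/// Log levels listed in this README, from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogLevel {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Parse a `LOG_LEVEL` value. Falling back to `Info` for a missing or
/// unrecognized value is an assumption of this sketch.
fn parse_log_level(raw: Option<&str>) -> LogLevel {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("error") => LogLevel::Error,
        Some("warn") => LogLevel::Warn,
        Some("info") => LogLevel::Info,
        Some("debug") => LogLevel::Debug,
        Some("trace") => LogLevel::Trace,
        _ => LogLevel::Info,
    }
}

/// Treat Sentry as enabled only when `SENTRY_DSN` is set and non-empty.
fn sentry_enabled(dsn: Option<&str>) -> bool {
    dsn.map_or(false, |s| !s.trim().is_empty())
}

fn main() {
    let level = parse_log_level(env::var("LOG_LEVEL").ok().as_deref());
    let sentry = sentry_enabled(env::var("SENTRY_DSN").ok().as_deref());
    println!("log level: {:?}, sentry enabled: {}", level, sentry);
}
```

Running it as `LOG_LEVEL=debug cargo run` with `SENTRY_DSN` unset would report the `Debug` level with Sentry disabled.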
The dataset reports are generated using a Rust module that performs the following tasks:
-
Fetching Datasets
- The module fetches a list of datasets from the PDC API.
- Datasets are deserialized into a `Dataset` struct.
-
Processing Datasets
- Each dataset is processed in parallel to improve efficiency.
- The module checks the status of the dataset's download URL and landing page.
-
Generating Reports
- The module constructs a markdown report for each dataset, including:
- Dataset details (e.g., ID, title, description, issued date, modified date, release date).
- File history and overview (e.g., filesize, row count, column count).
- Data integrity tests (e.g., column count consistency, header validation, encoding validation).
- Public access tests (e.g., status of PDC page, landing page, and direct download link).
- Reports are saved to the `datasets` directory.
-
Error Handling and Logging
- The module uses Sentry for error tracking and performance monitoring.
- Detailed logging is performed using the `tracing` crate.
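The processing and report steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module itself: the `Dataset` fields shown are a hypothetical subset, `AccessReport`, `check_all`, and `render_report` are made-up names, the ✅ marker is assumed, and the `is_reachable` closure stands in for a real HTTP status check so the example runs offline.

```rust
use std::thread;

/// Hypothetical subset of the fields in the project's `Dataset` struct.
#[derive(Debug, Clone)]
struct Dataset {
    id: String,
    title: String,
    download_url: String,
    landing_page: String,
}

/// Outcome of the public access checks for one dataset.
#[derive(Debug, Clone, PartialEq)]
struct AccessReport {
    id: String,
    download_ok: bool,
    landing_ok: bool,
}

/// Check every dataset on its own scoped thread. `is_reachable` stands in
/// for a real HTTP status check, which this sketch deliberately omits.
fn check_all<F>(datasets: &[Dataset], is_reachable: F) -> Vec<AccessReport>
where
    F: Fn(&str) -> bool + Sync,
{
    let is_reachable = &is_reachable;
    thread::scope(|scope| {
        let handles: Vec<_> = datasets
            .iter()
            .map(|d| {
                scope.spawn(move || AccessReport {
                    id: d.id.clone(),
                    download_ok: is_reachable(&d.download_url),
                    landing_ok: is_reachable(&d.landing_page),
                })
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

/// Render one dataset's results as a small markdown report.
fn render_report(dataset: &Dataset, report: &AccessReport) -> String {
    let mark = |ok: bool| if ok { "✅" } else { "❌" };
    format!(
        "# {}\n\n- ID: {}\n- Landing page: {}\n- Direct download: {}\n",
        dataset.title,
        dataset.id,
        mark(report.landing_ok),
        mark(report.download_ok)
    )
}

fn main() {
    let datasets = vec![Dataset {
        id: "abc-123".to_string(),
        title: "Example dataset".to_string(),
        download_url: "https://example.invalid/data.csv".to_string(),
        landing_page: "https://example.invalid/landing".to_string(),
    }];
    // Fake checker for the demo: pretend only CSV links respond.
    let reports = check_all(&datasets, |url| url.ends_with(".csv"));
    for (dataset, report) in datasets.iter().zip(&reports) {
        println!("{}", render_report(dataset, report));
    }
}
```

One thread per dataset is the simplest way to show the parallel step. A real run over the whole catalog would more likely use a bounded worker pool or an async runtime.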
The archive reports are generated using another Rust module that performs the following tasks:
-
Fetching Archives
- The module fetches a list of archive topics from the PDC API.
- Archive topics are deserialized into a `Topics` struct.
-
Processing Archives
- Each archive is processed to check the status of yearly and monthly archive URLs.
- The module generates a summary of the yearly and monthly archive checks.
-
Generating Reports
- The module constructs a markdown report for each archive topic, including:
- Archive details and statuses.
- Summary of yearly and monthly archive checks.
- Reports are saved to the `Archives.md` file.
-
Error Handling and Logging
- The module uses Sentry for error tracking and performance monitoring.
- Detailed logging is performed using the `tracing` crate.
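To make the yearly and monthly summary concrete, here is a small std-only sketch of how check results could be tallied and rendered per topic. The `Period`, `ArchiveCheck`, and `Summary` types, the function names, and the exact markdown layout are all hypothetical. The real `Archives.md` format may differ.

```rust
/// Kind of archive link, following the yearly/monthly split described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Period {
    Yearly,
    Monthly,
}

/// Result of checking one archive URL. Field names are hypothetical.
#[derive(Debug, Clone)]
struct ArchiveCheck {
    period: Period,
    url: String,
    ok: bool,
}

/// Counts of passing and total checks for one period.
#[derive(Debug, PartialEq, Eq)]
struct Summary {
    passed: usize,
    total: usize,
}

/// Tally the checks that belong to `period`.
fn summarize(checks: &[ArchiveCheck], period: Period) -> Summary {
    let mut summary = Summary { passed: 0, total: 0 };
    for check in checks.iter().filter(|c| c.period == period) {
        summary.total += 1;
        if check.ok {
            summary.passed += 1;
        }
    }
    summary
}

/// Render one topic: an overall status, both summaries, then failing URLs.
fn render_topic(topic: &str, checks: &[ArchiveCheck]) -> String {
    let yearly = summarize(checks, Period::Yearly);
    let monthly = summarize(checks, Period::Monthly);
    let status = if checks.iter().all(|c| c.ok) { "✅" } else { "❌" };
    let mut out = format!(
        "## {} {}\n\n- Yearly: {}/{} ok\n- Monthly: {}/{} ok\n",
        topic, status, yearly.passed, yearly.total, monthly.passed, monthly.total
    );
    for check in checks.iter().filter(|c| !c.ok) {
        out.push_str(&format!("- ❌ {}\n", check.url));
    }
    out
}

fn main() {
    // Made-up results for illustration; no network access happens here.
    let checks = vec![
        ArchiveCheck { period: Period::Yearly, url: "https://example.invalid/2023.zip".to_string(), ok: true },
        ArchiveCheck { period: Period::Monthly, url: "https://example.invalid/2024-01.zip".to_string(), ok: false },
        ArchiveCheck { period: Period::Monthly, url: "https://example.invalid/2024-02.zip".to_string(), ok: true },
    ];
    print!("{}", render_topic("Hospitals", &checks));
}
```

A single failing link flips the topic's status to ❌, which matches the idea that any ❌ in the results file should trigger a notification.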
-
Clone the repo:
git clone https://github.com/TheBoatyMcBoatFace/good-pdc.git
cd good-pdc
-
Run locally:
cargo run
-
Vibe Check:
Open up `Archives.md` and see if there are any ❌. Hint: those are bad.
-
Automate with GitHub Actions:
Push to `main` to run the bot thing. It also runs every three hours and sends notifications if there is a ❌ in the results file.
You're awesome for wanting to help (just saying). Here are some guidelines:
-
Open issues: If you find bugs or have cool ideas, open an issue. No issue = it doesn't exist.
-
Don't be a jerk: I am not afraid to use the ban 🔨. GitHub is the best social media platform, don't ruin it.
This is aggressively open-source under the AGPL-3.0 license. Details in the LICENSE file.
- Where the data come from: Provider Data Catalog (PDC) API Docs. Yes, data are plural. That wasn't a typo.
- Sentry for error tracking and performance monitoring: Sentry Setup Guide
- GitHub Action for automated checks: GitHub Actions Documentation
This README was generated with AI because I'm tired and don't want to do the documenting part. Bite me