# Dataset Link Check Results 📊

GitHub Repo stars GitHub forks

GitHub License Last Commit Link Checker

This directory contains detailed link check results for datasets on CMS's Provider Data Catalog (PDC). Each dataset has its own report detailing link status and data accessibility.

## Available Datasets

Here are the datasets we currently monitor:

- DAC: Doctors and Clinicians
- DF: Dialysis Facilities
- HC: Hospice Care
- HHS: Home Health Services
- HOS: Hospitals
- IRF: Inpatient Rehabilitation Facilities
- LTCH: Long-Term Care Hospitals
- NH: Nursing Homes Including Rehab Services
- PPL: Physician Office Visit Costs
- SUP: Supplier Directory

## Template for Dataset Reports

Each dataset report follows a consistent template to provide a comprehensive overview of the dataset's status and details. Below is a description of the sections included in each dataset markdown file:

### Dataset Report Structure

1. **Dataset Title**

   - Brief description of the dataset and its scope.
   - **Dataset ID:** unique identifier for the dataset.
   - **Status:** current status of the dataset (e.g., ✅ for accessible, ❌ for issues).

2. **Dataset Details**

   - **File History:** detailed history of the dataset file, including creation, modification, release, and last checked dates.

     | Activity | Description | Date |
     | --- | --- | --- |
     | Issued Date | When the dataset was created | YYYY-MM-DD |
     | Modified Date | When it was last modified | YYYY-MM-DD |
     | Release Date | When the dataset was made public | YYYY-MM-DD |
     | Last Checked | When this dataset was last tested | YYYY-MM-DD |

   - **File Overview:** metrics for the dataset file, such as filesize, row count, and column count.

     | Metric | Result |
     | --- | --- |
     | Filesize | 0.0 MB |
     | Row Count | 55 |
     | Column Count | 8 |

3. **Data Integrity Tests**

   - Summary and results of basic data integrity tests, including column count consistency, header validation, and encoding validation.

     | Test | Description | Result |
     | --- | --- | --- |
     | Column Count Consistency | Verify that all rows have the same number of columns. | |
     | Header Validation | Ensure the CSV has a header row and all headers are unique and meaningful. | |
     | Encoding Validation | Verify that the CSV file uses UTF-8 encoding. | UTF-8 |

4. **Public Access Tests**

   - Status of the PDC page, landing page, and direct download link.
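The three data integrity tests can be condensed into a small, standard-library-only sketch. This is illustrative, not the module's actual implementation; a production check would use a real CSV parser that handles quoted fields, which this naive comma split does not.

```rust
use std::collections::HashSet;

/// Returns (encoding_ok, headers_ok, columns_ok) for a CSV byte buffer.
fn check_integrity(csv_bytes: &[u8]) -> (bool, bool, bool) {
    // Encoding validation: the file must be valid UTF-8.
    let text = match std::str::from_utf8(csv_bytes) {
        Ok(t) => t,
        Err(_) => return (false, false, false),
    };
    let mut lines = text.lines();
    // Header validation: a header row must exist, with unique, non-empty names.
    let headers: Vec<&str> = match lines.next() {
        Some(h) => h.split(',').collect(),
        None => return (true, false, false),
    };
    let mut seen = HashSet::new();
    let headers_ok = headers
        .iter()
        .all(|h| !h.trim().is_empty() && seen.insert(h.trim()));
    // Column count consistency: every data row has as many fields as the header.
    let columns_ok = lines.all(|row| row.split(',').count() == headers.len());
    (true, headers_ok, columns_ok)
}

fn main() {
    let (encoding_ok, headers_ok, columns_ok) = check_integrity(b"id,name\n1,Alice\n2,Bob");
    println!("encoding: {encoding_ok}, headers: {headers_ok}, columns: {columns_ok}");
}
```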

## Automated Checks

Our GitHub Actions workflow automatically runs these link checks every three hours and sends notifications if any issues are detected. You can view the latest workflow run results by clicking the badge above.

## Contributing

If you notice any issues or have suggestions for additional datasets to monitor, please open an issue or submit a pull request. We appreciate your contributions!

## How the Dataset Reports are Generated

The dataset reports are generated using a Rust module that performs the following tasks:

1. **Fetching Datasets**

   - The module fetches the list of datasets from the PDC API.
   - Datasets are deserialized into a `Dataset` struct.

2. **Processing Datasets**

   - Datasets are processed in parallel to improve efficiency.
   - The module checks the status of each dataset's download URL and landing page.

3. **Generating Reports**

   - The module constructs a markdown report for each dataset, including:
     - Dataset details (e.g., ID, title, description, issued date, modified date, release date).
     - File history and overview (e.g., filesize, row count, column count).
     - Data integrity tests (e.g., column count consistency, header validation, encoding validation).
     - Public access tests (e.g., status of the PDC page, landing page, and direct download link).
   - Reports are saved to the `datasets` directory.

4. **Error Handling and Logging**

   - The module uses Sentry for error tracking and performance monitoring.
   - Detailed logging is performed using the `tracing` crate.
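The fetch → process → report pipeline above can be sketched in a few lines of dependency-free Rust. Everything here is an assumption for illustration: the `Dataset` fields, the report layout, and the offline `download_ok` flag stand in for the real deserialized API type and the real HTTP status checks.

```rust
use std::thread;

// Hypothetical, trimmed-down stand-in for the module's `Dataset` struct;
// the real type is deserialized from the PDC API response.
#[derive(Debug, Clone)]
struct Dataset {
    id: String,
    title: String,
    landing_page: String,
}

// Build the markdown report for one dataset. The real module determines
// `download_ok` by issuing an HTTP request to the download URL; the flag
// is passed in here so the sketch stays offline.
fn build_report(ds: &Dataset, download_ok: bool) -> String {
    let status = if download_ok { "✅" } else { "❌" };
    format!(
        "# {} {status}\n\n- **Dataset ID:** `{}`\n- **Landing page:** {}\n\n\
         | Test | Result |\n| --- | --- |\n| Direct download link | {status} |\n",
        ds.title, ds.id, ds.landing_page
    )
}

fn main() {
    // Illustrative input; real datasets come from the PDC API.
    let datasets = vec![Dataset {
        id: "example-id".into(),
        title: "Doctors and Clinicians".into(),
        landing_page: "https://data.cms.gov/provider-data/".into(),
    }];

    // Process datasets in parallel, one thread each (the real module may
    // use async tasks instead; threads keep this sketch dependency-free).
    let handles: Vec<_> = datasets
        .into_iter()
        .map(|ds| thread::spawn(move || build_report(&ds, true)))
        .collect();

    for h in handles {
        print!("{}", h.join().unwrap());
    }
}
```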

This process ensures that every dataset listed on the Provider Data Catalog is regularly tested for accessibility and data integrity, with results documented in a consistent, transparent manner.

For more details on the implementation, refer to the source code.


This README was generated with AI because I'm tired and don't want to do the documenting part. Bite me