As a newbie who has to submit data to the CDC’s EPHT program, I was dismayed to find that the documentation is buried under many layers of their SharePoint. It is also highly fragmented, convoluted and, in many cases, in conflict with itself.
My goal is to make this process easier and more reproducible for myself and others.
So who is this highly specific package for?
- Do you submit data to the CDC’s EPHT program?
- Do you use R, or are you interested in incorporating R into your workflow?
- Do you struggle with the CDC’s EPHT documentation and/or tooling?
- Do you want to make your submission process more reproducible?
If you answered yes to the first question and any of the others, then this package might be for you.
I think it’s important to state up front what this package doesn’t
do: it will not wrangle your data for you. There are a few
helpers, and a whole slew of checks that distiller will run on your
data and metadata to ensure that everything is reasonably close to the
correct format for submission to the CDC’s EPHT program.
distiller still expects your data to have specific variable names and
to include the required variables for each type of data. However, if you’ve
ever wondered why the EPHT requires different variable names in a
different order for the same types of data, even for the same
disease, you’ll be pleased to know that distiller takes care of the
facility-type-specific naming conventions and the ordering for you.
Users just need to bring the data; they can then spend less time
worrying about XML semantics and more time polishing their data
products.
distiller is no replacement for the CDC EPHT Test Submission
portal. However, creating the XML, shuffling files around,
dropping them into the portal and then waiting an indeterminate amount of
time for feedback eats up time and is a pain. distiller aims to
provide feedback on your data and metadata before you send it off to the
CDC. This way, you can fix any obvious issues before you sink 20+
minutes waiting to find out you forgot to replace your NAs with “U”.
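The NA-to-“U” fix mentioned above takes only a few lines of base R before you run any checks. This is a sketch, not distiller's own behavior: the toy data frame, its placeholder values, the helper name `fill_unknown()` and the choice of columns are all assumptions for illustration. Adjust `unknown_cols` to whichever fields the EPHT documentation says accept “U”.

```r
# Toy data with placeholder values (not necessarily valid EPHT codes)
toy <- data.frame(
  sex = c("M", NA, "F"),
  race = c("W", "B", NA),
  monthly_count = c(3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA with "U" in the named columns only
fill_unknown <- function(data, cols) {
  data[cols] <- lapply(data[cols], function(x) {
    x <- as.character(x)
    x[is.na(x)] <- "U"
    x
  })
  data
}

# Assumption: these are the columns where "U" means unknown
unknown_cols <- c("sex", "race")
toy <- fill_unknown(toy, unknown_cols)
```

Count columns are left alone on purpose, so a missing count still shows up as NA when you check the data.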
distiller contains the following core functions:
- check_submission(): checks your data and metadata and provides quick feedback
- make_xml_document(): creates an XML document for submission based on your data and the metadata you provide
distiller also contains functions for:
- collapsing race and ethnicity values into the CDC’s required format
- converting month integers to 0-padded character strings
- returning the proper health outcome identifier for a given content group identifier

Starting from scratch? Most of the mini-functions that make up the two core ones are exposed to the user, so you can check your work in pieces as you make progress with your data wrangling.
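To give a sense of what the month conversion involves, here is a base R equivalent, assuming the goal is simply a two-character string such as "01". `pad_month()` is a hypothetical name used for illustration; it is not the function distiller exports.

```r
# Hypothetical helper: convert month numbers to 0-padded strings
pad_month <- function(month) {
  month <- as.integer(month)
  # Refuse missing or out-of-range months instead of padding them silently
  stopifnot(!anyNA(month), all(month %in% 1:12))
  sprintf("%02d", month)
}

pad_month(c(1, 9, 12))
#> [1] "01" "09" "12"
```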
distiller works for the following content group identifiers:
- AS-HOSP
- AS-ED
- CO-HOSP
- CO-ED
- MI-HOSP
- HEAT-HOSP
- HEAT-ED
- COPD-HOSP
- COPD-ED
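If you build submissions in a loop or a pipeline, a quick guard against unsupported identifiers can save a failed run. The sketch below copies the list above into a vector; `supported_ids` and `is_supported_id()` are hypothetical names, and the exact, case-sensitive match is an assumption rather than a description of how distiller validates its input.

```r
# Content group identifiers listed as supported above
supported_ids <- c(
  "AS-HOSP", "AS-ED",
  "CO-HOSP", "CO-ED",
  "MI-HOSP",
  "HEAT-HOSP", "HEAT-ED",
  "COPD-HOSP", "COPD-ED"
)

# Hypothetical helper: TRUE for each identifier found in the list
is_supported_id <- function(content_group_id) {
  content_group_id %in% supported_ids
}

is_supported_id(c("AS-HOSP", "MI-ED"))
#> [1]  TRUE FALSE
```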
distiller expects the following variables in your data:
For every content group identifier:
- agegroup
- county
- sex
- ethnicity
- race
- health_outcome_id
- monthly_count
- month
- year
For content group identifiers CO-HOSP and CO-ED, the above plus the following:
- fire_count
- nonfire_count
- unknown_count
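While you are still wrangling, you can compare your column names against the lists above with plain base R. This is a minimal sketch: `missing_vars()` is a hypothetical helper, not part of distiller, and it only checks that the names are present, not their order or contents.

```r
# Variables required for every content group identifier
base_vars <- c(
  "agegroup", "county", "sex", "ethnicity", "race",
  "health_outcome_id", "monthly_count", "month", "year"
)

# Extra variables required for CO-HOSP and CO-ED
co_vars <- c("fire_count", "nonfire_count", "unknown_count")

# Hypothetical helper: return the required names absent from `data`
missing_vars <- function(data, content_group_id) {
  required <- base_vars
  if (content_group_id %in% c("CO-HOSP", "CO-ED")) {
    required <- c(required, co_vars)
  }
  setdiff(required, names(data))
}

# One-row toy data frame that has only the base variables
toy <- as.data.frame(
  setNames(replicate(length(base_vars), 1L, simplify = FALSE), base_vars)
)

missing_vars(toy, "AS-HOSP") # nothing missing
missing_vars(toy, "CO-ED")   # the three carbon monoxide counts are missing
```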
You can install the development version of distiller from GitHub with:
# install.packages("pak")
pak::pak("asenetcky/distiller")
Here is a basic example of how to use it:
library(distiller)
# Take your already-wrangled data
# note the specific variable names
data <-
  mtcars |>
  dplyr::rename(
    month = mpg,
    agegroup = cyl,
    county = disp,
    ethnicity = hp,
    health_outcome_id = drat,
    monthly_count = wt,
    race = qsec,
    sex = vs,
    year = am
  ) |>
  dplyr::select(-c(gear, carb))
# And your metadata
content_group_id <- "AS-HOSP"
mcn <- "1234-1234-1234-1234-1234"
jurisdiction_code <- "two_letter_code"
state_fips_code <- "1234"
submitter_email <- "submitter@email.com"
submitter_name <- "Submitter Name"
submitter_title <- "Submitter Title"
# Optionally check your submission data structure and metadata
check_submission(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
#> ℹ Checking submission metadata
#> ✔ Success: content_group_id
#> ! Warning: mcn may not have correct format
#> Troublemakers: length, format
#> ! Warning: jurisdiction_code may not have correct format
#> Troublemakers: length, format
#> ! Warning: state_fips_code may not have correct format
#> Troublemakers: length, format
#> ✔ Success: submitter_email
#> ✔ Success: submitter_name
#> ✔ Success: submitter_title
#> ℹ Checking data structure and content
#> ✔ Success: dataframe_structure
#> ✖ Danger: month does not have allowable value/s
#> Troublemakers: allowed_values
#> ✔ Success: agegroup
#> ✖ Danger: county does not have allowable value/s
#> Troublemakers: length
#> ✖ Danger: ethnicity does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: health_outcome_id does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: sex does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: year does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: race does not have allowable value/s
#> Troublemakers: allowed_values
#> ✔ Success: monthly_count
# This can also be checked with `check_first = TRUE` in `make_xml_document()`
# And then make your xml document
make_xml_document(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
#> {xml_document}
#> <HospitalizationData schemaLocation="http://www.ephtn.org/NCDM/PH/HospitalizationData ephtn-ph-HospitalizationData.xsd" xmlns="http://www.ephtn.org/NCDM/PH/HospitalizationData" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
#> [1] <Header>\n <MCN>1234-1234-1234-1234-1234</MCN>\n <JurisdictionCode>two_ ...
#> [2] <Dataset>\n <Row>\n <RowIdentifier>1</RowIdentifier>\n <AdmissionM ...