An R package for modeling health effects of environmental exposures, with a primary focus on groundwater contaminants, based on epidemiological studies and existing concentration models.
USGS IPDS Record Number: IP-180903
If you use geoExposeR in your research, please cite the software release:
Majumdar, S., Bartell, S. M., Lombard, M. A., Smith, R. G., and Gribble, M. O., 2026, geoExposeR: An R package for modeling health effects of environmental exposures: U.S. Geological Survey software release, https://doi.org/10.5066/P1JGUKMD
BibTeX:
@software{majumdar2026geoExposeR,
title={geoExposeR: An R package for modeling health effects of environmental exposures},
author={Majumdar, Sayantan and Bartell, Scott M and Lombard, Melissa A and Smith, Ryan G and Gribble, Matthew O},
year={2026},
publisher={U.S. Geological Survey},
note={U.S. Geological Survey Software Release},
doi={10.5066/P1JGUKMD},
url={https://doi.org/10.5066/P1JGUKMD}
}And the accompanying paper (under review at JOSS):
Majumdar, S., Bartell, S. M., Lombard, M. A., Smith, R. G., and Gribble, M. O., 2026, geoExposeR: An R package for modeling health effects of environmental exposures, Journal of Open Source Software (under review).
BibTeX:
@article{majumdar2026geoExposeRpaper,
title={geoExposeR: An R package for modeling health effects of environmental exposures},
author={Majumdar, Sayantan and Bartell, Scott M and Lombard, Melissa A and Smith, Ryan G and Gribble, Matthew O},
journal={Journal of Open Source Software},
year={2026},
note={Under review},
url={https://code.usgs.gov/envirohealth/geoExposeR}
}- Statement of Need
- Workflow
- Installation
- Quick Start
- Examples
- Package Structure
- Core Functions
- Key Features
- Documentation
- Data Requirements
- Data Preparation Workflow
- Running Tests
- Modifying the R Package
- Contributing
- Troubleshooting
- License
Environmental epidemiologists face a fundamental challenge: exposure to environmental contaminants is rarely measured directly for study participants. Instead, researchers must estimate exposures using predictive models, monitoring data, or biomarkers—each with inherent uncertainty. Failing to account for this uncertainty can lead to biased effect estimates and misleading conclusions about health risks (Majumdar and others, 2026).
This challenge is particularly acute for drinking water contaminants like arsenic, where an estimated 40 million Americans rely on unregulated private wells. Foundational studies developed sophisticated methods to address exposure uncertainty by combining U.S. Geological Survey (USGS) machine learning predictions with U.S. Environmental Protection Agency (EPA) monitoring data through population-weighted integration and multiple imputation. However, implementing these methods requires specialized expertise in geospatial analysis, statistical modeling, and R programming—presenting a significant barrier for many researchers.
geoExposeR directly addresses this need by packaging the complete analytical pipeline into accessible R functions. The package automates:
- Integration of probabilistic exposure estimates from multiple data sources
- Population-weighted combination of exposure probabilities
- Multiple imputation to propagate exposure uncertainty
- Mixed-effects regression modeling with proper clustering adjustment
- Result pooling via Rubin's Rules with appropriate variance estimation
While designed for drinking water arsenic exposure assessments, the underlying statistical framework is broadly applicable to other environmental contaminants (nitrate, uranium, manganese, PFAS) and air pollutants (PM2.5, ozone) where probabilistic exposure estimates and health outcome data are available. The package includes worked examples for nitrate (infant methemoglobinemia), uranium (kidney function), and PM2.5 (respiratory outcomes), demonstrating how the framework generalizes to other contaminants and health endpoints. See the Broader Applicability section and example scripts for guidance on adapting the package to other domains.
By lowering technical barriers to rigorous uncertainty quantification, geoExposeR promotes reproducibility and enables wider application of state-of-the-art methods in environmental epidemiology.
For a detailed description of the methodology, see Majumdar and others (2026).
The package combines two distinct types of input data using population-based weights (e.g., private well vs. public water usage) to create integrated exposure probability distributions:
Parameter Data Type Description Examples model_prob_csvModeled probabilistic estimates Multinomial probability distributions across concentration categories, typically derived from predictive models USGS machine learning predictions, satellite-derived estimates, hybrid spatial models conc_data_csvMeasured concentrations Point measurements or summary statistics of contaminant concentrations from monitoring stations or laboratory analyses EPA public water system monitoring, ground-based air quality monitors, field sampling data Any dataset that provides modeled probabilities can serve as the probability input (
model_prob_csv), and any dataset that provides measured concentrations can serve as the concentration input (conc_data_csv), regardless of the originating agency.
The figure below illustrates the complete analytical workflow implemented in geoExposeR. The pipeline integrates four input data sources through data ingestion, probability integration, multiple imputation, and statistical modeling stages to produce pooled estimates with proper uncertainty quantification.
geoExposeR analytical workflow showing the data processing pipeline from input sources through statistical modeling to pooled estimates. Packages used at each stage are indicated in brackets. Pooled estimates include regression coefficients (β), standard errors (SE), 95% confidence intervals (CI), and p-values, all computed using Rubin's Rules. Adapted from Majumdar and others (2026).
| Category | Features |
|---|---|
| Data Ingestion | Process GeoTIFF rasters, concentration measurements, health data; geographic subsetting; input validation |
| Exposure Modeling | Probability/concentration data integration; population-weighted averaging; lognormal parameter estimation |
| Statistical Analysis | Multiple imputation (MICE); mixed-effects regression; Rubin's Rules pooling |
| Quality Assurance | 100% test coverage; comprehensive input validation; reproducible workflows |
Existing methods from studies like Bulka and others (2022) and Lombard and others (2021) provide robust statistical frameworks but require complex implementation involving:
- Integration of multiple probabilistic models (modeled probabilities and measured concentrations)
- Multiple imputation techniques for uncertain exposures
- Mixed-effects regression with proper pooling of results
geoExposeR solves this by packaging these sophisticated methods into user-friendly functions. This enables researchers to:
- 🔬 Focus on science, not statistical programming
- 📊 Ensure reproducibility across studies
- ⚡ Accelerate research on environmental health effects
- 🎯 Apply best practices for uncertainty quantification
- 🗺️ Work with multiple data formats (GeoTIFF, CSV, shapefiles)
By democratizing access to these methods, geoExposeR promotes more rigorous and comparable environmental exposure research.
To build the package vignettes, which provide detailed documentation and examples, you need to have Pandoc installed. RStudio bundles a version of Pandoc, so if you are an RStudio user, you may not need to install it separately.
If you do not have Pandoc, you can download and install it from the official website: pandoc.org/installing.html.
Prerequisites: R (≥ 4.4.0) and Pandoc for building vignettes.
Install from USGS GitLab:
Note: Installation using
devtools::install_gitlab()does not currently work while the repository is private. It will work after the repository is approved and publicly available, with therefargument pointing to a tagged release version.
# Install devtools if not already installed
if (!require("devtools")) install.packages("devtools")
# Install geoExposeR from USGS GitLab (specific tag or branch)
devtools::install_gitlab(
repo = "envirohealth/geoExposeR@v1.0.0", # Replace with desired tag or branch
host = "code.usgs.gov",
build_vignettes = TRUE
)
# Load the package
library(geoExposeR)For development and testing (includes coverage tools):
# Install additional development dependencies
install.packages(c("devtools", "testthat", "covr", "DT", "htmltools"))
# Install geoExposeR from USGS GitLab (specific tag or branch)
devtools::install_gitlab(
repo = "envirohealth/geoExposeR@main", # Or use a specific tag like @v1.0.0
host = "code.usgs.gov",
build_vignettes = TRUE
)Verify installation:
# Check package version
packageVersion("geoExposeR")
# View help
?perform_sensitivity_analysisPackage Documentation Website:
The complete package documentation, including detailed function references, vignettes, and examples, is available at: https://envirohealth.code-pages.usgs.gov/geoExposeR/
This website is automatically generated and updated whenever changes are made to the main branch, ensuring you always have access to the latest documentation.
If you downloaded the repository as a ZIP file (e.g., from GitLab or GitHub), you can install the package locally:
-
Extract the ZIP file to a directory of your choice.
-
Install from the extracted folder:
# Install devtools if not already installed
if (!require("devtools")) install.packages("devtools")
# Install from the local directory
# Replace the path with the actual path to the extracted folder
devtools::install_local(
path = "/path/to/geoExposeR", # Path to the extracted package folder
build_vignettes = TRUE,
dependencies = TRUE
)
# Load the package
library(geoExposeR)Alternative: Install from the ZIP file directly (without extracting):
# Install directly from the ZIP file
devtools::install_local(
path = "/path/to/geoExposeR.zip", # Path to the downloaded ZIP file
build_vignettes = TRUE,
dependencies = TRUE
)Using RStudio:
You can also install the package using RStudio's GUI:
- Go to Tools > Install Packages...
- In the "Install from:" dropdown, select Package Archive File (.zip; .tar.gz)
- Click Browse and navigate to the downloaded ZIP or tarball file
- Click Install
Note: Building vignettes requires Pandoc. If you encounter issues, you can skip vignette building by setting
build_vignettes = FALSE.
If you want to customize this package, run the following using R after you have modified the codes.
# Set the working directory to your package directory
setwd(<path to geoExposeR>)
# Load devtools
library(devtools)
# Generate documentation
document()
# Build the package
build()
# Install the package
install()
library(geoExposeR)
# Run sensitivity analysis with your data
results <- perform_sensitivity_analysis(
ndraws = 10,
model_prob_csv = "path/to/prob_data.csv",
conc_data_csv = "path/to/conc_data.csv",
birth_data_txt = "path/to/birth_data.txt",
regression_formula = "~ ExposureLevel + MAGE_R + rural + (1|FIPS)",
output_dir = "results/",
targets = c("OEGEST", "BWT")
)
# View results
print(results)Using Raw Concentration Data
If your concentration data contains raw values instead of pre-calculated lognormal parameters, specify the concentration column and the package will automatically convert them:
results <- perform_sensitivity_analysis(
ndraws = 10,
model_prob_csv = "path/to/prob_data.csv",
conc_data_csv = "path/to/conc_raw.csv", # File with raw concentrations
conc_raw_col = "concentration", # Column with raw values
default_sdlog = 1.0, # Default sdlog value
birth_data_txt = "path/to/birth_data.txt",
regression_formula = "~ ExposureLevel + MAGE_R + rural + (1|FIPS)",
output_dir = "results/",
targets = c("OEGEST", "BWT")
)For detailed examples with dummy data, see the vignette.
The package includes multiple working examples in the examples/ folder that demonstrate different workflows for using geoExposeR.
| Example File | Description |
|---|---|
run_geoExposeR_example.R |
Basic example using pre-formatted CSV data |
run_geoExposeR_from_geotiff.R |
Complete GeoTIFF workflow from rasters to analysis |
run_geoExposeR_data_pipeline.R |
Demonstrates all data ingestion functions |
These examples demonstrate how geoExposeR can be adapted for other environmental contaminants and exposures:
| Example File | Contaminant | Health Outcome | Description |
|---|---|---|---|
run_geoExposeR_nitrate.R |
Nitrate | Methemoglobinemia | Groundwater nitrate effects on infant health |
run_geoExposeR_uranium.R |
Uranium | Kidney function (eGFR) | Groundwater uranium nephrotoxicity |
run_geoExposeR_pm25.R |
PM2.5 | Respiratory function (FEV1) | Air quality effects on lung function |
# From the geoExposeR directory
cd /path/to/geoExposeR
# Basic example (CSV workflow)
Rscript examples/run_geoExposeR_example.R
# GeoTIFF workflow example
Rscript examples/run_geoExposeR_from_geotiff.R
# Data pipeline example (all ingestion functions)
Rscript examples/run_geoExposeR_data_pipeline.R
# Broader application examples
Rscript examples/run_geoExposeR_nitrate.R
Rscript examples/run_geoExposeR_uranium.R
Rscript examples/run_geoExposeR_pm25.ROr in R/RStudio:
# Set working directory to package root
setwd("/path/to/geoExposeR")
# Run basic example
source("examples/run_geoExposeR_example.R")
# Run GeoTIFF workflow
source("examples/run_geoExposeR_from_geotiff.R")
# Run data pipeline demo
source("examples/run_geoExposeR_data_pipeline.R")
# Run broader application examples
source("examples/run_geoExposeR_nitrate.R")
source("examples/run_geoExposeR_uranium.R")
source("examples/run_geoExposeR_pm25.R")The examples use data files from the examples/input_data/ folder:
| File | Description |
|---|---|
prob_dom.csv |
Modeled multinomial probabilities of arsenic levels (50 rows) |
conc_dom.csv |
Measured arsenic concentration data (50 rows) |
Demographics_dom.txt |
Demographics data with health outcomes - SBP (50 rows) |
| File | Description |
|---|---|
prob_nitrate.csv |
Modeled nitrate probability data for infant health example (50 rows) |
conc_nitrate.csv |
Measured nitrate concentration data (50 rows) |
Health_nitrate.txt |
Infant methemoglobin and birth weight data (50 rows) |
prob_uranium.csv |
Modeled uranium probability data for kidney function example (50 rows) |
conc_uranium.csv |
Measured uranium concentration data (50 rows) |
Health_uranium.txt |
Kidney function (eGFR) and demographic data (50 rows) |
prob_pm25.csv |
Modeled PM2.5 probability estimates (50 rows) |
conc_pm25.csv |
Measured PM2.5 concentration data (50 rows) |
Health_pm25.txt |
Respiratory function (FEV1) and smoking data (50 rows) |
⚠️ Note on Example Data: All example data files are simulated dummy data created for demonstration and testing purposes only. These files contain 50 synthetic records with realistic column structures but randomly generated values. The actual research data cannot be shared publicly due to data use agreements and privacy restrictions. The dummy data preserves the structure and format of the real data to allow users to understand the workflow and test the package functionality.
Demonstrates the standard analysis workflow using pre-formatted CSV data:
- Data Loading: Using custom column names to work with original data formats
- Concentration Data Preprocessing: Converting raw concentrations to lognormal parameters
- Custom Column Mapping: Specifying non-standard column names via function parameters
- Sensitivity Analysis: Running multiple imputation with mixed-effects regression
- Result Interpretation: Understanding the pooled estimates and confidence intervals
Demonstrates processing raw GeoTIFF raster data through the complete analysis pipeline:
- Creating GeoTIFF Rasters: Generating mock probability surfaces (P(As > 5), P(As > 10))
- Creating Boundaries: Building geographic polygons for extraction
- Extracting Probabilities: Using
prepare_prob_data()to extract probabilities by geographic unit - Processing Concentration Data: Using
prepare_conc_data()to compute lognormal parameters - Formatting Health Data: Using
format_health_data()to standardize outcomes - Validating Inputs: Using
validate_prepared_inputs()for quality control - Running Analysis: Executing the full sensitivity analysis
Comprehensively demonstrates all data ingestion functions:
prepare_conc_data(): Raw vs. summary concentration input types, population weightingformat_health_data(): Identifier formatting, multiple outcome supportadd_well_percentages(): Merging private well population datasubset_by_geography(): Filtering by state FIPS codes or GEOID listsvalidate_prepared_inputs(): Input file validation
- Using
id_col,prob_cols,pop_well_colfor probability data mapping - Using
conc_mean_col,conc_sd_col,pwell_colfor concentration data mapping - Using
record_id_col,exposure_level_colfor health data mapping - Setting appropriate
conc_cutoffsfor concentration categorization - Configuring MICE imputation parameters (
mice_m,mice_maxit,mice_method) - Processing GeoTIFF rasters with
terraandsfpackages - Geographic subsetting by state or custom GEOID lists
The geoExposeR package follows the standard structure for R packages:
geoExposeR/
├── R/ # Source code
│ ├── geoExposeR.R # Main exported function
│ ├── data-loading.R # Data loading and processing
│ ├── data-ingestion.R # Raw data preparation (GeoTIFF, concentration, health)
│ ├── imputation.R # Multiple imputation functions
│ ├── regression.R # Regression analysis and pooling
│ └── utils.R # Validation and utility functions
├── man/ # Generated documentation (.Rd files)
├── examples/
│ ├── run_geoExposeR_example.R # Basic CSV workflow example
│ ├── run_geoExposeR_from_geotiff.R # GeoTIFF raster workflow
│ ├── run_geoExposeR_data_pipeline.R # Data ingestion functions demo
│ ├── run_geoExposeR_nitrate.R # Nitrate exposure example
│ ├── run_geoExposeR_uranium.R # Uranium exposure example
│ ├── run_geoExposeR_pm25.R # PM2.5 air quality example
│ ├── input_data/ # Example input data files (all dummy data*)
│ │ ├── prob_dom.csv # Sample arsenic probability data (dummy*)
│ │ ├── conc_dom.csv # Sample arsenic concentration data (dummy*)
│ │ ├── Demographics_dom.txt # Sample demographics/health data (dummy*)
│ │ ├── prob_nitrate.csv # Sample nitrate probability data (dummy*)
│ │ ├── conc_nitrate.csv # Sample nitrate concentration data (dummy*)
│ │ ├── Health_nitrate.txt # Sample infant health data (dummy*)
│ │ ├── prob_uranium.csv # Sample uranium probability data (dummy*)
│ │ ├── conc_uranium.csv # Sample uranium concentration data (dummy*)
│ │ ├── Health_uranium.txt # Sample kidney function data (dummy*)
│ │ ├── prob_pm25.csv # Sample PM2.5 probability data (dummy*)
│ │ ├── conc_pm25.csv # Sample PM2.5 concentration data (dummy*)
│ │ └── Health_pm25.txt # Sample respiratory health data (dummy*)
├── tests/
│ ├── testthat/
│ │ └── test-geoExposeR.R # Comprehensive test suite (476 tests)
│ └── testthat.R # Test runner
├── vignettes/
│ └── geoExposeR-vignette.Rmd # Package tutorial and examples
├── paper/ # JOSS paper files
│ ├── paper.md # Paper manuscript
│ ├── paper.bib # References
│ ├── generate_workflow_figure.R # Script to generate workflow diagram
│ ├── workflow.png # Workflow diagram (600 DPI)
├── docs/ # pkgdown generated website
├── .gitlab-ci.yml # GitLab CI/CD pipeline configuration
├── .gitignore # Git ignore patterns
├── .Rbuildignore # R build ignore patterns
├── _pkgdown.yml # pkgdown website configuration
├── code.json # USGS code.json metadata
├── CONTRIBUTING.md # Contribution guidelines
├── DESCRIPTION # Package metadata and dependencies
├── DISCLAIMER.md # USGS disclaimer
├── LICENSE.md # CC0 1.0 Universal license
├── NAMESPACE # Package exports and imports
├── NEWS.md # Change log and version history
└── README.md # This file
*Dummy data for demonstration; actual research data cannot be shared.
The package uses a modular architecture with functionality organized across specialized files:
geoExposeR.R: Main exported functionperform_sensitivity_analysis()data-loading.R: Data loading and probability matrix processingdata-ingestion.R: Functions to prepare data from raw sources (GeoTIFFs, concentration measurements, health data)imputation.R: Multiple imputation using MICE frameworkregression.R: Mixed-effects regression and Rubin's Rules poolingutils.R: Input validation and utility functions
This approach provides several advantages:
- Separation of concerns: Each file handles a specific domain
- Easier testing: Functions can be tested in isolation
- Better maintainability: Changes are localized to specific modules
- Clearer organization: Easy to find relevant code
The package includes comprehensive testing with 100% code coverage:
tests/testthat/test-geoExposeR.R: Main test suite with 476 tests (includes helper functions)tests/testthat.R: Standard R package test runner.gitlab-ci.yml: GitLab CI/CD pipeline with automated testing and coverage reporting
While users primarily interact with perform_sensitivity_analysis(), the package includes several internal functions organized by functionality:
Data Loading (data-loading.R):
load_and_process_exposure_data(): Orchestrates modeled probability and measured concentration data loadingload_prob_data(): Loads modeled probability dataconvert_conc_to_multinomial(): Converts concentration lognormal parameters to multinomial probabilitiescreate_weighted_prob_matrix(): Creates weighted combined probability matrixload_and_process_birth_data(): Loads and processes birth/health outcome data
Multiple Imputation (imputation.R):
impute_exposure(): Creates multiple imputed datasets with exposure levelsimpute_additional_variables(): Imputes missing covariates using MICEvalidate_imputed_datasets(): Validates imputation results
Regression Analysis (regression.R):
regression_analysis(): Performs mixed-effects regression on imputed datapool_estimates_by_term(): Pools regression estimates using Rubin's Rulespool_single_estimate(): Applies Rubin's Rules to individual parameters
Data Ingestion (data-ingestion.R):
prepare_prob_data(): Convert probability GeoTIFF rasters to CSV formatprepare_conc_data(): Process concentration data to lognormal parametersformat_health_data(): Format and validate health outcome dataadd_well_percentages(): Merge private well population percentagessubset_by_geography(): Filter data by state, region, or custom areavalidate_prepared_inputs(): Comprehensive validation of prepared files
Utilities (utils.R):
validate_all_inputs(): Input validation and error checkingformat_ids_func(): Smart identifier formatting with automatic type detection (zero-pads numeric FIPS codes, preserves alphanumeric IDs)- Additional helper functions for data validation and formatting
perform_sensitivity_analysis(): This is the main exported function of the package. It orchestrates the entire workflow from data loading through final analysis:
- Data Loading: Loads and processes modeled probability data, measured concentration parameters, and birth/health outcome data
- Probability Integration: Combines probability and concentration models using weighted averages based on private well usage
- Multiple Imputation: Creates multiple datasets with probabilistically assigned exposure levels
- Covariate Imputation: Optionally imputes missing values in additional covariates using MICE
- Statistical Analysis: Fits mixed-effects regression models to assess exposure-outcome relationships
- Results Pooling: Applies Rubin's Rules to pool results across imputed datasets
- Output Generation: Saves results and returns structured analysis output
The function is designed to handle the complex statistical methodology while providing a simple, user-friendly interface that requires minimal statistical programming expertise.
The package provides functions to help prepare input data from raw sources:
prepare_prob_data(): Extracts probabilities from GeoTIFF rasters and aggregates by geographic unit.
library(terra)
library(sf)
prob_data <- prepare_prob_data(
prob_rasters = list(
gt5 = "path/to/prob_gt5.tif",
gt10 = "path/to/prob_gt10.tif"
),
boundaries = county_boundaries,
id_col = "GEOID",
extraction_method = "mean",
cutoffs = c(5, 10),
output_file = "prob_data_output.csv"
)prepare_conc_data(): Processes concentration measurements and converts to lognormal parameters.
conc_data <- prepare_conc_data(
conc_data = raw_measurements,
id_col = "county_fips",
concentration_col = "arsenic_ug_L",
population_col = "pop_served",
input_type = "raw",
output_file = "conc_lognormal.csv"
)format_health_data(): Formats and validates health outcome data.
health_data <- format_health_data(
health_data = raw_health_data,
id_col = "FIPS",
outcome_cols = c("BWT", "OEGEST"),
covariate_cols = c("MAGE", "rural"),
output_file = "health_outcomes.txt"
)subset_by_geography(): Filters data to specific states or regions.
# Subset to California
ca_data <- subset_by_geography(
data = prob_data,
state_fips = "06"
)
# Subset to specific counties
selected <- subset_by_geography(
data = prob_data,
geoid_list = c("06001", "06003", "06005")
)validate_prepared_inputs(): Validates all prepared input files before analysis.
validation <- validate_prepared_inputs(
prob_file = "prob_data_output.csv",
conc_file = "conc_lognormal.csv",
health_file = "health_outcomes.txt"
)
if (validation$valid) {
message("All files ready for analysis!")
}While geoExposeR was designed for arsenic exposure assessment, the underlying statistical framework is fully adaptable to other environmental contaminants and exposures. The package's methodology—combining probabilistic exposure estimates with health outcome data through multiple imputation—applies to any exposure assessment problem with similar data structures.
| Domain | Contaminants | Example Health Outcomes |
|---|---|---|
| Groundwater | Arsenic, Nitrate, Uranium, Manganese, PFAS | Birth outcomes, kidney function, neurological effects |
| Air Quality | PM2.5, Ozone, Wildfire smoke, NO₂ | Respiratory function, cardiovascular disease, mortality |
| Soil/Dust | Lead, Heavy metals | Blood lead levels, cognitive development |
The package parameters map naturally to any two probability/concentration data sources:
| Package Parameter | Arsenic Example | Nitrate Example | PM2.5 Example |
|---|---|---|---|
model_prob_csv |
Machine learning predictions | Nitrate model predictions | Satellite/hybrid estimates |
conc_data_csv |
Public water monitoring data | PWS nitrate measurements | Ground monitor data |
conc_cutoffs |
c(5, 10) µg/L |
c(5, 10) mg/L NO₃-N |
c(9, 35) µg/m³ |
| Health outcome | Birth weight, SBP | Methemoglobin levels | FEV1 % predicted |
# Nitrate exposure analysis for infant methemoglobinemia
results <- perform_sensitivity_analysis(
ndraws = 10,
model_prob_csv = "nitrate_model_probabilities.csv",
conc_data_csv = "nitrate_pws_measurements.csv",
birth_data_txt = "infant_health_data.txt",
# Nitrate-specific cutoffs (mg/L NO3-N)
# <5 (safe), 5-10 (elevated), >=10 (exceeds MCL)
conc_cutoffs = c(5, 10),
# Auto-convert raw concentrations to lognormal parameters
conc_raw_col = "nitrate_concentration",
default_sdlog = 0.8,
regression_formula = "~ as.factor(ExposureLevel) + infant_age + (1|county)",
targets = c("methemoglobin_pct")
)# PM2.5 exposure analysis for respiratory function
results <- perform_sensitivity_analysis(
ndraws = 10,
# Satellite/model predictions as probability data
model_prob_csv = "pm25_satellite_probabilities.csv",
# Ground monitors as concentration data
conc_data_csv = "pm25_ground_monitors.csv",
birth_data_txt = "respiratory_health_data.txt",
# PM2.5-specific cutoffs (µg/m³)
# <9 (meets EPA annual standard), 9-35 (moderate), >=35 (exceeds 24-hr standard)
conc_cutoffs = c(9, 35),
conc_raw_col = "pm25_annual_mean",
default_sdlog = 0.5,
regression_formula = "~ as.factor(ExposureLevel) + age + (1|smoking_status)",
targets = c("FEV1_pct_predicted")
)- Prepare probability data as multinomial probabilities across concentration categories (
model_prob_csv) - Prepare concentration data with raw measurements or lognormal parameters (
conc_data_csv) - Set appropriate cutoffs via
conc_cutoffsparameter based on health-relevant thresholds - Adjust regression formula for your health outcome and covariates
- Run the analysis — the statistical methodology remains identical
See the examples/ folder for complete working examples demonstrating nitrate, uranium, and PM2.5 analyses.
- ✅ Automated workflow: Single function handles entire analysis pipeline
- ✅ Multiple imputation: Accounts for uncertainty in exposure assignment
- ✅ Flexible modeling: Supports custom regression formulas and multiple outcomes
- ✅ Robust statistics: Implements Rubin's Rules for proper inference pooling
- ✅ Missing data handling: Optional MICE imputation for covariates
- ✅ GeoTIFF processing: Extract probabilities from raster data (
prepare_prob_data()) - ✅ Concentration data conversion: Transform measurements to lognormal parameters (
prepare_conc_data()) - ✅ Health data formatting: Standardize health outcome data formats (
format_health_data()) - ✅ Geographic subsetting: Filter data by state, region, or custom area (
subset_by_geography()) - ✅ Well data integration: Merge private well population percentages (
add_well_percentages()) - ✅ Input validation: Comprehensive validation of prepared files (
validate_prepared_inputs())
- ✅ Multi-source integration: Combines probability and concentration models with population weighting
- ✅ Spatial aggregation: Aggregate raster data by county or custom boundaries
- ✅ Population weighting: Weight exposures by private vs. public well usage
- ✅ Extensive testing: 100% code coverage with 476 automated tests
- ✅ Reproducible results: Seed control and comprehensive output saving
- ✅ CI/CD pipeline: Automated testing on multiple platforms (macOS, Linux, Windows)
- ✅ Documentation: Complete pkgdown website with examples and vignettes
📖 Complete documentation website: https://envirohealth.code-pages.usgs.gov/geoExposeR/
The documentation website includes:
- Function Reference: Detailed documentation for all package functions
- Vignettes: Step-by-step tutorials with working examples
- Getting Started Guide: Quick introduction to package usage
- Methodology: Statistical background and implementation details
- FAQ: Common questions and troubleshooting
# View package help
help(package = "geoExposeR")
# View main function documentation
?perform_sensitivity_analysis
# Browse vignettes
browseVignettes("geoExposeR")The package requires three input files:
- Modeled Probability Data (
model_prob_csv): CSV with concentration category probabilities and geographic/participant identifiers - Measured Concentration Data (
conc_data_csv): CSV with lognormal distribution parameters - Health Outcome Data (
birth_data_txt): Text file with health outcomes and matching identifiers
Flexible Identifier Support: The package works with any identifier format—FIPS codes, census tract IDs, ZIP codes, or participant-specific IDs. The only requirement is that identifiers match between your datasets.
Smart ID Formatting: The package automatically detects identifier types and applies appropriate formatting (enabled by default via format_ids = TRUE):
| ID Type | Example Input | Formatted Output | Behavior |
|---|---|---|---|
| Numeric FIPS (1-5 digits) | 1, 6001 |
"00001", "06001" |
Zero-padded to 5 digits |
| Census tracts (11 digits) | 6001001234 |
"6001001234" |
Converted to character only |
| Alphanumeric IDs | "PTD001", "CA-01" |
"PTD001", "CA-01" |
Preserved unchanged |
Set format_ids = FALSE to skip formatting and use identifiers as-is.
The probability data contains multinomial probabilities for concentration categories (e.g., derived from machine learning models such as Lombard, 2021).
| Column Name | Description | Required |
|---|---|---|
GEOID10 (or custom via id_col) |
Geographic identifier (FIPS, census tract, ZIP, etc.) | Yes |
prob_C1 (or custom via prob_cols) |
Probability of concentration < cutoff₁ (Category 1) | Yes* |
prob_C2 (or custom via prob_cols) |
Probability of cutoff₁ ≤ concentration < cutoff₂ (Category 2) | Yes* |
prob_C3 (or custom via prob_cols) |
Probability of concentration ≥ cutoff₂ (Category 3) | Yes* |
Wells_2010 (or custom via pop_well_col) |
Population using private wells in the geographic unit | Yes |
*Note on column names: The prob_C1, prob_C2, prob_C3 column names are package defaults. You can use custom column names by specifying them via the prob_cols parameter:
# Using custom column names
results <- perform_sensitivity_analysis(
prob_cols = c("prob_low", "prob_med", "prob_high"), # Your column names
# ... other parameters
)Category interpretation:
prob_C1(or first column): P(concentration < cutoff₁) — Low exposureprob_C2(or second column): P(cutoff₁ ≤ concentration < cutoff₂) — Moderate exposureprob_C3(or third column): P(concentration ≥ cutoff₂) — High exposure
The probabilities must sum to 1.0 for each row.
The concentration data provides measured contaminant concentration parameters.
| Column Name | Description | Required |
|---|---|---|
GEOID10 (or matching id_col) |
Geographic identifier matching probability data | Yes |
conc_meanlog (or custom via conc_mean_col) |
Log-scale mean of concentration | Yes** |
conc_sdlog (or custom via conc_sd_col) |
Log-scale standard deviation | No (uses default_sdlog if missing) |
PWELL_private_pct (or custom via pwell_col) |
Percentage of population using private wells (0-100) | Yes |
**Alternative: If you have raw concentrations instead of lognormal parameters, specify the concentration column via conc_raw_col and geoExposeR will automatically convert using: meanlog = log(concentration + 0.1).
The health data file can use any column names as long as they are correctly referenced in the regression_formula and other parameters. Column names are user-defined and flexible.
| Parameter | Default Column Name | Description | Can Be Customized? |
|---|---|---|---|
record_id_col |
"FIPS" |
Geographic/participant identifier matching exposure data | Yes |
targets |
c("OEGEST", "BWT") |
Health outcome columns for regression | Yes |
| Covariates | (user-defined) | Any columns referenced in regression_formula |
Yes |
Example with custom column names:
# Your health data has columns: participant_id, birthweight_g, gest_weeks, mom_age, smoking
results <- perform_sensitivity_analysis(
record_id_col = "participant_id", # Your ID column
targets = c("birthweight_g", "gest_weeks"), # Your outcome columns
regression_formula = "~ as.factor(ExposureLevel) + mom_age + smoking + (1|participant_id)",
# ... other parameters
)Key points:
- The
regression_formulamust reference actual column names in your health data - The identifier column specified by
record_id_colmust have matching values in the exposure data - All covariates in the formula must exist in the health data file
This section describes how to prepare input files for geoExposeR from raw data sources. This workflow is relevant for users who want to analyze environmental exposure for specific states, counties, aquifers, or custom geographic areas.
| Data Type | Primary Source | Format | Description |
|---|---|---|---|
| USGS Arsenic Probabilities | USGS ScienceBase | GeoTIFF | Predicted probabilities of arsenic exceeding thresholds in private wells |
| Domestic Well Locations | USGS ScienceBase | Various | Domestic well locations and populations served for 2000 and 2010 |
| EPA Public Water Data | Columbia University Drinking Water Dashboard (Ravalli and others, 2022) | Various | Arsenic concentrations in public water systems |
| Health Outcomes | CDC, State Health Departments, Research Data | CSV/TXT | Health outcomes linked to any matching identifier |
| Population Data | U.S. Census Bureau | Various | County-level population and water source statistics |
This step converts modeled probability rasters (e.g., GeoTIFF predictions) into the CSV format required by geoExposeR. The prepare_prob_data() function automates this process.
For arsenic, download probability GeoTIFFs from USGS ScienceBase:
Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.
For other contaminants, any model-predicted probability rasters can be used (see Terminology).
The prepare_prob_data() function handles raster extraction, conversion from exceedance probabilities to multinomial probabilities, and output formatting:
library(terra)
library(sf)
# Load probability rasters and geographic boundaries
prob_rasters <- list(
gt5 = rast("path/to/prob_gt5.tif"), # P(concentration > 5)
gt10 = rast("path/to/prob_gt10.tif") # P(concentration > 10)
)
counties <- st_read("path/to/county_boundaries.shp")
# Extract and convert probabilities
prob_data <- prepare_prob_data(
prob_rasters = prob_rasters,
boundaries = counties,
id_col = "GEOID",
extraction_method = "mean", # "mean" for polygon averages, "centroid" for point extraction
cutoffs = c(5, 10),
output_file = "prob_data_output.csv"
)The function automatically:
- Extracts raster values for each geographic unit
- Converts exceedance probabilities to multinomial categories (e.g., P(As<5), P(5≤As<10), P(As≥10))
- Normalizes probabilities to sum to 1.0
- Saves the formatted output CSV
See run_geoExposeR_from_geotiff.R in the examples/ folder for a complete working example.
Use subset_by_geography() to filter data to specific states, regions, or custom areas:
# Filter to specific state (e.g., California)
ca_data <- subset_by_geography(data = prob_data, state_fips = "06")
# Filter to specific counties
selected <- subset_by_geography(
data = prob_data,
geoid_list = c("06001", "06003", "06005")
)This step processes measured concentration data (e.g., monitoring data from public water systems) into the format required by geoExposeR. The prepare_conc_data() function automates this process.
For arsenic, download monitoring data from:
- Columbia University Drinking Water Dashboard (Ravalli and others, 2022)
- EPA SDWIS Federal Reports
- EPA Envirofacts
- State environmental agency databases
For other contaminants, any measured concentration dataset can be used (see Terminology).
The prepare_conc_data() function handles aggregation, lognormal parameter estimation, and output formatting:
# Option A: From raw individual measurements
conc_data <- prepare_conc_data(
conc_data = raw_measurements,
id_col = "county_fips",
concentration_col = "arsenic_ug_L",
population_col = "pop_served",
input_type = "raw", # Calculates population-weighted lognormal params
output_file = "conc_lognormal.csv"
)
# Option B: From pre-computed summary statistics
conc_data <- prepare_conc_data(
conc_data = summary_stats,
id_col = "county_fips",
mean_col = "mean_concentration",
sd_col = "sd_concentration",
input_type = "summary",
output_file = "conc_lognormal.csv"
)See run_geoExposeR_data_pipeline.R in the examples/ folder for a complete working example.
Alternatively, you can skip manual preprocessing entirely. If your measured concentration data contains raw values, geoExposeR can automatically convert them to lognormal parameters at analysis time:
# Specify the concentration column - no preprocessing needed
results <- perform_sensitivity_analysis(
# ... other parameters ...
conc_data_csv = "conc_raw_data.csv", # Original file with concentrations
conc_raw_col = "arsenic_ppb", # Column with raw concentration values
default_sdlog = 1.0 # Default sdlog (optional, defaults to 1.0)
)The package automatically calculates:
meanlog = log(concentration + 0.1)(offset avoids log(0))sdlog=default_sdlog(or fromconc_sd_colif that column exists)
Health outcome data should include individual or aggregated health records with geographic identifiers.
| Source | Description | Geographic Level |
|---|---|---|
| CDC WONDER | Mortality and natality data | County |
| State Vital Records | Birth and death certificates | County/Tract |
| BRFSS | Behavioral risk factors | State/County |
| NHANES | National health survey | National (linkable) |
| Research Cohorts | Study-specific health data | Various |
# Health data should have:
# 1. Geographic or participant identifier (must match exposure data)
# 2. Health outcome variable(s)
# 3. Relevant covariates
# Example with county FIPS codes:
health_data <- data.frame(
FIPS = c("06001", "06003", ...), # 5-digit county FIPS
BWT = c(3200, 3150, ...), # Birth weight (grams)
OEGEST = c(39.2, 38.8, ...), # Gestational age (weeks)
MAGE = c(28, 32, ...), # Maternal age
rural = c(0, 1, ...), # Urban/rural indicator
# Additional covariates as needed
smoking = c(0, 0, ...),
education = c(3, 4, ...)
)
# Example with participant IDs (individual-level analysis):
individual_data <- data.frame(
participant_id = c("P001", "P002", ...), # Custom participant IDs
BWT = c(3200, 3150, ...),
arsenic_exposure = c(4.2, 8.1, ...), # Individual measurements
age = c(28, 32, ...)
)
# Ensure identifiers are properly formatted
health_data$FIPS <- sprintf("%05d", as.integer(health_data$FIPS))
write.table(health_data, "health_outcomes.txt", sep = "\t", row.names = FALSE)Private well usage percentages are critical for weighting probability vs. concentration data.
Data Citation:
Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.: datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.
library(tidycensus)
# Get American Community Survey data on water source
# Table B25040: House Heating Fuel (proxy) or custom tabulations
acs_data <- get_acs(
geography = "county",
variables = c(
total_housing = "B25001_001",
# Note: Direct well water data may require special tabulations
),
year = 2020
)
# Alternative: Use estimates of private well usage
# Available from EPA, USGS, or state environmental agencies# Combine population data with probability and concentration datasets
final_prob <- prob_data %>%
left_join(population_data, by = "GEOID10") %>%
mutate(Wells_2010 = total_pop * private_well_pct)
final_conc <- conc_data %>%
left_join(population_data, by = "county_fips") %>%
mutate(PWELL_private_pct = private_well_pct * 100)Before running geoExposeR, verify your prepared data:
-
Probability Data (
model_prob_csv):- Probability columns sum to approximately 1.0 for each row
- No negative probabilities
- Geographic IDs match between all datasets
- Population weights are non-negative
-
Concentration Data (
conc_data_csv):-
meanlogvalues are reasonable (typically -2 to 4 for µg/L) -
sdlogvalues are positive (typically 0.5 to 2.0) - Private well percentages are between 0 and 100
-
-
Health Data:
- FIPS codes are properly formatted (5 digits)
- Outcome variables have appropriate ranges
- Missing values are coded consistently (NA)
Here's a complete example for preparing data for a single state:
# === Complete workflow for California ===
library(terra)
library(sf)
library(dplyr)
library(tidycensus)
# 1. Load probability rasters
prob_rasters <- c(
rast("as_prob_gt1.tif"),
rast("as_prob_gt5.tif"),
rast("as_prob_gt10.tif")
)
# 2. Get California county boundaries
ca_counties <- counties(state = "CA", cb = TRUE, year = 2020)
# 3. Extract probabilities and format for geoExposeR
prob_ca <- ca_counties %>%
mutate(
p_gt1 = exact_extract(prob_rasters[[1]], ., "mean"),
p_gt5 = exact_extract(prob_rasters[[2]], ., "mean"),
p_gt10 = exact_extract(prob_rasters[[3]], ., "mean"),
prob_C1 = 1 - p_gt5,
prob_C2 = p_gt5 - p_gt10,
prob_C3 = p_gt10
) %>%
st_drop_geometry() %>%
select(GEOID10 = GEOID, prob_C1, prob_C2, prob_C3)
# 4. Add well population (example - replace with actual data)
prob_ca$Wells_2010 <- runif(nrow(prob_ca), 1000, 50000)
# 5. Prepare concentration data (example structure)
conc_ca <- data.frame(
GEOID10 = prob_ca$GEOID10,
conc_meanlog = rnorm(nrow(prob_ca), 1, 0.5),
conc_sdlog = rep(1.0, nrow(prob_ca)),
PWELL_private_pct = runif(nrow(prob_ca), 5, 30)
)
# 6. Save prepared data
write.csv(prob_ca, "california_prob_data.csv", row.names = FALSE)
write.csv(conc_ca, "california_conc_data.csv", row.names = FALSE)
# 7. Run geoExposeR
library(geoExposeR)
results <- perform_sensitivity_analysis(
ndraws = 10,
model_prob_csv = "california_prob_data.csv",
conc_data_csv = "california_conc_data.csv",
birth_data_txt = "california_health_outcomes.txt",
regression_formula = "~ as.factor(ExposureLevel) + covariates + (1|FIPS)",
output_dir = "results/",
targets = c("health_outcome")
)- USGS Water Resources: https://www.usgs.gov/mission-areas/water-resources
- EPA Drinking Water Data: https://www.epa.gov/ground-water-and-drinking-water
- Census Data API: https://www.census.gov/data/developers.html
- tidycensus R Package: https://walker-data.com/tidycensus/
- terra R Package: https://rspatial.github.io/terra/
- exactextractr R Package: https://isciences.gitlab.io/exactextractr/
The package includes comprehensive tests to ensure reliability and correctness. All tests use synthetic dummy data and run quickly for efficient development and CI/CD workflows.
Run all tests:
cd /path/to/geoExposeR
Rscript -e "devtools::test()"Run specific test file:
Rscript -e "testthat::test_file('tests/testthat/test-geoExposeR.R')"Run tests during development (faster - no package rebuild):
Rscript -e "devtools::load_all(); devtools::test()"Run tests with clean output (suppress MICE warnings):
Rscript -e "suppressWarnings(devtools::test())"Check package integrity:
Rscript -e "devtools::check()"Interactive testing in R/RStudio:
# Install coverage dependencies if needed
if (!require("DT")) install.packages("DT")
if (!require("htmltools")) install.packages("htmltools")
# Load package for development
devtools::load_all()
# Run all tests
devtools::test()
# Run specific test with detailed output
testthat::test_file("tests/testthat/test-geoExposeR.R", reporter = "progress")
# Check test coverage (requires covr, DT and htmltools)
covr::report(covr::package_coverage())Tests are configured for speed and reliability:
- Fast execution: Uses
ndraws = 2for quick multiple imputation - Synthetic data: All tests use generated dummy data, no external dependencies
- Comprehensive coverage: Tests data loading, imputation, regression, and output generation
- Warning suppression: Expected MICE convergence warnings are suppressed for clean output
- Automatic cleanup: Temporary files are automatically removed after each test
A successful test run will show:
✔ | F W S OK | Context
⠹ | 3 | Test perform_sensitivity_analysis with comprehensive coverage
--- Pooled Analysis Results ---
$BWT
term q.mi se.mi statistic conf.low conf.high p.value
1 5-10 14.641932 39.77041 0.3681614 -72.92239 102.2063 0.7197590
2 10+ 7.061721 44.69140 0.1580107 -89.74262 103.8661 0.8769384
$OEGEST
term q.mi se.mi statistic conf.low conf.high p.value
1 5-10 -0.02647891 0.1420687 -0.1863810 -0.3189409 0.2659831 0.8536366
2 10+ -0.02328979 0.1665429 -0.1398426 -0.3652842 0.3187046 0.8898402
-----------------------------
[1] "Checking results for target: BWT"
[1] "Checking results for target: OEGEST"
✔ | 476 | Test perform_sensitivity_analysis with comprehensive coverage [11.4s]
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 11.4 s
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 476 ]
The package includes automated testing via GitLab CI/CD:
- R CMD check: Validates package structure and dependencies
- Multiple R versions: Tests on R 4.4+ across different operating systems
- Dependency validation: Ensures all required packages are properly declared
- Documentation checks: Validates roxygen2 documentation completeness
The package maintains 100% test coverage across all source files:
| File | Coverage |
|---|---|
| R/data-ingestion.R | 100.00% |
| R/data-loading.R | 100.00% |
| R/geoExposeR.R | 100.00% |
| R/imputation.R | 100.00% |
| R/regression.R | 100.00% |
| R/utils.R | 100.00% |
| Overall | 100.00% |
Generate coverage report locally:
# Install coverage dependencies if needed
if (!require("covr")) install.packages("covr")
if (!require("DT")) install.packages("DT")
if (!require("htmltools")) install.packages("htmltools")
# Generate coverage report
library(covr)
cov <- package_coverage()
print(cov)
# Generate HTML report
report(cov, file = "coverage_report.html")If tests fail with data.table errors:
# Ensure data.table is properly loaded
Rscript -e "library(data.table); devtools::test()"If MICE warnings are excessive:
# Run with warning suppression
Rscript -e "suppressWarnings(devtools::test())"If tests timeout:
# Check system resources and reduce ndraws in test files if needed
# Default test configuration uses ndraws = 2 for speed
# For very slow systems, you can temporarily modify test parameters:
# Edit tests/testthat/test-geoExposeR.R and change ndraws = 2 to ndraws = 1
# Or run tests with extended timeout
Rscript -e "options(testthat.default_timeout = 600); devtools::test()"
# Skip slow integration tests and run only unit tests
Rscript -e "devtools::test(filter = 'unit')"Memory issues:
# Clear R environment and restart
Rscript -e "rm(list=ls()); gc(); devtools::test()"
# Run tests with reduced memory usage
Rscript -e "options(testthat.progress.max_fails = 10); devtools::test()"When developing or modifying the package:
- Run tests frequently during development
- Add new tests for new functionality
- Use small datasets in tests for speed (ndraws = 2)
- Test edge cases such as missing data or unusual parameter values
- Validate statistical correctness of pooled results
We welcome contributions! Please see our contribution guidelines for details.
Ways to contribute:
- 🐛 Report bugs via GitLab issues
- 💡 Suggest features or improvements
- 📖 Improve documentation
- 🧪 Add tests or examples
- 🔧 Submit merge requests
Development setup:
git clone https://code.usgs.gov/envirohealth/geoExposeR.git
cd geoExposeR
Rscript -e "devtools::load_all(); devtools::test()"Some users may encounter missing dependency errors when installing or using geoExposeR in RStudio, even after successful installation. This typically manifests as errors like:
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called 'R6'
Common missing packages:
- R6
- digest
- purrr
- vctrs
- glue
Solution: Install these core dependencies manually before installing geoExposeR:
# Install missing dependencies
install.packages(c("R6", "digest", "purrr", "vctrs", "glue"))
# Then install geoExposeR from USGS GitLab
devtools::install_gitlab(
repo = "envirohealth/geoExposeR@v1.0.0",
host = "code.usgs.gov"
)Why this happens: This issue typically occurs when:
- RStudio's package installation doesn't properly resolve all transitive dependencies
- System-level R and RStudio R installations differ
- Package library paths are misconfigured
- Previous incomplete installations left corrupted package states
Alternative solutions:
-
Restart R session and try again:
# In RStudio: Session -> Restart R .rs.restartR() -
Update all packages first:
update.packages(ask = FALSE)
-
Install from a clean state:
# Remove any partial installations remove.packages("geoExposeR") # Clear package cache .libPaths() # Check library paths # Reinstall dependencies and package install.packages(c("devtools", "R6", "digest", "purrr", "vctrs", "glue")) devtools::install_gitlab( repo = "envirohealth/geoExposeR@v1.0.0", host = "code.usgs.gov" )
-
Check R version compatibility:
# Ensure you're running R >= 4.4.0 R.version.string
If issues persist, please report them with your R version and sessionInfo().
This work was supported by the National Heart, Lung, and Blood Institute (R21HL159574) and funding from the United States Geological Survey’s John Wesley Powell Center for Analysis and Synthesis.
This project is in the public domain under the CC0 1.0 Universal license - see the LICENSE file for details.
Version: 1.0.0 | Status: Under active development
Bulka, C.M., Bryan, M.S., Lombard, M.A., Bartell, S.M., Jones, D.K., Bradley, P.M., Vieira, V.M., Silverman, D.T., Focazio, M., Toccalino, P.L., Daniel, J., Backer, L.C., Ayotte, J.D., Gribble, M.O., and Argos, M., 2022, Arsenic in private well water and birth outcomes in the United States: Environment International, v. 163, article 107176, https://doi.org/10.1016/j.envint.2022.107176.
Lombard, M.A., Bryan, M.S., Jones, D.K., Bulka, C., Bradley, P.M., Backer, L.C., Focazio, M.J., Silverman, D.T., Toccalino, P., Argos, M., Gribble, M.O., and Ayotte, J.D., 2021, Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies: Environmental Science & Technology, v. 55, no. 8, p. 5012–5023, https://doi.org/10.1021/acs.est.0c05239.
Nigra, A. E., Bloomquist, T. R., Rajeev, T., Burjak, M., Casey, J. A., Goin, D. E., Herbstman, J. B., Ornelas Van Horne, Y., Wylie, B. J., Cerna-Turoff, I., Braun, J. M., McArthur, K. L., Karagas, M. R., Ames, J. L., Sherris, A. R., Bulka, C. M., Padula, A. M., Howe, C. G., Fry, R. C., … Consortium, E. C., 2025, Public Water Arsenic and Birth Outcomes in the Environmental Influences on Child Health Outcomes Cohort, JAMA Network Open, v. 8, no. 6, p. e2514084–e2514084. https://doi.org/10.1001/jamanetworkopen.2025.14084
Angley, M., Zhang, Y., Nigra, A. E., Lombard, M. A., Gribble, M. O., Lu, L., Unverzagt, F. W., McClure, L. A., Judd, S. E., Cushman, M., Brockman, J., and Kahe, K., 2026, Drinking water arsenic, urinary arsenic biomarkers, and cognitive impairment in the REGARDS Study, Environmental Research, 123768, https://doi.org/10.1016/j.envres.2026.123768.
Rubin, D.B., 1987, Multiple imputation for nonresponse in surveys: New York, John Wiley & Sons, https://doi.org/10.1002/9780470316696.
van Buuren, S., and Groothuis-Oudshoorn, K., 2011, mice—Multivariate imputation by chained equations in R: Journal of Statistical Software, v. 45, no. 3, p. 1–67, https://doi.org/10.18637/jss.v045.i03.
Honaker, J., King, G., and Blackwell, M., 2011, Amelia II—A program for missing data: Journal of Statistical Software, v. 45, no. 7, p. 1–47, https://doi.org/10.18637/jss.v045.i07.
Bates, D., Mächler, M., Bolker, B., and Walker, S., 2015, Fitting linear mixed-effects models using lme4: Journal of Statistical Software, v. 67, no. 1, p. 1–48, https://doi.org/10.18637/jss.v067.i01.
Lombard, M.A., 2021, Data used to model and map arsenic concentration exceedances in private wells throughout the conterminous United States for human health studies: U.S. Geological Survey data release, https://doi.org/10.5066/P90RBJXS.
Johnson, T.D., and Belitz, K., 2019, Domestic well locations and populations served in the contiguous U.S.—Datasets for decadal years 2000 and 2010: U.S. Geological Survey data release, https://doi.org/10.5066/P9FSLU3B.
DeSimone, L.A., McMahon, P.B., and Rosen, M.R., 2014, The quality of our Nation's waters—Water quality in principal aquifers of the United States, 1991–2010: U.S. Geological Survey Circular 1360, 151 p., https://pubs.usgs.gov/circ/1360/.
Ravalli, F., Yu, Y., Bostick, B. C., Chillrud, S. N., Schilling, K., Basu, A., Navas-Acien, A., and Nigra, A. E., 2022, Sociodemographic inequalities in uranium and other metals in community water systems across the USA, 2006-11: a cross-sectional study: a cross-sectional study. The Lancet Planetary Health, 6(4), e320–e330, https://doi.org/10.1016/S2542-5196(22)00043-2
R Core Team, 2024, R: A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, https://www.R-project.org/.
Wickham, H., and Bryan, J., 2023, R packages (2nd ed.): Sebastopol, California, O'Reilly Media, https://r-pkgs.org/.
Barrett, T., Dowle, M., Srinivasan, A., Gorecki, J., Chirico, M., Hocking, T., Schwendinger, B., and Krylov, I., 2025, data.table—Extension of data.frame: R package version 1.17.99, https://r-datatable.com.
Hijmans, R.J., 2024, terra—Spatial data analysis: R package version 1.7-78, https://CRAN.R-project.org/package=terra.
Pebesma, E., and Bivand, R., 2024, sf—Simple features for R: R package version 1.0-16, https://CRAN.R-project.org/package=sf.
Baston, D., 2024, exactextractr—Fast extraction from raster datasets using polygons: R package version 0.10.0, https://CRAN.R-project.org/package=exactextractr.
