Precompute target variable values to decouple microsimulation from calibration #499

@MaxGhenis

Problem

SparseMatrixBuilder.build_matrix() currently creates a fresh Microsimulation for each of the 51 states (50 states plus DC), sets state_fips, and then calculates every target variable. This means:

  • The matrix build takes ~20 minutes on an M4 Max (the calibration training itself is fast after that)
  • Microsimulation and calibration are tightly coupled: changing calibration targets requires re-running expensive simulations
  • Adding new target variables requires the full microsimulation stack at calibration time
  • States are processed serially (one Microsimulation after another), so the build is not parallelized across states

Proposed solution

Separate the pipeline into two steps:

1. Precompute step (once, parallelizable)

For each state × household, compute all target variables and save to a single file:

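# Shape: (n_states, n_households, n_variables) or flat DataFrame
# Variables: state_income_tax, snap, health_insurance_premiums, household_count, person_count, etc.
# Sketch: all_states, stratified_cps, n_hh and target_variables are defined elsewhere.
import numpy as np
from policyengine_us import Microsimulation  # assumed import path

precomputed = {}
for state_fips in all_states:
    # Fresh simulation per state, so no cached results carry over between states
    sim = Microsimulation(dataset=stratified_cps)
    # Place every household in this state
    sim.set_input("state_fips", 2024, np.full(n_hh, state_fips))
    for var in target_variables:
        # Household-level values of this variable under this state's rules
        precomputed[(state_fips, var)] = sim.calculate(var, 2024, map_to="household").values
# Save as HDF5 or Parquet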

This step is trivially parallelizable, since the 51 simulations are independent. State is also a somewhat arbitrary chunking dimension: we just need every unique combination of geographic constraints evaluated.
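As a rough illustration of the fan-out, here is a minimal sketch using the standard-library concurrent.futures module. compute_state is a hypothetical stand-in that returns random arrays instead of running a Microsimulation, and precompute_all is likewise an illustrative name, not existing code. The demo uses a thread pool so it runs anywhere; real simulations are CPU-bound, so a ProcessPoolExecutor or separate workers would probably be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def compute_state(state_fips, n_hh, target_variables):
    """Hypothetical stand-in for one state's simulation.

    A real version would build a Microsimulation, set state_fips for all
    households, and calculate each target variable at the household level.
    """
    rng = np.random.default_rng(state_fips)
    return state_fips, {var: rng.random(n_hh) for var in target_variables}


def precompute_all(states, n_hh, target_variables, executor):
    """Submit one independent job per state and merge the results."""
    futures = [
        executor.submit(compute_state, s, n_hh, target_variables)
        for s in states
    ]
    precomputed = {}
    for future in futures:
        state_fips, values = future.result()
        for var, arr in values.items():
            precomputed[(state_fips, var)] = arr
    return precomputed


with ThreadPoolExecutor(max_workers=4) as pool:
    demo = precompute_all(
        states=[1, 2, 4],
        n_hh=5,
        target_variables=["snap", "state_income_tax"],
        executor=pool,
    )
```

Because each job only depends on its own state_fips, the same function could be handed a different executor without changing the merge logic.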

2. Matrix build step (fast, pure NumPy)

Read the precomputed file, apply constraint masks, and build the sparse matrix. No Microsimulation import is needed, so this should take seconds, not minutes.
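To make the mask-then-assemble idea concrete, here is a minimal NumPy-only sketch. The layout is an assumption for illustration: one matrix row per target, one column per (state, household) pair, and each target described by a hypothetical (state index, variable index, household mask) tuple. build_triplets is an illustrative name, not the existing SparseMatrixBuilder API.

```python
import numpy as np


def build_triplets(values, targets):
    """Build COO-style (rows, cols, data) triplets from precomputed values.

    values:  array of shape (n_states, n_households, n_variables)
    targets: list of (state_idx, var_idx, hh_mask) tuples, one per matrix row
    Column index for (state_idx, household) is state_idx * n_households + household.
    """
    _, n_hh, _ = values.shape
    rows, cols, data = [], [], []
    for row, (state_idx, var_idx, hh_mask) in enumerate(targets):
        column = values[state_idx, :, var_idx]
        # Keep households that satisfy the constraint and contribute a nonzero value
        hh = np.flatnonzero(hh_mask & (column != 0))
        rows.append(np.full(hh.size, row))
        cols.append(state_idx * n_hh + hh)
        data.append(column[hh])
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(data)


# Toy input: 2 states, 3 households, 2 variables
values = np.array([
    [[100.0, 0.0], [0.0, 50.0], [30.0, 20.0]],  # state index 0
    [[80.0, 10.0], [0.0, 0.0], [60.0, 0.0]],    # state index 1
])
all_hh = np.ones(3, dtype=bool)
targets = [
    (0, 0, all_hh),                          # variable 0 in state 0
    (1, 0, np.array([True, False, True])),   # variable 0 in state 1, households 0 and 2
    (1, 1, all_hh),                          # variable 1 in state 1
]
rows, cols, data = build_triplets(values, targets)
```

The triplets can be passed to scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n_targets, n_states * n_households)) if a SciPy sparse matrix is what the calibration code expects.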

Benefits

  • Matrix build drops from ~20 min to seconds
  • Precomputed file can be cached on HuggingFace alongside stratified_extended_cps.h5
  • Adding a calibration target over an already-precomputed variable only requires adding rows to the target_filter, with no re-simulation
  • Precompute step can run on GPU instances or be parallelized across workers
  • Clear separation of concerns: microsimulation vs. optimization
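The caching benefit depends on the precomputed file carrying its own state and variable labels, so the matrix build can look values up by name. A minimal sketch of that round trip follows. It uses NumPy's .npz format purely as a dependency-free stand-in for the HDF5 or Parquet file proposed above, writes to an in-memory buffer instead of disk, and pack_precomputed is a hypothetical helper name.

```python
import io

import numpy as np


def pack_precomputed(precomputed, states, variables):
    """Stack a {(state_fips, var): array} dict into (n_states, n_hh, n_vars)."""
    return np.stack([
        np.stack([precomputed[(s, v)] for v in variables], axis=-1)
        for s in states
    ])


states = [1, 2]
variables = ["snap", "state_income_tax"]
precomputed = {
    (1, "snap"): np.array([0.0, 120.0, 0.0]),
    (1, "state_income_tax"): np.array([500.0, 0.0, 250.0]),
    (2, "snap"): np.array([0.0, 90.0, 0.0]),
    (2, "state_income_tax"): np.array([0.0, 0.0, 0.0]),
}
cube = pack_precomputed(precomputed, states, variables)

# Write values together with their labels, then read them back
buffer = io.BytesIO()
np.savez_compressed(
    buffer,
    values=cube,
    state_fips=np.array(states),
    variables=np.array(variables),
)
buffer.seek(0)
loaded = np.load(buffer)

# Look up a variable by name, with no simulation involved
snap_idx = loaded["variables"].tolist().index("snap")
restored_snap_state_2 = loaded["values"][1, :, snap_idx]
```

Storing the labels next to the values means a new target on an existing variable is an index lookup; only a variable missing from the file would require rerunning the precompute step.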
