Problem
SparseMatrixBuilder.build_matrix() currently creates a fresh Microsimulation for each of the 51 states, sets state_fips, then calculates every target variable. This means:
- Matrix build takes ~20 minutes on M4 Max (the calibration training itself is fast after that)
- Microsimulation and calibration are tightly coupled — changing calibration targets requires re-running expensive simulations
- Adding new target variables requires the full microsimulation stack at calibration time
- Cannot parallelize across states (serial Microsimulation creation)
Proposed solution
Separate the pipeline into two steps:
1. Precompute step (once, parallelizable)
For each state × household, compute all target variables and save to a single file:
    import numpy as np

    # Microsimulation, stratified_cps, all_states, n_hh and target_variables
    # are assumed to be defined elsewhere in the calibration code.
    # Shape: (n_states, n_households, n_variables) or flat DataFrame
    # Variables: state_income_tax, snap, health_insurance_premiums, household_count, person_count, etc.
    precomputed = {}
    for state_fips in all_states:
        sim = Microsimulation(dataset=stratified_cps)
        # Assign every household to this state before calculating
        sim.set_input("state_fips", 2024, np.full(n_hh, state_fips))
        for var in target_variables:
            precomputed[(state_fips, var)] = sim.calculate(var, 2024, map_to="household").values
    # Save as HDF5 or parquet

This can be trivially parallelized (51 independent simulations). State is also somewhat arbitrary as a chunking dimension: we just need every unique combination of geographic constraints evaluated.
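A rough sketch of how the parallel precompute could look. `simulate_state` is a hypothetical stand-in that returns fake values so the example runs without the microsimulation stack; a real version would create the Microsimulation, set state_fips, and call `sim.calculate` as above. The `.npz` output is only a placeholder for the HDF5/parquet file, and the variable list and file name are assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

TARGET_VARIABLES = ["state_income_tax", "snap", "household_count"]


def simulate_state(state_fips, n_households, variables):
    """Hypothetical stand-in for one per-state Microsimulation run.

    Returns an (n_households, n_variables) array of fake values.
    """
    rng = np.random.default_rng(state_fips)  # deterministic per state
    return rng.random((n_households, len(variables)))


def precompute(state_fips_list, n_households, variables, max_workers=4):
    """Run the independent per-state jobs concurrently and stack them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so axis 0 lines up with state_fips_list.
        per_state = list(
            pool.map(
                lambda fips: simulate_state(fips, n_households, variables),
                state_fips_list,
            )
        )
    return np.stack(per_state)  # (n_states, n_households, n_variables)


states = [1, 2, 4]
values = precompute(states, n_households=5, variables=TARGET_VARIABLES)

# Store the values together with the labels needed to index them later.
out_path = os.path.join(tempfile.mkdtemp(), "precomputed_targets.npz")
np.savez_compressed(
    out_path,
    values=values,
    state_fips=np.array(states),
    variables=np.array(TARGET_VARIABLES),
)
```

Threads keep the sketch self-contained. Since the real simulations are presumably CPU-bound, a process pool or separate workers per state would likely be the better fit in practice.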
2. Matrix build step (fast, pure NumPy)
Read the precomputed file, apply constraint masks, build the sparse matrix. No Microsimulation import needed. Should take seconds, not minutes.
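A minimal sketch of what the NumPy-only build step might look like, under some assumptions not stated above: each target is one matrix row described by a hypothetical `(state_index, variable_index, mask)` tuple, and columns are laid out state-major (`col = state_index * n_households + household`). The actual row/column layout used by SparseMatrixBuilder may differ.

```python
import numpy as np


def build_coo(values, targets, n_households):
    """Turn precomputed values into COO triplets (rows, cols, data).

    values:  array of shape (n_states, n_households, n_variables)
    targets: list of (state_index, variable_index, mask) tuples, one per
             matrix row; mask is a boolean array over households, or None
    """
    rows, cols, data = [], [], []
    for row, (s, v, mask) in enumerate(targets):
        contrib = values[s, :, v]
        if mask is not None:
            # Zero out households that fail the target's constraints.
            contrib = np.where(mask, contrib, 0.0)
        hh = np.flatnonzero(contrib)  # store only nonzero entries
        rows.append(np.full(hh.size, row))
        cols.append(s * n_households + hh)
        data.append(contrib[hh])
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(data)


# Toy input: 2 states, 3 households, 2 variables.
values = np.arange(12, dtype=float).reshape(2, 3, 2)
targets = [
    (0, 1, None),                           # state 0, variable 1, no constraint
    (1, 0, np.array([True, False, True])),  # state 1, variable 0, masked
]
rows, cols, data = build_coo(values, targets, n_households=3)
```

The triplets can then be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n_targets, n_states * n_households))` when the sparse matrix itself is needed.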
Benefits
- Matrix build drops from ~20 min to seconds
- Precomputed file can be cached on HuggingFace alongside stratified_extended_cps.h5
- Adding new calibration targets = just add rows to the target_filter, no re-simulation
- Precompute step can run on GPU instances or be parallelized across workers
- Clear separation of concerns: microsimulation vs. optimization
Related
- #486 (Replace CD stacking with cloning + national block assignment): both simplify the calibration architecture
- #492 (Add state income tax revenue as calibration target): motivated this, since adding state_income_tax required the full matrix rebuild