State calibration (policy_data.db) uses stale 2022-2023 targets for 2024 sim #503

@MaxGhenis

Summary

The L0/CD calibration system (policy_data.db) uses targets from 2022 IRS SOI actuals and 2023 CBO projections, but the simulation runs at time_period=2024. Meanwhile, the national ECPS calibration (loss.py) correctly reads 2024-indexed values from the CBO/Treasury YAML parameters. This means the state .h5 files are calibrated to the wrong aggregate totals.

Evidence

Comparing the DB targets (used by state/CD calibration) vs the YAML parameter values (used by national calibration):

| Variable | DB Target | DB Source/Year | YAML (2024) | YAML Source | Delta (DB vs. YAML) |
|---|---|---|---|---|---|
| income_tax | $2,051B | IRS SOI actual, 2022 | $2,426B | CBO projection, 2024 | -15% |
| social_security | $1,379B | CBO projection, 2023 | $1,454B | CBO projection, 2024 | -5% |
| snap | $107B | CBO projection, 2023 | $94B | CBO projection, 2024 | +14% |
| ssi | $60.1B | CBO projection, 2023 | $57B | CBO projection, 2024 | +5% |
| eitc | $64.4B | Treasury, 2023 | $67.3B | Treasury, 2024 | -4% |
| refundable_ctc | $33.1B | IRS SOI actual, 2022 | not targeted | | |
| unemployment_compensation | $35B | CBO, 2023 | $34.7B | CBO, 2024 | +1% |

Income tax is the most significant discrepancy: the DB uses $2,051B (2022 SOI actual), while the 2024 CBO projection is $2,426B, about 18% higher than the DB target.
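The deltas can be reproduced with a few lines of arithmetic. The sketch below assumes the Delta column means (DB - YAML) / YAML, which is the convention the snap row implies; the dollar figures are copied from the table, and the names `targets` and `pct_delta` are illustrative only. Under that convention income_tax comes out near -15%, while the 18% figure corresponds to measuring the same gap relative to the DB value.

```python
# Values in $B, copied from the table above: (DB target, YAML 2024 value).
targets = {
    "income_tax": (2051.0, 2426.0),
    "social_security": (1379.0, 1454.0),
    "snap": (107.0, 94.0),
    "ssi": (60.1, 57.0),
    "eitc": (64.4, 67.3),
    "unemployment_compensation": (35.0, 34.7),
}


def pct_delta(db_value, yaml_value):
    """Percent difference of the DB target relative to the YAML value."""
    return 100.0 * (db_value - yaml_value) / yaml_value


deltas = {var: pct_delta(db, yaml) for var, (db, yaml) in targets.items()}

# The same income_tax gap, measured relative to the DB target instead.
db_income_tax, yaml_income_tax = targets["income_tax"]
income_tax_gap_vs_db = 100.0 * (yaml_income_tax - db_income_tax) / db_income_tax

for var, delta in deltas.items():
    print(f"{var:28s} {delta:+.1f}%")
print(f"income_tax gap relative to DB: {income_tax_gap_vs_db:+.1f}%")
```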

How to verify

# DB targets (L0/CD calibration)
import sqlite3
conn = sqlite3.connect("policyengine_us_data/storage/calibration/policy_data.db")
cur = conn.execute("""
    SELECT variable, period, value FROM targets 
    WHERE variable IN ('income_tax','social_security','snap','ssi','eitc','refundable_ctc','unemployment_compensation')
      AND active = 1
      AND stratum_id NOT IN (SELECT stratum_id FROM stratum_constraints 
                             WHERE constraint_variable IN ('congressional_district_geoid','state_fips'))
    ORDER BY variable, period
""")
for row in cur:
    print(row)
conn.close()

# YAML parameters (national calibration) 
from policyengine_us import Microsimulation
sim = Microsimulation()
params = sim.tax_benefit_system.parameters
for var in ['income_tax','snap','social_security','ssi','unemployment_compensation']:
    print(var, params(2024).calibration.gov.cbo._children[var])
print('eitc', params(2024).calibration.gov.treasury.tax_expenditures.eitc)

Impact on stacked state aggregates

We compared stacked-state totals (summing all 51 state .h5 files) against both target sets:

| Variable | Stacked States | vs DB Target | vs YAML (2024) |
|---|---|---|---|
| income_tax | $2,196B | 107% | 90% |
| social_security | $1,282B | 93% | 88% |
| snap | $97B | 91% | 103% |
| eitc | $62.7B | 97% | 93% |

The stacked states overshoot the DB income_tax target by 7% (because the DB target is too low), but undershoot the correct 2024 value by 10%.
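A check like this could be automated. The following is a minimal sketch, not code from the repository: `flag_misses` and the 5% tolerance are hypothetical, and the totals are the rounded $B figures quoted in this issue, so the last digit of a ratio can differ slightly from the table.

```python
# Totals in $B, copied from the tables in this issue.
stacked = {"income_tax": 2196.0, "social_security": 1282.0, "snap": 97.0, "eitc": 62.7}
db_targets = {"income_tax": 2051.0, "social_security": 1379.0, "snap": 107.0, "eitc": 64.4}
yaml_2024 = {"income_tax": 2426.0, "social_security": 1454.0, "snap": 94.0, "eitc": 67.3}


def flag_misses(totals, targets, tolerance=0.05):
    """Return {variable: ratio} for totals more than `tolerance` away from target."""
    misses = {}
    for var, total in totals.items():
        ratio = total / targets[var]
        if abs(ratio - 1.0) > tolerance:
            misses[var] = ratio
    return misses


misses_vs_db = flag_misses(stacked, db_targets)
misses_vs_yaml = flag_misses(stacked, yaml_2024)

for label, misses in [("DB", misses_vs_db), ("YAML 2024", misses_vs_yaml)]:
    for var, ratio in misses.items():
        print(f"{var} is at {ratio:.1%} of the {label} target")
```

Running the same check against both target sets makes the disagreement explicit: a variable can pass against one set and fail against the other.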

Root cause

  • fit_calibration_weights.py runs at time_period = 2024 (line 83)
  • SparseMatrixBuilder calculates variables at self.time_period (2024) to build the loss matrix
  • But the target values in policy_data.db were populated from 2022 SOI and 2023 CBO data and never updated
  • The period column in the targets table is metadata only — not used to select the correct target year

In contrast, loss.py dynamically reads sim.tax_benefit_system.parameters(time_period).calibration.gov.cbo._children[variable_name], which correctly resolves to the 2024 YAML value.
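One way to make the period column meaningful would be a guard that compares it with the simulation year before calibration starts. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite table with only the `variable`, `period`, `value`, and `active` columns seen in the verification query, assumes `period` stores an integer year, and introduces a hypothetical helper `find_stale_targets`. The real schema and units may differ.

```python
import sqlite3


def find_stale_targets(conn, time_period):
    """Return (variable, period) for active targets whose period differs from time_period."""
    return conn.execute(
        "SELECT variable, period FROM targets "
        "WHERE active = 1 AND period != ? ORDER BY variable",
        (time_period,),
    ).fetchall()


# Toy stand-in for policy_data.db; values are in $B as quoted in this issue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE targets (variable TEXT, period INTEGER, value REAL, active INTEGER)"
)
conn.executemany(
    "INSERT INTO targets VALUES (?, ?, ?, ?)",
    [
        ("income_tax", 2022, 2051.0, 1),
        ("social_security", 2023, 1379.0, 1),
        ("snap", 2023, 107.0, 1),
    ],
)

stale = find_stale_targets(conn, time_period=2024)
for variable, period in stale:
    print(f"{variable}: target period {period} does not match simulation year 2024")
conn.close()
```

A calibration run could fail fast, or at least warn, when this list is non-empty.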

Proposed fix

Update policy_data.db national targets to use 2024 values from the same YAML parameters that loss.py uses. This could be:

  1. Quick fix: SQL UPDATE to set correct 2024 values for the ~7 affected national targets
  2. Structural fix: Have the DB ETL read from the YAML parameters (as loss.py does) so they stay in sync automatically. The loss.py comment at lines 12-14 already notes this: "A future PR should wire build_loss_matrix() to read from the database so this dict can be deleted."

Option 2 is preferred since it prevents future drift. The ETL that populates policy_data.db should call sim.tax_benefit_system.parameters(2024).calibration.gov.cbo._children[var] for CBO programs and parameters(2024).calibration.gov.treasury.tax_expenditures.eitc for EITC.
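Both options reduce to the same write: overwrite each affected target with the value for the simulation year. The sketch below shows that step under stated assumptions. `sync_national_targets` is a hypothetical helper, the table is a simplified in-memory stand-in for policy_data.db, and a plain dict takes the place of the YAML lookup so the example runs without policyengine_us installed. In a real ETL the dict would be filled from the parameter calls described above, and the UPDATE would also need the national-stratum filter used in the verification query.

```python
import sqlite3


def sync_national_targets(conn, values_by_variable, year):
    """Set value and period for each listed active target; return rows changed."""
    changed = 0
    for variable, value in values_by_variable.items():
        cur = conn.execute(
            # Simplified: a real query would also restrict to national strata.
            "UPDATE targets SET value = ?, period = ? "
            "WHERE variable = ? AND active = 1",
            (value, year, variable),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Toy stand-in for policy_data.db; values are in $B as quoted in this issue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE targets (variable TEXT, period INTEGER, value REAL, active INTEGER)"
)
conn.executemany(
    "INSERT INTO targets VALUES (?, ?, ?, ?)",
    [
        ("income_tax", 2022, 2051.0, 1),
        ("eitc", 2023, 64.4, 1),
        ("refundable_ctc", 2022, 33.1, 1),  # not in the YAML set, left untouched
    ],
)

# Stand-in for the 2024 YAML parameter values.
yaml_2024 = {"income_tax": 2426.0, "eitc": 67.3}

rows_changed = sync_national_targets(conn, yaml_2024, year=2024)
updated = conn.execute(
    "SELECT variable, period, value FROM targets ORDER BY variable"
).fetchall()
conn.close()
print(rows_changed, updated)
```

Writing the period alongside the value keeps the metadata truthful, so a later staleness check has something reliable to compare against.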

Files involved

  • DB: policyengine_us_data/storage/calibration/policy_data.db (targets table)
  • DB ETL: policyengine_us_data/db/ (populates targets)
  • L0 calibration: policyengine_us_data/datasets/cps/local_area_calibration/fit_calibration_weights.py
  • Sparse matrix builder: policyengine_us_data/datasets/cps/local_area_calibration/sparse_matrix_builder.py
  • National calibration (reference): policyengine_us_data/utils/loss.py
  • CBO YAML params: installed at policyengine_us/parameters/calibration/gov/cbo/*.yaml
  • Treasury EITC YAML: policyengine_us/parameters/calibration/gov/treasury/tax_expenditures/eitc.yaml

Labels: bug (Something isn't working)
