@vahid-ahmadi vahid-ahmadi commented Aug 12, 2025

Fixes #4

ONS Data Usage:

  1. Base Structure Generation (Step 1):
  • Uses all 88 SIC sectors from ONS with their exact firm counts
  • Uses ONS turnover bands (0-49k, 50-99k, 100-249k, etc.) to generate 2,724,775 base firms
  • Applies realistic statistical distributions within each band (log-normal for large firms, beta for small firms)
  • Creates the "shape" of the economy - how firms are distributed across sectors and sizes
  2. Sector Distribution Preservation:
  • Maintains ONS sector structure as the foundation
  • Ensures realistic within-sector firm size distributions
  • Preserves the economic composition (manufacturing, services, retail, etc.)
  3. Turnover Generation Parameters:
  • Uses ONS band boundaries to generate realistic turnover values
  • Applies different probability distributions per band to match real-world patterns
  • Creates individual firm records with continuous turnover values (not just band categories)
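
A minimal sketch of how continuous turnover values could be drawn inside one ONS band. This is not the code from this PR: the function name `sample_band_turnover`, the `large_threshold` cut-off, and the distribution parameters (beta(2, 3), log-normal width) are all assumptions chosen for illustration.

```python
import numpy as np

def sample_band_turnover(low, high, n, rng, large_threshold=1_000_000.0):
    """Draw n continuous turnover values (in GBP) inside the band [low, high).

    Assumption: bands starting at or above `large_threshold` use a log-normal
    shape, all other bands use a beta shape. Parameters are illustrative only.
    """
    if low >= large_threshold:
        # Log-normal centred on the geometric midpoint of the band;
        # rejection sampling keeps only draws that fall inside the band.
        mu = 0.5 * (np.log(low) + np.log(high))
        sigma = (np.log(high) - np.log(low)) / 4.0
        out = np.empty(0)
        while out.size < n:
            draws = rng.lognormal(mean=mu, sigma=sigma, size=2 * n)
            out = np.concatenate([out, draws[(draws >= low) & (draws < high)]])
        return out[:n]
    # Beta(2, 3) is skewed toward the lower end of the band, then rescaled.
    return low + (high - low) * rng.beta(2.0, 3.0, size=n)

rng = np.random.default_rng(0)
small_band = sample_band_turnover(50_000, 100_000, 1_000, rng)
large_band = sample_band_turnover(1_000_000, 5_000_000, 500, rng)
```

Every draw stays within its band, so band counts are preserved while each firm gets its own continuous turnover value.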

HMRC Data Usage:

  1. Calibration Targets (Step 2):
  • Sector targets: Uses HMRC sector totals (2023-24 column) as calibration goals
  • Turnover band targets: Uses HMRC band totals (Negative_or_Zero, £1_to_Threshold, etc.)
  • Overall total: Uses HMRC total (2,178,950) as the final target count
  2. Missing Data (Step 4):
  • Adds 216,500 negative/zero turnover firms (not in ONS data)
  • Distributes these across sectors using HMRC sector weights
  • Fills the gap where ONS data is incomplete
  3. Two-Stage Calibration Validation:
  • Stage 1: Forces exact sector matching using HMRC sector quotas
  • Stage 2: Adjusts turnover bands to match HMRC band distribution
  • Quality control: Validates against both HMRC datasets simultaneously
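
One way the negative/zero turnover firms could be spread across sectors using HMRC sector weights is largest-remainder rounding, so the integer counts sum exactly to the total. This is a sketch under assumptions: `allocate_by_weights` is a hypothetical helper, and the sector codes and weights below are made-up placeholders, not HMRC figures.

```python
def allocate_by_weights(total, weights):
    """Split an integer total across keys in proportion to their weights.

    Uses largest-remainder rounding so the result sums exactly to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    counts = {k: int(v) for k, v in exact.items()}  # floor of each share
    shortfall = total - sum(counts.values())
    # Hand the leftover units to the keys with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:shortfall]:
        counts[k] += 1
    return counts

# Hypothetical sector weights (placeholders, not real HMRC data).
hypothetical_weights = {"A": 7, "B": 11, "C": 3}
zero_turnover_by_sector = allocate_by_weights(216_500, hypothetical_weights)
```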

Integration Strategy:

  • ONS = Foundation: Provides realistic economic structure and firm size patterns
  • HMRC = Calibration: Provides validation targets and missing data completion
  • Balance: Preserves ONS realism while achieving HMRC statistical accuracy

The Process Flow:

  1. Generate 2.7M firms using ONS structure → realistic base
  2. Calculate how ONS differs from HMRC → calibration factors
  3. Sample 2.2M firms using sector-stratified approach → perfect sector match
  4. Adjust turnover values to improve band matching → fine-tuning
  5. Validate against both datasets → quality assurance
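
Steps 2 and 3 of the flow above can be sketched as follows. This is an illustrative outline, not the implementation merged here: `calibration_factors` and `stratified_sample` are hypothetical names, and sampling with replacement when a sector target exceeds the available base firms is an assumption.

```python
import numpy as np

def calibration_factors(base_counts, target_counts):
    """Ratio of HMRC target count to ONS-based count, per sector."""
    return {
        s: target_counts[s] / base_counts[s]
        for s in target_counts
        if base_counts.get(s, 0) > 0
    }

def stratified_sample(sectors, target_counts, rng):
    """Return firm indices so that each sector hits its target count."""
    sectors = np.asarray(sectors)
    chosen = []
    for sector, target in target_counts.items():
        idx = np.flatnonzero(sectors == sector)
        if idx.size == 0:
            continue  # no base firms to draw from in this sector
        # Assumption: oversample with replacement if the target exceeds supply.
        replace = target > idx.size
        chosen.append(rng.choice(idx, size=target, replace=replace))
    return np.concatenate(chosen) if chosen else np.empty(0, dtype=int)

# Toy data: 10 base firms in sector "A", 5 in sector "B".
toy_sectors = ["A"] * 10 + ["B"] * 5
toy_targets = {"A": 4, "B": 8}
factors = calibration_factors({"A": 10, "B": 5}, toy_targets)
picked = stratified_sample(toy_sectors, toy_targets, np.random.default_rng(0))
```

Because the draw is done sector by sector, the sampled sector counts equal the targets by construction; the turnover bands are left for the later adjustment step.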

Why This Approach Works:

  • ONS provides authenticity - real firm size distributions by industry
  • HMRC provides accuracy - official statistics for validation
  • Two-stage design - optimizes the two objectives one after the other rather than compromising on both
  • Results: 93.8% sector accuracy and 92.6% band accuracy, averaging 93.2% overall
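
The overall figure is the simple mean of the sector and band scores. The PR text does not define how each accuracy score is computed, so the per-category formula below (one minus the relative count error, averaged over categories) is an assumption for illustration; `mean_accuracy` and `overall_accuracy` are hypothetical helpers.

```python
def mean_accuracy(synthetic, target):
    """Average of (1 - relative count error) over categories, in percent.

    Assumed definition; the PR does not state its accuracy metric.
    """
    scores = [
        max(0.0, 1.0 - abs(synthetic[k] - target[k]) / target[k])
        for k in target
        if target[k] > 0
    ]
    return 100.0 * sum(scores) / len(scores)

def overall_accuracy(sector_acc, band_acc):
    """Simple mean of the two scores."""
    return (sector_acc + band_acc) / 2.0

# Reported figures from the results line above.
overall = round(overall_accuracy(93.8, 92.6), 1)
```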

@vahid-ahmadi vahid-ahmadi self-assigned this Aug 12, 2025
@vahid-ahmadi vahid-ahmadi merged commit fcc4422 into main Aug 12, 2025
2 checks passed
daphnehanse11 pushed a commit to daphnehanse11/uk-vatlab that referenced this pull request Aug 21, 2025
Development

Successfully merging this pull request may close these issues.

Generate synthetic firm data