Synthetic data pipeline for tabular insurance data. Scales 1,337 original rows to 50,000 synthetic rows using multiple generative models and an 11-section QC suite for fidelity, privacy, and structural checks.
The repository keeps generated outputs out of version control by default:
insurance_augmented_v5.csv, qc_report_v5.txt, model pickles, MLflow runs,
and local virtual environments are ignored.
| Model | Overall QC | TSTR R² | Notes |
|---|---|---|---|
| tabddpm | PASS | 0.88 | Best reported fidelity; requires PyTorch |
| tvae | FAIL* | 0.86 | Code default; works without PyTorch |
| ctgan | FAIL | 0.73 | GAN baseline |
| dp_ctgan | FAIL | 0.59 | Differential privacy only |
* FAIL does not mean the model is broken or unusable. It means at least one
QC threshold was exceeded on the benchmark run; in TVAE's case, a single
Spearman correlation check. The synthetic data still supports a downstream
model well (TSTR, i.e. train on synthetic and test on real, R² = 0.86).
Review the full qc_report_v5.txt for section-by-section details before
drawing conclusions.
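To make the TSTR column concrete: the metric fits a model on synthetic rows and scores it on real rows. The block below is a minimal sketch using only the standard library and invented toy data. The single-feature linear model, the helper names (`fit_line`, `r2_score`, `tstr_r2`), and the toy age-to-charges relationship are assumptions for illustration, not the pipeline's actual QC code.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tstr_r2(synth_x, synth_y, real_x, real_y):
    """Train on synthetic data, then score the fitted model on real data."""
    a, b = fit_line(synth_x, synth_y)
    preds = [a + b * x for x in real_x]
    return r2_score(real_y, preds)


# Toy stand-ins (hypothetical): "real" held-out rows and "synthetic" rows
# drawn from the same invented linear relationship plus noise.
rng = random.Random(0)
real_x = [rng.uniform(18, 64) for _ in range(200)]
real_y = [250.0 * x + rng.gauss(0, 1500) for x in real_x]
synth_x = [rng.uniform(18, 64) for _ in range(2000)]
synth_y = [250.0 * x + rng.gauss(0, 1500) for x in synth_x]

score = tstr_r2(synth_x, synth_y, real_x, real_y)
print(f"TSTR R2 on toy data: {score:.2f}")
```

A score close to what a model trained on real data would achieve suggests the synthetic rows preserve the feature-target relationship; a much lower score suggests they do not.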
insurance_orig.csv is the Medical Cost Personal Datasets published on
Kaggle by Miri Choi under the CC0 1.0 Universal (Public Domain) licence.
- Source: https://www.kaggle.com/datasets/mirichoi0218/insurance
- Licence: CC0 1.0 — no restrictions on use, reproduction, or distribution
- Rows: 1,338 — de-identified US health insurance charges
If you are using a different source file, confirm its licence before publishing this repository.
- augment_insurance_v5_6.py: main CLI pipeline
- insurance_orig.csv: source dataset (CC0 1.0, see Dataset above)
- requirements.txt: core Python dependencies
- .gitignore: excludes generated data, local environments, and model artifacts
Requires Python ≥ 3.9.
Create and activate a fresh virtual environment, then install the core dependencies:

    pip install -r requirements.txt

For TabDDPM, install PyTorch separately for your platform. For example:

    # CPU-only
    pip install torch --index-url https://download.pytorch.org/whl/cpu
    # CUDA 12.1 (adjust cu121 to match your driver)
    pip install torch --index-url https://download.pytorch.org/whl/cu121

See https://pytorch.org/get-started/locally/ for the full matrix.
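Since TabDDPM requires PyTorch and TVAE does not, it can help to check whether PyTorch is importable before choosing a model. The following is a small sketch of such a check; `torch_available` and `pick_model` are hypothetical helpers and are not claimed to reflect how the script's own --auto-select option decides.

```python
import importlib.util


def torch_available() -> bool:
    """Return True if the torch package can be found, without importing it."""
    return importlib.util.find_spec("torch") is not None


def pick_model(has_torch: bool) -> str:
    """Hypothetical rule: prefer tabddpm when PyTorch is present, else tvae."""
    return "tabddpm" if has_torch else "tvae"


model = pick_model(torch_available())
# Print the command to run rather than executing it.
print(
    "python augment_insurance_v5_6.py "
    f"--input insurance_orig.csv --model {model}"
)
```

Using `find_spec` avoids the cost of importing PyTorch just to test for its presence.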
Run the default TVAE model:

    python augment_insurance_v5_6.py --input insurance_orig.csv --model tvae

Run TabDDPM after installing PyTorch:

    python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm

Auto-select the best available model:

    python augment_insurance_v5_6.py --input insurance_orig.csv --auto-select

Save and reload a trained TabDDPM model:

    python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --save-model ddpm.pkl
    python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --load-model ddpm.pkl --seed 99

- Evaluate downstream models on original held-out test data only
- Never use synthetic data for evaluation
- The elevated membership inference attack (MIA) AUC at the 37x multiplier is structural, not a sign of memorisation
- Add a repository licence before making the project public
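The first two notes amount to setting aside a slice of the original rows before any synthesis, and scoring downstream models only on that slice. Below is a minimal standard-library sketch of such a split, together with the arithmetic behind the 37x figure. The `split_holdout` helper, the 20% test fraction, and the placeholder rows are assumptions; whether the pipeline performs its own internal split is not stated here.

```python
import random


def split_holdout(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split rows into (train, test)."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Placeholder rows standing in for the 1,337 original records.
rows = [{"row_id": i} for i in range(1337)]
train_rows, test_rows = split_holdout(rows)

# Only train_rows would be used to fit a generative model; test_rows stay
# untouched until downstream evaluation.
print(len(train_rows), len(test_rows))

# Ratio of synthetic to original rows quoted in the notes (roughly 37x).
multiplier = 50_000 / 1_337
print(f"{multiplier:.1f}x")
```

Keeping the held-out rows away from the generator means a downstream score measures generalisation to real data rather than similarity to the rows the generator was trained on.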