Insurance Data Augmentation

A synthetic data pipeline for tabular insurance data. It scales 1,337 original rows to 50,000 synthetic rows with a choice of generative models, then validates the output with an 11-section QC suite covering fidelity, privacy, and structural checks.

The repository keeps generated outputs out of version control by default: insurance_augmented_v5.csv, qc_report_v5.txt, model pickles, MLflow runs, and local virtual environments are ignored.

Models

Model      Overall QC   TSTR R²   Notes
tabddpm    PASS         0.88      Best reported fidelity; requires PyTorch
tvae       FAIL*        0.86      Code default; works without PyTorch
ctgan      FAIL         0.73      GAN baseline
dp_ctgan   FAIL         0.59      Differential privacy only

* FAIL does not mean the model is broken or unusable. It means at least one QC threshold was exceeded on the benchmark run; in TVAE's case, that was a single Spearman correlation check. Its TSTR (train on synthetic, test on real) R² of 0.86 is close to TabDDPM's 0.88. Review the full qc_report_v5.txt for section-by-section details before drawing conclusions.
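
To make the footnote concrete, here is a minimal sketch of what a Spearman correlation check between real and synthetic columns could look like. It is not taken from augment_insurance_v5_6.py: the function names and the 0.10 threshold are assumptions chosen for illustration, and qc_report_v5.txt remains the reference for the thresholds the pipeline actually applies.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a column is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearman_check(real_x, real_y, synth_x, synth_y, max_gap=0.10):
    """Compare one column pair; max_gap is a hypothetical threshold."""
    gap = abs(spearman(real_x, real_y) - spearman(synth_x, synth_y))
    return gap, gap <= max_gap
```

Under a check of this shape, a single column pair whose gap exceeds the threshold is enough to flip the overall result to FAIL, even when every other section passes.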

Dataset

insurance_orig.csv is the Medical Cost Personal Datasets published on Kaggle by Miri Choi under the CC0 1.0 Universal (Public Domain) licence.

Source: https://www.kaggle.com/datasets/mirichoi0218/insurance
Licence: CC0 1.0 — no restrictions on use, reproduction, or distribution
Rows: 1,338 — de-identified US health insurance charges

If you are using a different source file, confirm its licence before publishing this repository.
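
If you want to confirm that your copy of the file matches the description above, a small standard-library check such as the following may help. It is a sketch, not part of the pipeline: summarise_csv is a hypothetical helper, and the expected column names are those of the Kaggle file as commonly distributed, so verify them against your own copy.

```python
import csv
from collections import Counter

# Column names as commonly distributed on Kaggle (assumption: verify locally).
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def summarise_csv(handle):
    """Count rows and exact duplicate rows in an open CSV file handle."""
    reader = csv.DictReader(handle)
    fieldnames = list(reader.fieldnames or [])
    rows = [tuple(row[name] for name in fieldnames) for row in reader]
    counts = Counter(rows)
    # Each repeated row contributes (occurrences - 1) duplicates.
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "columns": fieldnames,
        "columns_match": fieldnames == EXPECTED_COLUMNS,
        "rows": len(rows),
        "exact_duplicates": duplicates,
    }
```

To use it, open insurance_orig.csv with newline="" and pass the handle to summarise_csv; the returned dictionary reports the row count and the number of exact duplicate rows.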

Project Files

augment_insurance_v5_6.py — main CLI pipeline
insurance_orig.csv — source dataset (CC0 1.0, see Dataset above)
requirements.txt — core Python dependencies
.gitignore — excludes generated data, local environments, and model artifacts

Install

Requires Python ≥ 3.9.

Create and activate a fresh virtual environment, then install the core dependencies:

pip install -r requirements.txt

For TabDDPM, install PyTorch separately for your platform. For example:

# CPU-only
pip install torch --index-url https://download.pytorch.org/whl/cpu

# CUDA 12.1 (adjust cu121 to match your driver)
pip install torch --index-url https://download.pytorch.org/whl/cu121

See https://pytorch.org/get-started/locally/ for the full matrix.
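
Before choosing a model, it can be useful to check the environment from Python. The sketch below verifies the interpreter version and whether PyTorch is importable, then suggests a model accordingly. The helper names are hypothetical, and the fallback rule simply mirrors the Models table (tabddpm requires PyTorch, tvae does not); it is not a description of how --auto-select works internally.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # from "Requires Python >= 3.9" above


def python_ok(version_info=sys.version_info):
    """True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def module_available(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def suggest_model():
    """Suggest tabddpm when PyTorch is installed, otherwise tvae."""
    return "tabddpm" if module_available("torch") else "tvae"
```

Calling suggest_model() in the activated virtual environment gives the value to pass to --model.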

Usage

Run the default TVAE model:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tvae

Run TabDDPM after installing PyTorch:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm

Auto-select the best available model:

python augment_insurance_v5_6.py --input insurance_orig.csv --auto-select

Save and reload a trained TabDDPM model:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --save-model ddpm.pkl
python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --load-model ddpm.pkl --seed 99
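
When scripting several runs, the commands above can be assembled programmatically. The following is a minimal sketch: build_command is a hypothetical helper, and it uses only the flags shown in this section (--input, --model, --auto-select, --save-model, --load-model, --seed). Treating --auto-select as taking precedence over --model is a choice made by this wrapper, not documented behaviour of the CLI.

```python
import sys

SCRIPT = "augment_insurance_v5_6.py"


def build_command(input_csv, model=None, auto_select=False,
                  save_model=None, load_model=None, seed=None):
    """Return an argv list for the pipeline, using the current interpreter."""
    cmd = [sys.executable, SCRIPT, "--input", input_csv]
    if auto_select:
        cmd.append("--auto-select")
    elif model is not None:
        cmd += ["--model", model]
    if save_model is not None:
        cmd += ["--save-model", save_model]
    if load_model is not None:
        cmd += ["--load-model", load_model]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues and raises an error if the pipeline exits with a non-zero status.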

Important

Evaluate downstream models on original held-out test data only
Never use synthetic data for evaluation
The elevated MIA (membership inference attack) AUC at the 37x multiplier (50,000 synthetic rows from 1,337 originals) is structural, not a sign of memorisation
Add a repository licence before making the project public
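
The first two points describe a TSTR (train on synthetic, test on real) protocol. The sketch below illustrates it with a one-feature least-squares line in pure Python. All function names are hypothetical, the downstream model is deliberately trivial, and setting the held-out rows aside before any generator is fitted is an assumption of this sketch rather than documented behaviour of the pipeline.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the real rows and set a held-out test set aside."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def tstr_r2(synthetic_train, real_test):
    """Train on synthetic rows only; score on real held-out rows only."""
    slope, intercept = fit_line(synthetic_train)
    preds = [slope * x + intercept for x, _ in real_test]
    return r2_score([y for _, y in real_test], preds)
```

The important property is that real_test never reaches fit_line and synthetic rows never reach r2_score, so the score reflects how well a model trained on synthetic data generalises to real data.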
