PRANAVGAWALE-DS/Data-Augmentation

Insurance Data Augmentation

Synthetic data pipeline for tabular insurance data. Scales 1,337 original rows to 50,000 synthetic rows using multiple generative models and an 11-section QC suite for fidelity, privacy, and structural checks.

The repository keeps generated outputs out of version control by default: insurance_augmented_v5.csv, qc_report_v5.txt, model pickles, MLflow runs, and local virtual environments are ignored.

Models

Model    | Overall QC | TSTR R² | Notes
tabddpm  | PASS       | 0.88    | Best reported fidelity; requires PyTorch
tvae     | FAIL*      | 0.86    | Code default; works without PyTorch
ctgan    | FAIL       | 0.73    | GAN baseline
dp_ctgan | FAIL       | 0.59    | Differential privacy only

* FAIL does not mean the model is broken or unusable. It means at least one QC threshold was exceeded on the benchmark run; for TVAE, that was a single Spearman correlation check. The synthetic data is still of high quality (TSTR R² = 0.86). Review qc_report_v5.txt section by section before drawing conclusions.
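To make the Spearman check more concrete, here is a minimal sketch of how a pairwise rank-correlation comparison between original and synthetic data could look. It is not the pipeline's actual QC code: the function names, the toy columns, and the 0.1 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def rank(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))


def max_spearman_gap(real: Dict[str, list], synth: Dict[str, list]) -> float:
    """Largest absolute difference in pairwise Spearman correlation."""
    cols = sorted(real)
    gaps = []
    for idx, a in enumerate(cols):
        for b in cols[idx + 1:]:
            gaps.append(abs(spearman(real[a], real[b]) - spearman(synth[a], synth[b])))
    return max(gaps)


if __name__ == "__main__":
    # Toy numbers for illustration only; not taken from the real dataset.
    real = {"age": [19, 28, 33, 45, 60], "charges": [1700, 4400, 4500, 8200, 13000]}
    synth = {"age": [21, 30, 35, 44, 58], "charges": [2100, 3900, 5200, 7900, 12500]}
    THRESHOLD = 0.1  # hypothetical; the real QC threshold may differ
    gap = max_spearman_gap(real, synth)
    print("PASS" if gap <= THRESHOLD else "FAIL", round(gap, 3))
```

A check like this can fail on a single column pair even when every other pair matches closely, which is consistent with one failed check not implying poor data overall.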

Dataset

insurance_orig.csv is the "Medical Cost Personal Datasets" dataset published on Kaggle by Miri Choi under the CC0 1.0 Universal (Public Domain) licence.

If you are using a different source file, confirm its licence before publishing this repository.

Project Files

  • augment_insurance_v5_6.py — main CLI pipeline
  • insurance_orig.csv — source dataset (CC0 1.0, see Dataset above)
  • requirements.txt — core Python dependencies
  • .gitignore — excludes generated data, local environments, and model artifacts

Install

Requires Python ≥ 3.9.

Create and activate a fresh virtual environment, then install the core dependencies:

pip install -r requirements.txt

For TabDDPM, install PyTorch separately for your platform. For example:

# CPU-only
pip install torch --index-url https://download.pytorch.org/whl/cpu

# CUDA 12.1 (adjust cu121 to match your driver)
pip install torch --index-url https://download.pytorch.org/whl/cu121

See https://pytorch.org/get-started/locally/ for the full matrix.

Usage

Run the default TVAE model:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tvae

Run TabDDPM after installing PyTorch:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm

Auto-select the best available model:

python augment_insurance_v5_6.py --input insurance_orig.csv --auto-select
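One plausible way an auto-selection step could decide between TabDDPM and TVAE is to check whether PyTorch is importable. The sketch below illustrates that idea only; pick_model and torch_available are hypothetical names, and the actual --auto-select logic in augment_insurance_v5_6.py may use different criteria.

```python
import importlib.util
from typing import Callable


def torch_available() -> bool:
    """Return True if the torch package can be found, without importing it."""
    return importlib.util.find_spec("torch") is not None


def pick_model(has_torch: Callable[[], bool] = torch_available) -> str:
    """Prefer tabddpm when PyTorch is present, otherwise fall back to tvae.

    Assumption: this mirrors the README's model table, where tabddpm
    requires PyTorch and tvae works without it.
    """
    return "tabddpm" if has_torch() else "tvae"


if __name__ == "__main__":
    print(pick_model())
```

Passing the availability check in as a parameter keeps the decision easy to test on machines with or without PyTorch installed.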

Save and reload a trained TabDDPM model:

python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --save-model ddpm.pkl
python augment_insurance_v5_6.py --input insurance_orig.csv --model tabddpm --load-model ddpm.pkl --seed 99
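The save-then-reload pattern can be pictured with a toy example: fit once, pickle the fitted state, then reload it and sample with a different seed. This is a sketch under stated assumptions, not the pipeline's code; the "model" here is just a dict holding a mean and standard deviation, and all function names are hypothetical.

```python
import os
import pickle
import random
import statistics
import tempfile
from typing import Dict, List


def fit(values: List[float]) -> Dict[str, float]:
    """Toy stand-in for training: store the mean and standard deviation."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


def sample(model: Dict[str, float], n: int, seed: int) -> List[float]:
    """Draw n values from the fitted Gaussian using a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(model["mean"], model["stdev"]) for _ in range(n)]


def save_model(model: Dict[str, float], path: str) -> None:
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path: str) -> Dict[str, float]:
    # Only unpickle files you created yourself: pickle can execute code on load.
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    model = fit([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data, not the insurance file
    path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
    save_model(model, path)
    reloaded = load_model(path)
    # Same fitted parameters, different seed, so the samples differ.
    print(sample(reloaded, 3, seed=42))
    print(sample(reloaded, 3, seed=99))
```

Reloading skips retraining, so changing only the seed yields a new synthetic sample from the same fitted model.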

Important

  • Evaluate downstream models on original held-out test data only
  • Never use synthetic data for evaluation
  • The elevated membership inference attack (MIA) AUC at the ~37x multiplier (50,000 / 1,337 ≈ 37.4) is structural, not a sign of memorisation
  • Add a repository licence before making the project public
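The first two points describe the train-on-synthetic, test-on-real (TSTR) protocol behind the TSTR R² column. A minimal sketch of that protocol follows, using a one-feature least-squares line in place of a real downstream model; the function names and toy numbers are illustrative assumptions, not the pipeline's implementation.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tstr_r2(
    synth_x: Sequence[float],
    synth_y: Sequence[float],
    real_test_x: Sequence[float],
    real_test_y: Sequence[float],
) -> float:
    """Train on synthetic rows only, then score on held-out original rows."""
    slope, intercept = fit_line(synth_x, synth_y)
    preds: List[float] = [slope * v + intercept for v in real_test_x]
    return r2_score(real_test_y, preds)


if __name__ == "__main__":
    # Toy values for illustration; the held-out rows must come from the
    # original data and must not be used when fitting the generator.
    synth_x, synth_y = [1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]
    real_x, real_y = [5, 6, 7], [11.0, 13.2, 14.9]
    print(round(tstr_r2(synth_x, synth_y, real_x, real_y), 3))
```

Scoring on synthetic rows instead would measure how well the model fits the generator's output, not how well it generalises to real policyholders.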
