Abstract: Time series in Electronic Health Records (EHRs) present unique challenges for generative models, such as irregular sampling, missing values, and high dimensionality. In this paper, we propose a novel generative adversarial network (GAN) model, TimEHR, to generate time series data from EHRs. In particular, TimEHR treats time series as images and is based on two conditional GANs. The first GAN generates missingness patterns, and the second GAN generates time series values based on the missingness pattern. Experimental results on three real-world EHR datasets show that TimEHR outperforms state-of-the-art methods in terms of fidelity, utility, and privacy metrics.
Clone the repository, create a virtual environment (`venv` or `conda`), and install the required packages using `pip`:
# clone the repository
git clone https://github.com/hojjatkarami/TimEHR.git
cd TimEHR
# option 1: using venv
python3 -m venv test2
source test2/bin/activate
# option 2: using conda
conda create --name TimEHR python=3.9.7 --yes
conda activate TimEHR
# install the required packages
pip install -r requirements.txt
We used three real-world EHR datasets as well as simulated data in our experiments:
Dataset Name | Size | Number of Features |
---|---|---|
PhysioNet/Computing in Cardiology Challenge 2012 | 12k | 35 |
PhysioNet/Computing in Cardiology Challenge 2019 | 38k | 32 |
MIMIC-III | 51k | 37 |
Simulated Data | 10k | 16,32,64,128 |
We need to convert irregularly-sampled time series to images. Please refer to the data folder for more details on the datasets.
Figure: Converting time series to images.
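As a rough illustration of the conversion, the sketch below bins irregularly sampled observations onto a fixed time-by-feature grid and returns a value image together with a missingness mask. It is not the repository's preprocessing code: the function name `series_to_image`, the grid size, and the 48-hour horizon are assumptions made for this example, and the actual steps (image size, normalization, handling of repeated measurements) are described in the data folder.

```python
import numpy as np


def series_to_image(times, feature_ids, values, n_features, n_bins, t_max):
    """Bin irregular (time, feature, value) observations onto a fixed grid.

    Hypothetical helper for illustration only.
    Returns (image, mask), each of shape (n_bins, n_features);
    mask[i, j] is 1.0 where feature j was observed in time bin i.
    """
    times = np.asarray(times, dtype=float)
    feature_ids = np.asarray(feature_ids, dtype=int)
    values = np.asarray(values, dtype=float)

    # Map each timestamp to a row index; clip so t == t_max lands in the last bin.
    rows = np.clip((times / t_max * n_bins).astype(int), 0, n_bins - 1)

    image = np.zeros((n_bins, n_features))
    mask = np.zeros((n_bins, n_features))
    # If several observations fall in the same cell, the one that appears
    # last in the input wins (an arbitrary choice for this sketch).
    for r, f, v in zip(rows, feature_ids, values):
        image[r, f] = v
        mask[r, f] = 1.0
    return image, mask


# Toy example: 3 features observed at irregular times over an assumed 48 h window.
times = [0.5, 3.0, 3.2, 47.9]
feature_ids = [0, 1, 1, 2]
values = [80.0, 36.6, 37.1, 120.0]
image, mask = series_to_image(
    times, feature_ids, values, n_features=3, n_bins=8, t_max=48.0
)
print(image.shape, int(mask.sum()))
```

The mask plays the role of the missingness pattern mentioned in the abstract, and the image holds the observed values at the same positions.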
We use the `hydra-core` library to manage all configuration parameters. You can change them in `configs/config.yaml`.
We highly recommend using `wandb` for logging and tracking experiments. Get your API key from wandb, create a `.env` file in the root directory, and add the following line:
WANDB_API_KEY=your_api_key
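How the repository reads this file is not shown here. As a minimal sketch using only the standard library, a `.env` file can be loaded into the process environment as follows; `load_env_file` is a hypothetical helper written for this example, not part of TimEHR. wandb itself picks up the `WANDB_API_KEY` environment variable once it is set.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration only. Blank lines and lines
    starting with '#' are skipped; variables that are already set in the
    environment are not overwritten. Returns the pairs found in the file.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    print("WANDB_API_KEY set:", "WANDB_API_KEY" in os.environ)
```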
The following command trains the model and generates synthetic time series for `P12-split0` (prepare the data in the `data` folder before running):
python train.py
This will train the TimEHR modules (CWGAN-GP and Pix2Pix) for the default configuration (P12 dataset, split0) and print the generated dataframe. The modules are saved locally in the `Results/{dataset}-s{split}/[CWGAN|Pix2Pix]/` folder as well as on the wandb servers (`account_name/[CWGAN|PIXGAN]`).
python eval.py Results/p12-s0
This will generate and evaluate synthetic time series for the trained models in the `Results/p12-s0` folder and save the results in the wandb project `TimEHR-Eval` as well as locally in the `Results/p12-s0/TimEHR-Eval` folder.
For a more in-depth tutorial on how to train, generate, evaluate, and visualize the synthetic data, please check out our notebook Tutorial.ipynb.
To replicate the results in the paper, please follow the steps below:
- Run the following commands:
python train.py -m data=p12 split=0,1,2,3,4
python train.py -m data=mimic split=0,1,2,3,4
python train.py -m data=p19 split=0,1,2,3,4 pix2pix.lambda_l1=100
- Use `python eval.py Results/{dataset}-s{split}` for the evaluation. The results will be saved to the wandb dashboard (`account_name/TimEHR-Eval`).
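The two steps above can be scripted. The sketch below only assembles the command lines listed in this README and prints them; it runs them only when the script's own `--run` flag (a hypothetical flag of this example, not of TimEHR) is passed. It assumes the results folders follow the `Results/{dataset}-s{split}` pattern with the same dataset names used in the `data=` overrides, which is confirmed here only for `p12`.

```python
import subprocess
import sys

# Hydra multirun overrides copied from the training commands above.
SWEEPS = [
    ["data=p12", "split=0,1,2,3,4"],
    ["data=mimic", "split=0,1,2,3,4"],
    ["data=p19", "split=0,1,2,3,4", "pix2pix.lambda_l1=100"],
]


def train_commands(python=sys.executable):
    """One multirun training command per dataset."""
    return [[python, "train.py", "-m", *overrides] for overrides in SWEEPS]


def eval_commands(python=sys.executable,
                  datasets=("p12", "mimic", "p19"),
                  splits=range(5)):
    """One evaluation command per (dataset, split) results folder.

    Assumes folder names match the `data=` values (an assumption).
    """
    return [
        [python, "eval.py", f"Results/{d}-s{s}"]
        for d in datasets
        for s in splits
    ]


if __name__ == "__main__":
    # Dry run by default: print the commands without executing them.
    execute = "--run" in sys.argv
    for cmd in train_commands() + eval_commands():
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)
```

Running the training commands before the evaluation commands matters, since `eval.py` expects trained models in each results folder.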
If you find this repo useful, please cite our paper:
@article{karami2024timehr,
title={TimEHR: Image-based Time Series Generation for Electronic Health Records},
author={Karami, Hojjat and Hartley, Mary-Anne and Atienza, David and Ionescu, Anisoara},
journal={arXiv preprint arXiv:2402.06318},
year={2024}
}