The official implementation of the paper "GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks." If you found our work interesting and plan to re-use the code, please cite us as:
```bibtex
@article{sergazinov2023glucobench,
  author  = {Renat Sergazinov and Valeriya Rogovchenko and Elizabeth Chun and Nathaniel Fernandes and Irina Gaynanova},
  title   = {GlucoBench: Curated List of Continuous Glucose Monitoring Datasets with Prediction Benchmarks},
  journal = {arXiv},
  year    = {2023},
}
```
We recommend setting up a clean Python environment with conda by running `conda create -n glucobench python=3.10`. Then, install all dependencies by running `pip install -r requirements.txt`. To run the Latent ODE model, additionally install `torchdiffeq` (e.g. `pip install torchdiffeq`).
The code is organized as follows:

- `bin/`: training commands for all models
- `config/`: configuration files for all datasets
- `data_formatter/`
  - `base.py`: performs all pre-processing for all CGM datasets
- `exploratory_analysis/`: notebooks with processing steps for pulling the data and converting it to `.csv` files
- `lib/`
  - `gluformer/`: model implementation
  - `latent_ode/`: model implementation
  - `*.py`: hyper-parameter tuning, training, validation, and testing scripts
- `output/`: hyper-parameter optimization and testing logs
- `paper_results/`: code for producing the tables and plots found in the paper
- `utils/`: helper functions for model training and testing
- `raw_data.zip`: web-pulled CGM data (processed using `exploratory_analysis`)
- `environment.yml`: conda environment file
The datasets are distributed under the following licenses and can be downloaded from the links in the table below.
| Dataset | License | Number of patients | CGM frequency |
|---|---|---|---|
| Colas | Creative Commons 4.0 | 208 | 5 minutes |
| Dubosson | Creative Commons 4.0 | 9 | 5 minutes |
| Hall | Creative Commons 4.0 | 57 | 5 minutes |
| Broll | GPL-2 | 5 | 5 minutes |
| Weinstock | Creative Commons 4.0 | 200 | 5 minutes |
| Tamborlane | | 450 | 5 minutes |
To process the data, follow the instructions in the `exploratory_analysis/` folder. Processed datasets should be saved in the `raw_data/` folder. We provide examples in the `raw_data.zip` file.
We recommend setting up a clean Python environment using conda. Follow these steps:

1. Create a new environment named `glucobench` with Python 3.10:

   ```bash
   conda create -n glucobench python=3.10
   ```

2. Activate the environment:

   ```bash
   conda activate glucobench
   ```

3. Install all required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. (Optional) To confirm that you're installing in the correct environment, run:

   ```bash
   which pip
   ```

   This should display the path to the `pip` executable within the `glucobench` environment.
The `config/` folder stores the best hyper-parameters (selected by Optuna) for each dataset and model, as well as the dataset-specific parameters for interpolation, dropping, splitting, and scaling. To train and evaluate the models with these defaults, we can simply run:

```bash
python ./lib/model.py --dataset dataset --use_covs False --optuna False
```
To change the search grid for hyper-parameters, we need to modify the `./lib/model.py` file. Specifically, we look at the `objective()` function and modify the `trial.suggest_*` parameters to set the desired ranges. Once we are done, we can run the following command to re-run the hyper-parameter optimization:

```bash
python ./lib/model.py --dataset dataset --use_covs False --optuna True
```
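For reference, the search space is defined with standard Optuna calls inside `objective()`. A minimal sketch of what such a function might look like (the parameter names, ranges, and the `train_and_validate` helper below are illustrative, not the ones used in the repository):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; the real objective() in ./lib/model.py
    # defines its own parameters and ranges.
    in_len = trial.suggest_int('in_len', 48, 192, step=12)
    hidden_size = trial.suggest_categorical('hidden_size', [64, 128, 256])
    lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)

    # train_and_validate is a hypothetical helper that trains a model with
    # these hyper-parameters and returns the validation loss.
    return train_and_validate(in_len, hidden_size, lr)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
```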
We provide a detailed example of the full workflow in the `example.ipynb` notebook. Below, we offer some general suggestions in order of increasing complexity.
To start experimenting with the data, we can run the following:

```python
import yaml

from data_formatter.base import DataFormatter

dataset = 'weinstock'  # any dataset with a config file in ./config/ (illustrative choice)

with open(f'./config/{dataset}.yaml', 'r') as f:
    config = yaml.safe_load(f)

formatter = DataFormatter(config)
```
This creates an object of class `DataFormatter`, which automatically pre-processes the data upon initialization. The pre-processing steps can be controlled via the `config/` files. The `DataFormatter` object exposes the following attributes (a short usage sketch follows the list):
- `formatter.train_data`: training data (as a `pandas.DataFrame`)
- `formatter.val_data`: validation data
- `formatter.test_data`: testing (in-distribution and out-of-distribution) data
  1. `formatter.test_data.loc[~formatter.test_data.index.isin(formatter.test_idx_ood)]`: in-distribution testing data
  2. `formatter.test_data.loc[formatter.test_data.index.isin(formatter.test_idx_ood)]`: out-of-distribution testing data
- `formatter.data`: unscaled full data
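Continuing from the snippet above, a minimal sketch of how these attributes can be used to inspect the splits:

```python
# sizes of the pre-processed splits
print(formatter.train_data.shape, formatter.val_data.shape, formatter.test_data.shape)

# separate the test set into in-distribution and out-of-distribution segments
test = formatter.test_data
test_id = test.loc[~test.index.isin(formatter.test_idx_ood)]
test_ood = test.loc[test.index.isin(formatter.test_idx_ood)]
print(f'in-distribution: {len(test_id)} rows, out-of-distribution: {len(test_ood)} rows')
```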
Training models with PyTorch typically boils down to (1) defining a `Dataset` class with a `__getitem__()` method, (2) wrapping it into a `DataLoader`, (3) defining a `torch.nn.Module` class with a `forward()` method that implements the model, and (4) optimizing the model with `torch.optim` in a training loop.
Parts (1) and (2) crucially depend on the definition of the `Dataset` class. Essentially, given the data in table format (e.g. `formatter.train_data`), how do we sample input-output pairs and pass the covariate information? The `Dataset` classes adapted from the `Darts` library (see here) offer one way to wrap the data, and they differ in what information is provided to the model:

- `SamplingDatasetPast`: supports only past covariates
- `SamplingDatasetDual`: supports only future-known covariates
- `SamplingDatasetMixed`: supports both past and future-known covariates
Below we give an example of loading the data and wrapping it into a `Dataset`:

```python
from utils.darts_processing import load_data
from utils.darts_dataset import SamplingDatasetDual

# illustrative window lengths and sampling cap; adjust as needed
in_len, out_len, max_samples_per_ts = 96, 12, 100

formatter, series, scalers = load_data(seed=0,
                                       dataset=dataset,
                                       use_covs=True,
                                       cov_type='dual',
                                       use_static_covs=True)

dataset_train = SamplingDatasetDual(series['train']['target'],
                                    series['train']['future'],
                                    output_chunk_length=out_len,
                                    input_chunk_length=in_len,
                                    use_static_covariates=True,
                                    max_samples_per_ts=max_samples_per_ts)
```
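Since these classes implement `__getitem__()`, individual samples can be inspected directly, which is a quick way to sanity-check the input and output window lengths (a minimal sketch; the exact tuple layout of each sample is defined in `utils/darts_dataset.py`):

```python
# number of input-output pairs produced from the training series
print(len(dataset_train), 'training samples')

# each sample is assumed to be a tuple of numpy arrays (and possibly None
# entries); print the shape of every component of the first sample
sample = dataset_train[0]
for i, part in enumerate(sample):
    print(i, None if part is None else part.shape)
```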
Parts (3) and (4) are model-specific, so we omit their discussion. For inspiration, we suggest taking a look at the `lib/gluformer/model.py` and `lib/latent_ode/trainer_glunet.py` files.
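As a rough orientation for parts (2)-(4), a bare-bones skeleton in generic PyTorch might look like the following. This is only a sketch, not the repository's trainer: it assumes each dataset item collates into a tuple of tensors whose first element is the model input and whose last element is the prediction target.

```python
import torch
from torch.utils.data import DataLoader

def train(model: torch.nn.Module, dataset_train, num_epochs: int = 10) -> None:
    loader = DataLoader(dataset_train, batch_size=32, shuffle=True)  # part (2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)        # part (4)
    loss_fn = torch.nn.MSELoss()

    model.train()
    for epoch in range(num_epochs):
        for batch in loader:
            # assumption: input is the first element, target the last
            inputs, target = batch[0].float(), batch[-1].float()
            optimizer.zero_grad()
            prediction = model(inputs)                                # part (3)
            loss = loss_fn(prediction, target)
            loss.backward()
            optimizer.step()
```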