This repo contains experiments for "Clustering Interval-Censored Data for Disease Phenotyping" (AAAI 2022). The code includes scripts to generate synthetic data for the experiments, as well as code for two clinical datasets: 1) Parkinson's disease data from the publicly available Parkinson's Progression Markers Initiative (PPMI) and 2) a heart failure dataset collected from Beth Israel Deaconess Medical Center.
Code is released for the purposes of transparency and replication. It has not been extensively cleaned --- nor was it designed to be run directly from this repo. Certain terms have been redacted for security concerns. If there are any questions, please contact Irene Chen at iychen@csail.mit.edu.
Code is written for Python 3.7.
Figure 1 consists of three drawings made in PowerPoint, which are not algorithmically generated.
Figure 2a is generated in two steps. For all commands, `data_num` corresponds to the synthetic data setting. For example, sigmoid data is `data_num=1`, whereas the quadratic data settings are integers larger than 1.
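For quick reference, the sketch below consolidates the `data_num` values documented in this README (the quadratic and missingness settings are described in later sections). The mapping only restates what is written here; the actual generators live in the repo's data-generation code.

```python
# Quick-reference map of the synthetic data settings documented in this README.
# It only restates the values listed here; the actual generator behind each
# integer is defined in the repo's data-generation scripts.
DATA_NUM_SETTINGS = {
    1: "sigmoid synthetic data (Figure 2a)",
    **{i: "quadratic synthetic data" for i in range(3, 9)},  # data_num 3-8
    11: "synthetic data, 50% missing (Tables 2-4)",
    12: "synthetic data, 25% missing (Tables 2-4)",
    13: "synthetic data, 0% missing (Tables 2-4)",
}

if __name__ == "__main__":
    for num, description in sorted(DATA_NUM_SETTINGS.items()):
        print(f"data_num={num}: {description}")
```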
First, the best hyperparameters for SubLign and the associated performance are found with `python model/hpsearch.py --data_num 1 --epochs 1000`.
Next, baselines are computed with the corresponding scripts in `baselines`. For example, for KMeans+Loss: `python kmeans.py --data_num 1 --epochs 1000 --trials 5`. A similar format follows for the other baselines, with the exception of PARAFAC2, which is implemented in `parafac.m` (see that file for execution instructions).
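For convenience, the two commands above can also be chained from a short Python driver. This is only a minimal sketch, not part of the released pipeline: it assumes it is run from the repo root and that `kmeans.py` sits inside the `baselines` directory, as described above.

```python
import subprocess

# Reproduce the Figure 2a pipeline for the sigmoid setting (data_num=1):
# SubLign hyperparameter search followed by the KMeans+Loss baseline.
# Commands and flags are taken verbatim from this README; the baselines/ path
# is an assumption about the repo layout.
commands = [
    ["python", "model/hpsearch.py", "--data_num", "1", "--epochs", "1000"],
    ["python", "baselines/kmeans.py", "--data_num", "1", "--epochs", "1000", "--trials", "5"],
]

for cmd in commands:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop if a step fails
```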
Figure 2b is generated similarly, with the dataset denoted by the `--ppmi` flag: `python model/hpsearch.py --ppmi --epochs 1000`.
Figure 3 is generated in `model/CHF_Experiment.ipynb`.
Model misspecification experiments can be found in `model/misspecification.py`. For example: `python misspecification.py --increasing`.
Tables 2-4 are generated similarly to Figure 2a with a different `data_num`. For data with 50% missing, use `data_num=11`; for data with 25% missing, use `data_num=12`; for data with 0% missing, use `data_num=13`.
Table 5 is manually created and does not need code to produce.
Similar to the sigmoid experiments, we find the best hyperparameters for SubLign and the associated performance: `python model/hpsearch.py --data_num 3 --epochs 1000`. The data number is an integer from 3 to 8 inclusive.
Baselines are computed with the corresponding scripts in `baselines`. For example, for KMeans+Loss: `python kmeans.py --data_num 3 --epochs 1000 --trials 5`.
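Because the quadratic experiments span several settings, a small driver loop can be convenient. The sketch below simply repeats the documented commands for `data_num` 3 through 8; as before, it assumes execution from the repo root and that the baseline script lives under `baselines/`.

```python
import subprocess

# Sweep the quadratic synthetic settings (data_num 3-8 inclusive), running the
# SubLign hyperparameter search and the KMeans+Loss baseline for each one,
# using the commands documented in this README.
for data_num in range(3, 9):
    for cmd in (
        ["python", "model/hpsearch.py", "--data_num", str(data_num), "--epochs", "1000"],
        ["python", "baselines/kmeans.py", "--data_num", str(data_num),
         "--epochs", "1000", "--trials", "5"],
    ):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```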
For PD clinical subtypes, we find the best hyperparameters with `python cross_validation/hpsearch.py --ppmi --epochs 1000` and then compute the corresponding subtypes with `model/ClinicalSubtypes.ipynb`.
We compute the KMeans+Loss subtypes for the HF dataset in `model/ClinicalSubtypes.ipynb`.
For HF clinical subtypes, we find the best hyperparameters with `python cross_validation/hpsearch.py --chf --epochs 1000` and then compute the corresponding subtypes with `model/ClinicalSubtypes.ipynb`.
For the HF semi-synthetic experiment, run `python model/run_chf_experiment.py --thresh 0.25 --chf`.