Comparing artificial intelligence strategies for early sepsis detection in the ICU: an experimental study
This repository contains the code used in the paper of the same title. To cite the article, please use the following BibTeX entry:
@article{solis2023comparing,
title={Comparing artificial intelligence strategies for early sepsis detection in the ICU: an experimental study},
author={Sol{\'\i}s-Garc{\'\i}a, Javier and Vega-M{\'a}rquez, Bel{\'e}n and Nepomuceno, Juan A and Riquelme-Santos, Jos{\'e} C and Nepomuceno-Chamorro, Isabel A},
journal={Applied Intelligence},
pages={1--15},
year={2023},
publisher={Springer}
}
- Abstract
- Authors
- Prerequisites
- How to download and generate the data
- What does the repository do?
- How to define a parameter optimization and an experiment?
- How to launch a parameter optimization and an experiment?
- How does the code of the repository work?
- License
Sepsis is a life-threatening condition whose early recognition is key to improving outcomes for patients in intensive care units (ICUs). Artificial intelligence can play a crucial role in mining and exploiting health data for sepsis prediction. However, progress in this field has been impeded by a lack of comparability across studies. Some studies do not provide code, and each study independently processes a dataset with large numbers of missing values. Here, we present a comparative analysis of early sepsis prediction in the ICU by using machine learning (ML) algorithms and provide open-source code to the community to support future work. We reviewed the literature and conducted two phases of experiments. In the first phase, we analyzed five imputation strategies for handling missing data in a clinical dataset (which is often sampled irregularly and requires hand-crafted preprocessing steps). We used the MIMIC-III dataset, which includes more than 5,800 ICU hospital admissions from 2001 to 2012. In the second phase, we conducted an extensive experimental study using five ML methods and five popular deep learning models. We evaluated the performance of the methods by using the area under the precision-recall curve, a standard metric for clinical contexts. The deep learning methods (TCN and LSTM) outperformed the other methods, particularly in early detection tasks more than 4 hours before sepsis onset. The motivation for this work was to provide a benchmark framework for future research, thus enabling advancements in this field.
- Javier Solís-García
- Belén Vega-Márquez
- Juan Nepomuceno
- José C. Riquelme-Santos
- Isabel A. Nepomuceno-Chamorro
This repository has been tested with the following requirements; however, it may run with different versions of the listed software:
- Ubuntu 20.04 or 22.04
- Docker version 20.10.18
- docker-compose version 1.29.2
- NVIDIA Container Toolkit
The repository can be cloned with the command: git clone https://github.com/javiersgjavi/sepsis-review.git
This dataset is only used in the Appendix of the article. It is available from the PhysioNet Challenge website. However, in this repository we have downloaded the data available on Kaggle, which already mixes the data from the two different hospitals.
- Make sure you have installed the Kaggle API with the command: pip install kaggle
- Give execution permissions to the file download_data.sh with the command: chmod +x download_data.sh
- Execute the file download_data.sh with the command: ./download_data.sh
This is the dataset mainly used in the article.
INFO: If you are only interested in obtaining the data, I recommend visiting this other repository.
WARNING: This guide has been made for Ubuntu 20.04; it should be similar for other Linux versions, but may differ for a different operating system.
Download data from PhysioNet and create a PostgreSQL DB with the MIT-LCP/mimic-code repository
- Request access to the MIMIC-III data in PhysioNet.
- Go to the folder build_mimic_data with the command: cd build_mimic_data
- Add your PhysioNet user with access privileges to MIMIC-III in line 1 of the file download_physionet_data.sh, where it says <User_of_physionet>.
- Give execution permissions to all the .sh files with the following command: chmod +x *.sh
- Execute the following command: sudo ./download_physionet_data.sh
  The process can be long and will require the following:
  - Enter the password of your PhysioNet account to download the data.
  - You will see the log of the PostgreSQL database being loaded. You must wait until you see a table with the tables of the DB where all rows return "PASSED", and the message "Done!" is displayed below, before going on to the next steps.
  - Once the process is finished, press Ctrl + C to stop displaying the log.
Create CSV files from the PostgreSQL DB with the BorgwardtLab/mgp-tcn repository
- Execute the following command: ./preprocess_data.sh
- Give execution permissions to the main.sh file with the command: chmod +x src/query/main.sh
- Execute the command: make query
  This process will take longer than the one in step 6.
Create the final data with the BorgwardtLab/mgp-tcn repository
- Execute the command: make generate_data to generate the final data that the repository will use.
- Exit from the container with the command: exit
- Execute the command: ./create_data_folder.sh
This repository focuses on early prediction of sepsis onset. To achieve this task, the repository is divided into two main sections:
- Parameter Optimization: In this part of the code, the goal is to find, for each model tested, the best combination of parameters and imputation method to classify patients who will have a sepsis onset, using data from the 49 hours prior to onset. In addition, this phase will create the data needed to make an experimental comparison between different imputation techniques and models.
- Experiment: Once the parameter optimization is done, an experiment will be performed with the best configuration for each model, in which different time horizons before sepsis onset will be tested to check the performance of each model (see the sketch after this list).
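As a rough illustration of the experiment phase, the following Python sketch shows the idea of testing every horizon up to hours_before_onset. It is not the repository's actual code: evaluate_at_horizon and the model list are placeholders.

```python
# Conceptual sketch of the experiment phase (not the repository's real code):
# each model, already configured with the best parameters and imputation
# method found during the optimization, is evaluated at every horizon from
# 1 hour up to hours_before_onset hours before sepsis onset.
import random

def evaluate_at_horizon(model_name: str, horizon: int) -> float:
    """Placeholder that stands in for training/evaluating a model at a given horizon."""
    return random.random()  # stands in for the real AUPRC score

hours_before_onset = 10  # value defined in main.py

for model_name in ['TCN', 'LSTM']:  # placeholder model list
    for horizon in range(1, hours_before_onset + 1):
        auprc = evaluate_at_horizon(model_name, horizon)
        print(f'{model_name}, {horizon}h before onset: AUPRC={auprc:.3f}')
```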
The characteristics of the parameter optimization and the experiments are defined in the file main.py. These are some important variables:
- name: defines the name of the experiment and the name of the folder with the results.
- iterations_sampler: defines the maximum number of different hyperparameter configurations that are tested for each model. It is used to reduce the execution time by limiting the total number of models tested. For more exhaustive experimentation, this number must be increased.
- models: there is one variable for each task, which defines the models that are going to be used. The variable is a dict with all the models implemented for the task; to enable a model, set 1 as the value of its key, and to deactivate it, set 0.
- imputation_methods: similar to the previous variables, it is a dict with all available imputation methods; set 1 to activate an imputation method, or 0 to deactivate it.
- data: defines the name of the data that is going to be used.
On the other hand, the main variable of the experiment is:
- hours_before_onset: During the experiment, models will be tested with different horizons, ranging from 1 hour before sepsis onset to the number of hours before onset defined in this variable.
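For illustration, the variables above could be set in main.py roughly as follows. This is only a sketch with made-up values and placeholder model and imputation-method names; check main.py for the real keys and defaults.

```python
# Illustrative sketch of the configuration variables described above
# (placeholder names and values, not the exact contents of main.py).

name = 'sepsis_benchmark'    # name of the experiment and of the results folder
iterations_sampler = 25      # max hyperparameter configurations tested per model

# 1 enables a model, 0 disables it (placeholder model names)
models = {
    'LSTM': 1,
    'TCN': 1,
    'RandomForest': 0,
}

# 1 enables an imputation method, 0 disables it (placeholder method names)
imputation_methods = {
    'carry_forward': 1,
    'zero': 0,
}

data = 'mimic-iii'           # name of the data to use
hours_before_onset = 10      # largest horizon tested during the experiment
```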
- It is important to give execution permissions to the start.sh file, if it does not have them, with the command: chmod +x start.sh
- Start and enter the docker container with the command: ./start.sh
- Launch the experiment with the command: python main.py
- If you want to run it in the background, use the command: nohup python main.py &
- main.py: it is the script that defines and launches a parameter optimization and an experiment.
- src/utils/generate_reports.py: it is a script that generates all tables and images with the summary of the parameter optimization and experiment results.
- src/utils/preprocess_data.py: it is a script that contains all the functions that preprocess the data, including the different imputation methods. It is probably the messiest and most difficult file to understand in this repository; apologies for its untidy state, but the management of the input data was very chaotic due to its format.
- src/classes/ParameterOptimization.py: it is one of the main classes of this repository. It prepares the data, executes the models, and saves the results.
- src/classes/Experiment.py: it is the other main class in this repository. It loads each model with the best parameters and the best imputation method found during the optimization, and tests each model with different horizons, ranging from one hour before onset to as many hours before onset as defined in main.py.
- src/classes/Data.py: this file contains classes related to the management of the data, the normalization, and the hyperparameter values for the models.
- src/classes/DL.py: this file contains the Deep Learning models implemented in this project, all of which extend the DL class. Adding a new model is very modular: it only needs to extend the DL class, similar to the existing ones (see the sketch after this list).
- src/classes/ML.py: this file contains the Machine Learning models implemented in this project, all of which extend the ML class. Adding a new model is also modular: it only needs to extend the ML class, similar to the existing ones.
- src/classes/Metrics.py: this file contains the metrics implemented in this project. To add a new metric, it must be added as a method in the MetricCalculator class.
- docker/: this folder contains the files to create the docker image and the docker container.
- parameters.json: this file contains the possible values for the parameters of each model.
- data: this folder will be generated during the execution of the repository and will contain the original data and the imputed data to reduce the execution time.
- results: this folder will also be generated during the execution of the repository. It will contain the tables, images, and predictions, as well as results.csv, which is the file that contains all the data of the parameter optimization and the experiment.
- build_mimic_data: this folder is used to download and generate the data that will be used. It contains clones of two repositories:
  - MIT-LCP/mimic-code repository: used to download the MIMIC-III data from PhysioNet.
  - BorgwardtLab/mgp-tcn: used to generate the data that will be used by the experiments.
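To give a flavour of the extension pattern mentioned for src/classes/DL.py and src/classes/ML.py, here is a self-contained sketch. The DL base class below is a simplified stand-in rather than the repository's real class, the build_model hook and GRUModel are hypothetical names, and the Keras-style model is only an assumption made for illustration.

```python
# Simplified stand-in illustrating how a new deep learning model could be
# added by extending a DL base class. This is NOT the repository's real
# DL class; its actual interface lives in src/classes/DL.py.
from tensorflow.keras import layers, models

class DL:
    """Stand-in base class: subclasses implement build_model()."""
    def __init__(self, input_shape):
        self.input_shape = input_shape
        self.model = self.build_model()

    def build_model(self):
        raise NotImplementedError

class GRUModel(DL):
    """Hypothetical new model: a single GRU layer with a sigmoid output."""
    def build_model(self):
        return models.Sequential([
            layers.Input(shape=self.input_shape),
            layers.GRU(64),
            layers.Dense(1, activation='sigmoid'),
        ])

# Usage: e.g. 49 hourly time steps with 40 features each
gru = GRUModel(input_shape=(49, 40))
gru.model.summary()
```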
This project is licensed under the BSD-3-Clause license - see the LICENSE file for details