VCI-Bayes-Explore packages the preprocessing and modelling workflow behind the Heart-Brain Connection Bayesian Network analysis led by Malin Overmars, PhD, within the Vascular Cognitive Impairment (VCI) research group of the UMC Utrecht.
It turns raw data into a preprocessed dataset and reproduces a clinically informed, layered Bayesian network (demographics → vascular risk → neuroimaging → function → outcomes) that learns dependencies among 566 participants of the Heart-Brain Connection study.
The pipeline quantifies conditional probabilities for the outcomes cognitive decline and major adverse cardiovascular events (MACE), benchmarks emerging biomarkers via mutual-information analyses, and supports patient-level inference while explicitly modelling the dropout effects observed in the cohort.
Use this repository to:
- Keep sensitive file locations outside version control while configuring project-wide paths;
- Run the preprocessing pipeline that labels, engineers, and imputes cohort variables;
- Learn, constrain, and visualise the Bayesian network inside a ready-to-run notebook;
- Inspect the generated parquet outputs, figures, and companion documentation.
The layout is designed so collaborators, reviewers, and future cohort expansions can retrace each analysis step while still allowing adaptations for new datasets.
For more information about the Heart-Brain Connection study, see https://hart-brein.nl/. This work is supported by the Dutch Heart Foundation.
The accompanying manuscript is currently in preparation 📄.
This repository is released under the MIT License.
If you use this code, please cite it via [10.5281/zenodo.17302710](https://doi.org/10.5281/zenodo.17302710). Thanks!
⚠️ Important: This code is tailored to the Heart-Brain Connection cohort and its definitions. If you apply it to a different dataset, review and adapt both the preprocessing (src/preprocess_data.py) and the Bayesian-network notebook (src/bayesian_network.ipynb) so the logic matches your cohort’s structure, coding, and requirements.
- Install Python 3.13+ (e.g. from python.org) and open a terminal in the project folder.
- Create a virtual environment (optional but recommended) and install the dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  pip install pandas numpy pyreadstat scikit-learn PyYAML pyagrum matplotlib scipy ipywidgets
  ```
- Tell the pipeline where your raw data live:

  ```bash
  cp config/data_paths.example.yml config/data_paths.yml
  ```

  Edit `config/data_paths.yml` and fill in:
  - `raw_dir`: folder with the SPSS `.sav` files
  - `codebook_path`: path to `HBC_CODEBOOK_LABELS.xlsx`
  - `output_dir`: leave as `src/out` to keep processed data inside the repo
- Preprocess the data:

  ```bash
  python src/preprocess_data.py
  ```

  You should see log messages ending with "Wrote df.parquet …". The processed files (`df.parquet`, `df_imp.parquet`, `bn_vars.parquet`) appear in `src/out/`.
- Open the analysis notebook: launch Jupyter and run `src/bayesian_network.ipynb`. The first configuration cell automatically reads the parquet files from `src/out/`. Click "Run All" to reproduce the figures. (The outputs can also be inspected directly; see the snippet below.)
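For a quick look at the processed outputs without opening the notebook, read them with pandas (paths follow the defaults above):

```python
# Quick inspection of the processed outputs (default locations).
import pandas as pd

df = pd.read_parquet("src/out/df.parquet")            # labelled, non-imputed dataset
df_imp = pd.read_parquet("src/out/df_imp.parquet")    # imputed dataset used for learning
bn_vars = pd.read_parquet("src/out/bn_vars.parquet")  # variable → layer metadata

print(df.shape, df_imp.shape)
print(bn_vars.head())
```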
That’s it—you now have the same dataset and model that produced the manuscript figures. Need more control? Jump to the sections below.
`config/data_paths.yml` keeps sensitive paths out of version control (the file is git-ignored). It accepts the following keys:
| Key | Description |
|---|---|
| `raw_dir` | Folder containing the raw SPSS exports (`df.sav`, `fu_2.sav`, etc.). |
| `codebook_path` | Absolute path to `HBC_CODEBOOK_LABELS.xlsx`. |
| `output_dir` | Where processed parquet files are written. Default behaviour writes to `src/out`. |
| `risk_region` | SCORE2 region used for cardiovascular risk (defaults to `"Low"`). |
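These keys are consumed at the top of the preprocessing script. A minimal sketch, assuming plain PyYAML loading (the actual code in `src/preprocess_data.py` may structure this differently):

```python
# Minimal sketch of loading config/data_paths.yml with PyYAML.
# The real loading logic lives in src/preprocess_data.py and may differ.
from pathlib import Path
import yaml

with open("config/data_paths.yml") as fh:
    cfg = yaml.safe_load(fh)

raw_dir = Path(cfg["raw_dir"])                       # folder with the SPSS .sav exports
codebook_path = Path(cfg["codebook_path"])           # HBC_CODEBOOK_LABELS.xlsx
output_dir = Path(cfg.get("output_dir", "src/out"))  # parquet destination
risk_region = cfg.get("risk_region", "Low")          # SCORE2 region
```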
All paths may be relative to the repository root. Example:

```yaml
raw_dir: "/secure/location/hartbrein/raw"
codebook_path: "/secure/location/hartbrein/meta/HBC_CODEBOOK_LABELS.xlsx"
output_dir: "src/out"
risk_region: Low
```

Run the preprocessing from the repository root:

```bash
python src/preprocess_data.py
```

The script:
- reads the raw SPSS tables and codebook,
- applies the SPSS value labels (Dutch → English),
- constructs the outcome variables (`OUTCOME_MACE`, `OUTCOME_CDR_INCREASE`),
- computes the SCORE2 cardiovascular risk score,
- imputes missing numeric values with `IterativeImputer` (see the sketch after this list),
- writes `df.parquet`, `df_imp.parquet`, and `bn_vars.parquet` to `src/out/`.
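The imputation step follows a standard scikit-learn pattern. A minimal sketch (the column handling is illustrative; the authoritative version lives in `src/preprocess_data.py`):

```python
# Sketch of the numeric imputation step; column handling is illustrative.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_parquet("src/out/df.parquet")
numeric_cols = df.select_dtypes(include="number").columns

imputer = IterativeImputer(random_state=0)
df_imp = df.copy()
df_imp[numeric_cols] = imputer.fit_transform(df[numeric_cols])
df_imp.to_parquet("src/out/df_imp.parquet")
```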
- Start Jupyter (or VS Code, or JupyterLab) in the repository.
- Open `src/bayesian_network.ipynb`.
- Execute the cells in order. The first cell auto-detects the processed data under `src/out/` and loads:
  - `df.parquet`: the labelled, non-imputed dataset (categorical labels preserved).
  - `df_imp.parquet`: the imputed dataset used for learning.
  - `bn_vars.parquet`: metadata linking each variable to its expert-defined layer.
- Subsequent sections:
  - Discretisation: uses pyAgrum's `DiscreteTypeProcessor` with quantile-based binning.
  - Structure learning: enforces the layer constraints and adds explicit arcs from the outcome nodes to the dropout layer (a sketch follows this list).
  - Inference & visualisation: produces network and posterior plots and CPT displays.
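To make the layer constraints concrete, here is a hedged sketch of layer-aware structure learning with pyAgrum's `BNLearner`. The variable names, layer assignments, and the `DROPOUT` node are hypothetical placeholders; the notebook derives the real ones from `bn_vars.parquet` and discretises continuous columns first:

```python
# Hypothetical sketch: layer-constrained structure learning with pyAgrum.
# Variable names and layers are placeholders; the notebook builds these
# from bn_vars.parquet after discretising continuous variables.
import pandas as pd
import pyagrum as gum
import pyagrum.lib.notebook as gnb

df_imp = pd.read_parquet("src/out/df_imp.parquet")  # must be discrete/categorical

learner = gum.BNLearner(df_imp)
# Tiers mirror the expert-defined layers: arcs may only point from
# earlier tiers to later ones.
learner.setSliceOrder([
    ["AGE", "SEX"],                            # demographics
    ["SCORE2"],                                # vascular risk
    ["WMH_VOLUME"],                            # neuroimaging
    ["CDR_BASELINE"],                          # function
    ["OUTCOME_MACE", "OUTCOME_CDR_INCREASE"],  # outcomes
])
# Explicit arcs from the outcome nodes into the dropout layer.
learner.addMandatoryArc("OUTCOME_MACE", "DROPOUT")
learner.addMandatoryArc("OUTCOME_CDR_INCREASE", "DROPOUT")
learner.useGreedyHillClimbing()
bn = learner.learnBN()

# Posterior inference given evidence (notebook display helper).
gnb.showInference(bn, evs={"SEX": "Female"})
```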
```text
├── README.md                  Project guide (this file)
├── logo-vci-bayes.png         Banner used in the README
├── config/
│   ├── data_paths.example.yml Template pointing to raw data
│   └── config.yaml            Extra notebook settings (optional)
├── src/
│   ├── preprocess_data.py     End-to-end data preparation script
│   ├── bayesian_network.ipynb Main analysis and figures
│   └── out/                   Default location for processed parquet files
└── docs/, graphs/, cache/, …  Supporting material
```
- Value labels: SPSS value labels are applied before any logic runs; Dutch strings such as `"Ja, Herseninfarct"` become `"Yes, ischemic stroke"` (a reading sketch follows this list).
- Outcome definitions: `OUTCOME_MACE` is "Yes" if either T2 or T4 indicates a stroke/cardiac event or the recorded cause of death mentions key terms (myocardial infarction, cerebral hemorrhage, etc.). `OUTCOME_CDR_INCREASE` is "Yes" if the CDR score increases at T2 or T4 or the participant leaves follow-up with the reason "Moved to Nursing Home". Dropouts without recorded events are labelled "Unobserved".
- Layer metadata: `bn_vars.parquet` strips whitespace and normalises the layer names (for consistent colouring in the notebook plots).
- Risk score: SCORE2 is calculated via a Python translation of the `RiskScorescvd::SCORE2` function.
- Imputation: numeric features use `IterativeImputer` (scikit-learn); categorical variables retain the translated labels.
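For orientation, applying SPSS value labels on read is a one-flag operation in pyreadstat. The sketch below assumes a direct read of `df.sav`, and the Dutch → English mapping shown is a single illustrative entry (the full mapping lives in `src/preprocess_data.py`):

```python
# Sketch: read an SPSS export with value labels applied, then translate.
# The translation dict is illustrative; the full mapping is in
# src/preprocess_data.py.
import pyreadstat

df, meta = pyreadstat.read_sav(
    "df.sav",
    apply_value_formats=True,  # replace numeric codes with SPSS value labels
)

translations = {"Ja, Herseninfarct": "Yes, ischemic stroke"}
df = df.replace(translations)
```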
- Python ≥ 3.12
- pandas, numpy
- pyreadstat
- scikit-learn
- PyYAML
- matplotlib
- PyAgrum (including `pyagrum.skbn`, `pyagrum.lib.notebook`, etc.)
- SciPy
- ipywidgets
Install manually (`pip install …`) or via a requirements file if you maintain one; an example is sketched below.
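If you do keep a requirements file, a minimal unpinned example matching the list above (pin versions as needed for reproducibility) might look like:

```text
# Example requirements.txt; versions unpinned, pin as needed.
pandas
numpy
pyreadstat
scikit-learn
PyYAML
matplotlib
pyagrum
scipy
ipywidgets
```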
