VCI-Bayes-Explore packages the preprocessing and modelling workflow behind the Heart-Brain Connection Bayesian Network analysis led by Malin Overmars, PhD, within the Vascular Cognitive Impairment (VCI) research group of the UMC Utrecht.
It turns raw data into a preprocessed dataset and reproduces a clinically informed, layered Bayesian network (demographics → vascular risk → neuroimaging → function → outcomes) that learns dependencies among 566 participants of the Heart-Brain Connection study.
The pipeline quantifies conditional probabilities for the outcomes cognitive decline and major adverse cardiovascular events (MACE), benchmarks emerging biomarkers via mutual-information analyses, and supports patient-level inference while explicitly modelling the dropout effects observed in the cohort.
Use this repository to:
- Keep sensitive file locations outside version control while configuring project-wide paths;
- Run the preprocessing pipeline that labels, engineers, and imputes cohort variables;
- Learn, constrain, and visualise the Bayesian network inside a ready-to-run notebook;
- Inspect the generated parquet outputs, figures, and companion documentation.
The layout is designed so collaborators, reviewers, and future cohort expansions can retrace each analysis step while still allowing adaptations for new datasets.
For more information about the Heart-Brain Connection study, see https://hart-brein.nl/. This work is supported by the Dutch Heart Foundation.
The accompanying manuscript is currently in preparation 📄.
This repository is released under the MIT License.
If you use this code, please cite it via [10.5281/zenodo.17302710](https://doi.org/10.5281/zenodo.17302710). Thanks!
⚠️ Important: This code is tailored to the Heart-Brain Connection cohort and its definitions. If you apply it to a different dataset, review and adapt both the preprocessing (src/preprocess_data.py) and the Bayesian-network notebook (src/bayesian_network.ipynb) so the logic matches your cohort’s structure, coding, and requirements.
- Install Python 3.13+ (e.g. from python.org) and open a terminal in the project folder.
- Create a virtual environment (optional but recommended) and install the dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  pip install pandas numpy pyreadstat scikit-learn PyYAML pyagrum matplotlib scipy ipywidgets
  ```
- Tell the pipeline where your raw data live:

  ```bash
  cp config/data_paths.example.yml config/data_paths.yml
  ```

  Edit `config/data_paths.yml` and fill in:
  - `raw_dir`: folder with the SPSS `.sav` files
  - `codebook_path`: path to `HBC_CODEBOOK_LABELS.xlsx`
  - `output_dir`: leave as `src/out` to keep processed data inside the repo
- Preprocess the data:

  ```bash
  python src/preprocess_data.py
  ```

  You should see log messages ending with "Wrote df.parquet …". The processed files (`df.parquet`, `df_imp.parquet`, `bn_vars.parquet`) appear in `src/out/`.
- Open the analysis notebook: launch Jupyter and run `src/bayesian_network.ipynb`. The first configuration cell automatically reads the parquet files from `src/out/`. Click "Run All" to reproduce the figures. (The outputs can also be inspected directly; see the snippet below.)
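For a quick look at the processed outputs without opening the notebook, read them with pandas (paths follow the defaults above):

```python
# Quick inspection of the processed outputs (default locations).
import pandas as pd

df = pd.read_parquet("src/out/df.parquet")            # labelled, non-imputed dataset
df_imp = pd.read_parquet("src/out/df_imp.parquet")    # imputed dataset used for learning
bn_vars = pd.read_parquet("src/out/bn_vars.parquet")  # variable → layer metadata

print(df.shape, df_imp.shape)
print(bn_vars.head())
```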
That’s it—you now have the same dataset and model that produced the manuscript figures. Need more control? Jump to the sections below.
`config/data_paths.yml` keeps sensitive paths out of version control (the file is git-ignored). It accepts the following keys:
| Key | Description |
|---|---|
| `raw_dir` | Folder containing the raw SPSS exports (`df.sav`, `fu_2.sav`, etc.). |
| `codebook_path` | Absolute path to `HBC_CODEBOOK_LABELS.xlsx`. |
| `output_dir` | Where processed parquet files are written. Default behaviour writes to `src/out`. |
| `risk_region` | SCORE2 region used for cardiovascular risk (defaults to `"Low"`). |
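These keys are consumed at the top of the preprocessing script. A minimal sketch, assuming plain PyYAML loading (the actual code in `src/preprocess_data.py` may structure this differently):

```python
# Minimal sketch of loading config/data_paths.yml with PyYAML.
# The real loading logic lives in src/preprocess_data.py and may differ.
from pathlib import Path
import yaml

with open("config/data_paths.yml") as fh:
    cfg = yaml.safe_load(fh)

raw_dir = Path(cfg["raw_dir"])                       # folder with the SPSS .sav exports
codebook_path = Path(cfg["codebook_path"])           # HBC_CODEBOOK_LABELS.xlsx
output_dir = Path(cfg.get("output_dir", "src/out"))  # parquet destination
risk_region = cfg.get("risk_region", "Low")          # SCORE2 region
```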
All paths may be relative to the repository root. Example:

```yaml
raw_dir: "/secure/location/hartbrein/raw"
codebook_path: "/secure/location/hartbrein/meta/HBC_CODEBOOK_LABELS.xlsx"
output_dir: "src/out"
risk_region: Low
```

Run the preprocessing from the repository root:

```bash
python src/preprocess_data.py
```

The script:
- reads the raw SPSS tables and codebook,
- applies the SPSS value labels (Dutch → English),
- constructs the outcome variables (`OUTCOME_MACE`, `OUTCOME_CDR_INCREASE`),
- computes the SCORE2 cardiovascular risk score,
- imputes missing numeric values with `IterativeImputer` (see the sketch after this list),
- writes `df.parquet`, `df_imp.parquet`, and `bn_vars.parquet` to `src/out/`.
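The imputation step follows a standard scikit-learn pattern. A minimal sketch (the column handling is illustrative; the authoritative version lives in `src/preprocess_data.py`):

```python
# Sketch of the numeric imputation step; column handling is illustrative.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_parquet("src/out/df.parquet")
numeric_cols = df.select_dtypes(include="number").columns

imputer = IterativeImputer(random_state=0)
df_imp = df.copy()
df_imp[numeric_cols] = imputer.fit_transform(df[numeric_cols])
df_imp.to_parquet("src/out/df_imp.parquet")
```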
- Start Jupyter (or VS Code, or JupyterLab) in the repository.
- Open `src/bayesian_network.ipynb`.
- Execute the cells in order. The first cell auto-detects the processed data under `src/out/` and loads:
  - `df.parquet`: the labelled, non-imputed dataset (categorical labels preserved).
  - `df_imp.parquet`: the imputed dataset used for learning.
  - `bn_vars.parquet`: metadata linking each variable to its expert-defined layer.
- Subsequent sections:
  - Discretisation: uses pyAgrum's `DiscreteTypeProcessor` with quantile-based binning.
  - Structure learning: enforces the layer constraints and adds explicit arcs from the outcome nodes to the dropout layer (a sketch follows this list).
  - Inference & visualisation: produces network and posterior plots and CPT displays.
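To make the layer constraints concrete, here is a hedged sketch of layer-aware structure learning with pyAgrum's `BNLearner`. The variable names, layer assignments, and the `DROPOUT` node are hypothetical placeholders; the notebook derives the real ones from `bn_vars.parquet` and discretises continuous columns first:

```python
# Hypothetical sketch: layer-constrained structure learning with pyAgrum.
# Variable names and layers are placeholders; the notebook builds these
# from bn_vars.parquet after discretising continuous variables.
import pandas as pd
import pyagrum as gum
import pyagrum.lib.notebook as gnb

df_imp = pd.read_parquet("src/out/df_imp.parquet")  # must be discrete/categorical

learner = gum.BNLearner(df_imp)
# Tiers mirror the expert-defined layers: arcs may only point from
# earlier tiers to later ones.
learner.setSliceOrder([
    ["AGE", "SEX"],                            # demographics
    ["SCORE2"],                                # vascular risk
    ["WMH_VOLUME"],                            # neuroimaging
    ["CDR_BASELINE"],                          # function
    ["OUTCOME_MACE", "OUTCOME_CDR_INCREASE"],  # outcomes
])
# Explicit arcs from the outcome nodes into the dropout layer.
learner.addMandatoryArc("OUTCOME_MACE", "DROPOUT")
learner.addMandatoryArc("OUTCOME_CDR_INCREASE", "DROPOUT")
learner.useGreedyHillClimbing()
bn = learner.learnBN()

# Posterior inference given evidence (notebook display helper).
gnb.showInference(bn, evs={"SEX": "Female"})
```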
```text
├── README.md                  Project guide (this file)
├── logo-vci-bayes.png         Banner used in the README
├── config/
│   ├── data_paths.example.yml Template pointing to raw data
│   └── config.yaml            Extra notebook settings (optional)
├── src/
│   ├── preprocess_data.py     End-to-end data preparation script
│   ├── bayesian_network.ipynb Main analysis and figures
│   └── out/                   Default location for processed parquet files
└── docs/, graphs/, cache/, …  Supporting material
```
- Value labels: SPSS value labels are applied before any logic runs; Dutch strings such as `"Ja, Herseninfarct"` become `"Yes, ischemic stroke"` (a reading sketch follows this list).
- Outcome definitions: `OUTCOME_MACE` is "Yes" if either T2 or T4 indicates a stroke/cardiac event or the recorded cause of death mentions key terms (myocardial infarction, cerebral hemorrhage, etc.). `OUTCOME_CDR_INCREASE` is "Yes" if the CDR score increases at T2 or T4 or the participant leaves follow-up with the reason "Moved to Nursing Home". Dropouts without recorded events are labelled "Unobserved".
- Layer metadata: `bn_vars.parquet` strips whitespace and normalises the layer names (for consistent colouring in the notebook plots).
- Risk score: SCORE2 is calculated via a Python translation of the `RiskScorescvd::SCORE2` function.
- Imputation: numeric features use `IterativeImputer` (scikit-learn); categorical variables retain the translated labels.
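For orientation, applying SPSS value labels on read is a one-flag operation in pyreadstat. The sketch below assumes a direct read of `df.sav`, and the Dutch → English mapping shown is a single illustrative entry (the full mapping lives in `src/preprocess_data.py`):

```python
# Sketch: read an SPSS export with value labels applied, then translate.
# The translation dict is illustrative; the full mapping is in
# src/preprocess_data.py.
import pyreadstat

df, meta = pyreadstat.read_sav(
    "df.sav",
    apply_value_formats=True,  # replace numeric codes with SPSS value labels
)

translations = {"Ja, Herseninfarct": "Yes, ischemic stroke"}
df = df.replace(translations)
```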
- Python ≥ 3.12
- pandas, numpy
- pyreadstat
- scikit-learn
- PyYAML
- matplotlib
- PyAgrum (including `pyagrum.skbn`, `pyagrum.lib.notebook`, etc.)
- SciPy
- ipywidgets
Install manually (`pip install …`) or via a requirements file if you maintain one; an example is sketched below.
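If you do keep a requirements file, a minimal unpinned example matching the list above (pin versions as needed for reproducibility) might look like:

```text
# Example requirements.txt; versions unpinned, pin as needed.
pandas
numpy
pyreadstat
scikit-learn
PyYAML
matplotlib
pyagrum
scipy
ipywidgets
```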
