This project uses texture features extracted from hematoxylin and eosin (H&E) stained whole slide images (WSIs) to predict microsatellite instability (MSI).
Contact: Bahar Tercan btercan@systemsbiology.org
Reference: https://www.biorxiv.org/content/10.1101/2025.03.30.646246v1.full.pdf
The example below uses the pre-segmented, MSI-labelled tiles from the Kather dataset, available at DOI: 10.5281/zenodo.2530834 or at https://zenodo.org/records/2530835.
The XGBoost model was run on an NVIDIA A100 Tensor Core GPU machine with the CUDA toolkit enabled, through Google Colab.
- Clone repository.
git clone https://github.com/IlyaLab/TextureAnalysis.git
cd TextureAnalysis
- Install dependencies.
NOTE: xgboost changes greatly between versions, so xgboost==2.1.4 is the only version guaranteed to work with this example.
conda env create -f environment.yaml
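Once the environment is active, it can be worth confirming that the pinned xgboost version is the one actually installed. The following is a minimal sketch using only the Python standard library; the helper name `installed_version_matches` is hypothetical and not part of this repository.

```python
from importlib import metadata


def installed_version_matches(package: str, expected: str) -> bool:
    """Return True if `package` is installed at exactly version `expected`."""
    try:
        return metadata.version(package) == expected
    except metadata.PackageNotFoundError:
        # The package is not installed in the active environment.
        return False


if __name__ == "__main__":
    if not installed_version_matches("xgboost", "2.1.4"):
        print("Warning: xgboost 2.1.4 was not found in this environment.")
```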
- In a Jupyter notebook, set the paths to the tile directories, separated by label (already done in the Kather dataset).
MSI_dir = r"/path/to/TCGA/CRC/MSI/tiles"
MSS_dir = r"/path/to/TCGA/CRC/MSS/tiles"
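Before starting a long run, a quick sanity check that both directories exist and contain tiles may save time. The sketch below is illustrative only: `count_tiles` is a hypothetical helper, not part of this repository, and the set of image extensions is an assumption about how the tiles are stored.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the files actually in the dataset.
TILE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def count_tiles(tile_dir: str) -> int:
    """Count image files directly inside `tile_dir` (non-recursive)."""
    path = Path(tile_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"Tile directory not found: {tile_dir}")
    return sum(
        1
        for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in TILE_EXTENSIONS
    )
```

In the notebook, `count_tiles(MSI_dir)` and `count_tiles(MSS_dir)` should both return nonzero counts; a `FileNotFoundError` points to a mistyped path.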
- Run main with the following parameters to predict MSI in the TCGA-CRC (colorectal cancer) cohort. This produces a per-tile ROC curve, a bar graph of the model's feature importances, and a per-sample ROC curve.
main(MSI_dir, MSS_dir, sample_size=5000, n_jobs=-1, cohort_name='TCGA-CRC', validation=True, subsets=True, model='CRC', study='TCGA', ML='XGBoost')
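To illustrate how a per-sample ROC curve can differ from a per-tile one, the sketch below averages tile-level probabilities into one score per sample and computes the AUC from those scores. This is not the repository's code: mean aggregation is an assumption (the aggregation rule used by main is not described here), the function names are hypothetical, and the numbers are toy values invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_sample(sample_ids, tile_probs):
    """Average per-tile MSI probabilities into one score per sample."""
    grouped = defaultdict(list)
    for sample_id, prob in zip(sample_ids, tile_probs):
        grouped[sample_id].append(prob)
    return {sample_id: mean(probs) for sample_id, probs in grouped.items()}


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative sample.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, invented for illustration only.
sample_ids = ["S1", "S1", "S2", "S2", "S3"]
tile_probs = [0.9, 0.7, 0.2, 0.4, 0.6]
sample_labels = {"S1": 1, "S2": 0, "S3": 1}  # 1 = MSI, 0 = MSS

sample_scores = aggregate_by_sample(sample_ids, tile_probs)
ordered = sorted(sample_scores)
auc = roc_auc(
    [sample_labels[s] for s in ordered],
    [sample_scores[s] for s in ordered],
)
print(sample_scores, auc)
```

Averaging is only one possible choice; a majority vote over tiles or the fraction of tiles above a threshold would yield different per-sample scores.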
Apache-2.0 license