This repository contains the codebase for Pathology Foundation Models for Ovarian Cancer Subtype Classification in Whole-slide Images. We used the dataset from the UBC Ovarian Cancer Subtype Classification and Outlier Detection (UBC-OCEAN) competition (2023) to classify five ovarian cancer subtypes in histopathology whole-slide images (WSIs).
Development environment:
Python 3.11.5 | CUDA 11.8 | PyTorch 2.2.0 | TensorFlow 2.14.0
Training and feature extraction were performed on an NVIDIA T4 GPU (16GB memory).
This codebase is adapted from the CLAM project[1]. All additional scripts for reproducibility are located in the scripts directory. The setup assumes a working virtual environment (e.g., conda or virtualenv) with Python 3.11 and PyTorch 2.2.0 installed.
- Install PyTorch and Torchvision following the official instructions, e.g.:

```shell
pip install torch torchvision
```

- Install required dependencies:

```shell
pip install -r requirements.txt
```

- [Optional] For REMEDIS models, install TensorFlow 2 and TensorFlow Hub:

```shell
pip install "tensorflow>=2.0.0"
pip install --upgrade tensorflow-hub
```

To use the SVM loss[2], navigate to the smooth-topk directory and install it with:

```shell
python setup.py install
```

The ovarian cancer dataset (totaling 795 GB before unzipping) can be downloaded from the UBC-OCEAN competition[3] on Kaggle under the CC BY-NC-ND 4.0 license. It includes five classes of ovarian cancer subtypes: high-grade serous carcinoma (HGSC), clear-cell ovarian carcinoma (CC), endometrioid carcinoma (EC), low-grade serous carcinoma (LGSC), and mucinous carcinoma (MC).
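When the dataset CSV is built, the five subtypes above are encoded as class labels. The sketch below is only an illustration of one plausible encoding; the exact label strings and index order used by `make_dataset_csv_UBC.py` may differ.

```python
# Hypothetical label map for the five UBC-OCEAN subtypes; the exact
# strings and ordering used by make_dataset_csv_UBC.py may differ.
SUBTYPE_NAMES = {
    "HGSC": "high-grade serous carcinoma",
    "CC": "clear-cell ovarian carcinoma",
    "EC": "endometrioid carcinoma",
    "LGSC": "low-grade serous carcinoma",
    "MC": "mucinous carcinoma",
}

# Integer class indices in a fixed (alphabetical) order for training.
LABEL_TO_INDEX = {name: i for i, name in enumerate(sorted(SUBTYPE_NAMES))}
```

Fixing the index order up front (here, alphabetically) keeps the label encoding reproducible across runs.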
Assuming the dataset is downloaded and unzipped to /path/to/UBC-OCEAN, follow these steps to prepare it:

- Convert `.png` to pyramidal `.tif` format (total output ~3.2 TB):

```shell
# --png_dir: original png images; --output_dir: output tiff directory;
# --train_csv: train csv file
python ./scripts/convert_png_to_tif.py \
--png_dir /path/to/UBC-OCEAN/train_images \
--output_dir /path/to/UBC_OCEAN_WSI/tiff \
--train_csv /path/to/UBC-OCEAN/train.csv
```

- Generate the dataset CSV
This step is optional, as the dataset CSV is already provided in the dataset_csv directory. To reproduce the CSV file creation:

```shell
python ./scripts/make_dataset_csv_UBC.py --train_csv /path/to/UBC-OCEAN/train.csv
```

- Extract patch coordinates for each WSI:
```shell
python create_patches_fp.py \
--source /path/to/UBC_OCEAN_WSI/tiff \
--save_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
--step_size 224 \
--patch_size 224 \
--seg \
--stitch \
--patch
```

- Generate training, validation, and testing splits

This step is optional, as the split CSV files are already provided under the splits directory. To reproduce the split creation:

```shell
python create_splits_seq.py --task task_1_UBC_OCEAN_WSI --seed 2024
```

For feature extraction using the ResNet baseline models, follow the instructions below for each variant:
For ResNet-50:

```shell
python extract_features_fp_resnet.py \
--data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
--data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
--slide_ext .tif \
--csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
--feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/resnet50 \
--model resnet50 \
--batch_size 256
```

For ResNet-152:

```shell
python extract_features_fp_resnet.py \
--data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
--data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
--slide_ext .tif \
--csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
--feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/resnet152 \
--model resnet152 \
--batch_size 256
```

Where:
- `data_h5_dir`: the directory containing the preprocessed data for the current task.
- `data_slide_dir`: the directory containing the WSI images.
- `slide_ext`: the file extension of the WSI images.
- `csv_path`: the path to the CSV file containing the slide names and their corresponding cancer subtype labels.
- `feat_dir`: the directory to save the extracted features.
- `model`: the ResNet model variant to use for feature extraction.
Try adding `--compile` for accelerated feature extraction.
To use pre-trained models locally, download the weights from PyTorch model zoo to the workdir directory. These weights will be loaded from there.
For ResNet-50:

```shell
wget -O ./workdir/resnet50-19c8e357.pth https://download.pytorch.org/models/resnet50-19c8e357.pth
```

For ResNet-152:

```shell
wget -O ./workdir/resnet152-b121ed2d.pth https://download.pytorch.org/models/resnet152-b121ed2d.pth
```

Start by downloading the PLIP[4] model with the following command, which saves it to the workdir directory:
```shell
python ./scripts/download_plip.py
```

To extract features using the PLIP model:

```shell
python extract_features_fp_plip.py \
--data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
--data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
--slide_ext .tif \
--csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
--feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/plip \
--batch_size 256
```

According to its Terms of Service, we are not permitted to redistribute the REMEDIS[5] model weights or any image features extracted with them. However, access to the model weights can be requested from Medical AI Research Foundations on PhysioNet after registering as a credentialed user and acknowledging the usage license.
Once the model weights are obtained, features can be extracted with the following command, shown here for path-50x1-remedis-m:

```shell
python extract_features_fp_remedis.py \
--data_h5_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI \
--data_slide_dir /path/to/UBC_OCEAN_WSI/tiff \
--slide_ext .tif \
--csv_path ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
--feat_dir ./data/CLAM_preprocessed/UBC_OCEAN_WSI/path-50x1-remedis-m \
--remedis_model path-50x1-remedis-m \
--remedis_weights /path/to/medical-ai-research-foundation/1.0.0/path-50x1-remedis-m \
--batch_size 256
```

Where:
- `remedis_model`: the name of the REMEDIS model variant.
- `remedis_weights`: the path to the REMEDIS model weights.
All pathology REMEDIS models listed here are supported.
Note: Loading the REMEDIS model requires TensorFlow 2 and TensorFlow Hub.
To train a classifier using the extracted features:

```shell
python main.py \
--data_root_dir ./data/CLAM_preprocessed \
--results_dir ./data/CLAM_results \
--task task_1_UBC_OCEAN_WSI \
--split_dir ./splits/task_1_UBC_OCEAN_WSI_100 \
--model_type clam_sb \
--model_size $model_size \
--feat_name $feat_name \
--exp_code clam_sb_${model_size}_dropout \
--log_data \
--early_stopping \
--bag_loss ce \
--inst_loss svm \
--weighted_sample \
--drop_out
```

Where:
- `feat_name`: specifies the feature extractor used (e.g., `resnet50`, `resnet152`, `plip`, `path-50x1-remedis-m`, `path-152x2-remedis-m`).
- `model_size`: varies based on the feature extractor. For example, `plip` comes with the options `plip_small` and `plip_big`. This is consistent across the ResNet and REMEDIS models.
- `results_dir`: the master directory for saving the trained models.
- `split_dir`: the directory containing the split csv files.
- `exp_code`: the directory name for saving the trained model.
- `log_data`: enable TensorboardX logging (optional).
- `bag_loss`: the loss function for bag-level classification. Options include `ce` (cross-entropy) and `svm` (SVM).
- `inst_loss`: the loss function for instance-level clustering. Options include `ce` (cross-entropy) and `svm` (SVM).
- `weighted_sample`: enable weighted sampling for the training data (optional).
- `drop_out`: enable dropout (p=0.25).
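The idea behind `--weighted_sample` is to counter class imbalance by drawing training slides in inverse proportion to their class frequency. The sketch below illustrates that idea only; CLAM's actual sampler implementation may differ in detail.

```python
from collections import Counter


def class_balanced_weights(labels):
    """Per-sample weights proportional to 1 / class frequency, so each
    class contributes equally in expectation. An illustrative sketch of
    the idea behind --weighted_sample, not CLAM's exact code."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[y] for y in labels]
```

For example, with labels `["HGSC", "HGSC", "MC"]` the weights come out as `[1.5, 1.5, 3.0]`: the single MC slide is drawn twice as often as each HGSC slide, so both classes are sampled equally often overall.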
Note: The SVM instance loss option (`--inst_loss svm`) requires smooth-topk to be installed beforehand.
To evaluate the trained models:

```shell
python eval.py \
--data_root_dir ./data/CLAM_preprocessed \
--results_dir ./data/CLAM_results/task_1_UBC_OCEAN_WSI \
--task task_1_UBC_OCEAN_WSI \
--splits_dir ./splits/task_1_UBC_OCEAN_WSI_100 \
--model_type clam_sb \
--feat_name $feat_name \
--model_size $model_size \
--models_exp_code clam_sb_${model_size}_dropout_s2024 \
--save_exp_code clam_sb_${model_size}_dropout \
--save_dir ./data/CLAM_evaluation/task_1_UBC_OCEAN_WSI \
--split test \
--drop_out
```

Where:
- `model_size` and `feat_name`: remain consistent with the training command.
- `results_dir`: the master directory containing the trained models for the current task.
- `models_exp_code`: the directory name of the saved trained model.
- `save_exp_code`: the directory name for saving the evaluation results.
- `save_dir`: the master directory for saving the evaluation results for the current task.
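When summarising evaluation results across the five subtypes, a class-balanced accuracy (the mean of per-class recalls) is a useful complement to plain accuracy under class imbalance. The stdlib sketch below is an illustration for post-processing saved predictions, not eval.py's own metric code, whose outputs may differ.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust to class imbalance."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Average recall over the classes present in y_true.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For instance, `balanced_accuracy(["HGSC", "HGSC", "MC"], ["HGSC", "MC", "MC"])` gives 0.75: HGSC recall is 0.5 and MC recall is 1.0, and each class counts equally regardless of its size.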
We provide a script create_heatmaps_plip.py, adapted from create_heatmaps.py, for generating heatmaps using the PLIP model. This process requires both a .yaml config file and a CSV file containing the slide names along with their corresponding cancer subtype labels. Optional parameters, such as patch size, can also be specified in the CSV file. Sample files and further guidance are available in the heatmaps directory.
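As a rough illustration, such a CSV might look like the fragment below; the column names (and any optional columns such as patch size) are assumptions here, and the sample files in the heatmaps directory are the authoritative reference.

```
slide_id,label
12345.tif,HGSC
67890.tif,CC
```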
To reproduce the creation of the heatmap CSV file for the testing set:

```shell
python ./scripts/make_heatmap_csv_test.py \
--process_csv ./data/CLAM_preprocessed/UBC_OCEAN_WSI/process_list_autogen.csv \
--split_csv ./splits/task_1_UBC_OCEAN_WSI_100/splits_0.csv \
--dataset_csv ./dataset_csv/UBC_OCEAN_WSI.csv \
--save_csv ./heatmaps/process_list.csv
```

To generate the .yaml configuration files needed for heatmap generation for the trained models:
```shell
python ./scripts/create_heatmap_config_from_results.py \
--results_dir ./data/CLAM_results \
--task task_1_UBC_OCEAN_WSI \
--data_dir /path/to/UBC_OCEAN_WSI/tiff \
--overlap 0 \
--patch_size 224 \
--num_workers 2
```

The generated .yaml config files will be saved to the heatmaps/configs directory, named after the corresponding checkpoint directories.
To generate heatmaps with PLIP as the feature extractor:

```shell
python create_heatmaps_plip.py --config_file ./heatmaps/configs/clam_sb_plip_small_dropout_0.yaml
```

We appreciate the authors of CLAM, PLIP, and REMEDIS for open-sourcing their codebases and pre-trained models for feature extraction. We also thank the organisers of the UBC-OCEAN competition for making the dataset publicly available.
1. Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M., Mahmood, F.: Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering 5(6), 555–570 (2021)
2. Berrada, L., Zisserman, A., Kumar, M.P.: Smooth loss functions for deep top-k classification. International Conference on Learning Representations (2018)
3. Bashashati, A., Farahani, H., OTTA Consortium, Karnezis, A., Akbari, A., Kim, S., Chow, A., Dane, S., Zhang, A., Asadi, M.: UBC Ovarian Cancer Subtype Classification and Outlier Detection (UBC-OCEAN) (2023)
4. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T.J., Zou, J.: A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine 29(9), 2307–2316 (2023)
5. Azizi, S., Culp, L., Freyberg, J., Mustafa, B., Baur, S., Kornblith, S., Chen, T., Tomasev, N., Mitrović, J., Strachan, P., et al.: Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering, 1–24 (2023)