Abstract:
Deep learning based analysis of histopathology images shows promise in advancing understanding of tumor progression, tumor micro-environment, and their underpinning biological processes. So far, these approaches have focused on extracting information associated with annotations. In this work, we ask how much information can be learned from the tissue architecture itself.
We present an adversarial learning model to extract feature representations of cancer tissue, without the need for manual annotations. We show that these representations are able to identify a variety of morphological characteristics across three cancer types: Breast, colon, and lung. This is supported by 1) the separation of morphologic characteristics in the latent space; 2) the ability to classify tissue type with logistic regression using latent representations, with an AUC of 0.97 and 85% accuracy, comparable to supervised deep models; 3) the ability to predict the presence of tumor in Whole Slide Images (WSIs) using multiple instance learning (MIL), achieving an AUC of 0.98 and 94% accuracy.
Our results show that our model captures distinct phenotypic characteristics of real tissue samples, paving the way for further understanding of tumor progression and tumor micro-environment, and ultimately refining histopathological classification for diagnosis and treatment.
@InProceedings{quiros2021adversarial,
title={Adversarial learning of cancer tissue representations},
author={Quiros, Adalberto Claudio and Coudray, Nicolas and Yeaton, Anna and Suhnhem, Wisuwat and Murray-Smith, Roderick and Tsirigos, Aristotelis and Yuan, Ke},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
year={2021},
organization={Springer}
}
- (a) Real tissue images, (b) Reconstructed images with PathologyGAN.
- Uniform Manifold Approximation and Projection (UMAP) vectors of PathologyGAN's latent representations, breast cancer tissue from NKI and VGH (left image) and colorectal cancer tissue from NCT (right image). Breast cancer tissue images are labeled using cancer cell counts, class 8 accounting for the largest cell number. Colorectal cancer tissue images are labeled based on their tissue type.
Latent Representations on Multiple Instance Learning:
- Uniform Manifold Approximation and Projection (UMAP) vectors of lung cancer tissue representations. We labeled each patch of the WSI with the corresponding label subject to presence of tumor in the WSI, and highlight images and representations where the attention-based deep MIL focuses to predict the outcome.
The H&E breast cancer databases come from the Netherlands Cancer Institute (NKI) cohort and the Vancouver General Hospital (VGH) cohort, with 248 and 328 patients respectively. Each includes tissue micro-array (TMA) images along with clinical patient data such as survival time and estrogen-receptor (ER) status. The original TMA images all have a resolution of 1128x720 pixels; we split each image into smaller patches of 224x224 that overlap by 50%. We also augment these images with rotations of 90 and 180 degrees and with vertical and horizontal inversion. We filter out images in which tissue covers less than 70% of the area. In total this yields a training set of 249K images and a test set of 62K.
The NKI and VGH cohorts were previously used in Beck et al. [1]. The TMA images are from the Stanford Tissue Microarray Database [2].
You can find a pre-processed HDF5 file with patches of 224x224x3 resolution here. Each patch also carries labeling information for estrogen receptor status and survival time.
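The patching, filtering, and augmentation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's preprocessing code: the function names, the near-white threshold used to estimate tissue coverage, and the choice to filter before augmenting are all assumptions.

```python
import numpy as np

def extract_patches(image, patch=224, overlap=0.5):
    """Slide a patch x patch window over an H x W x C image with fractional overlap."""
    stride = int(patch * (1 - overlap))  # 112 px for 224 px patches at 50% overlap
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return patches

def tissue_fraction(patch, white_threshold=220):
    """Fraction of pixels darker than a near-white background (hypothetical heuristic)."""
    gray = patch.mean(axis=-1)
    return float((gray < white_threshold).mean())

def augment(patch):
    """Original patch, rotations by 90 and 180 degrees, vertical and horizontal flips."""
    return [patch, np.rot90(patch, 1), np.rot90(patch, 2),
            np.flipud(patch), np.fliplr(patch)]

def preprocess(image, min_tissue=0.7):
    """Keep patches with at least min_tissue coverage, then augment each one."""
    kept = [p for p in extract_patches(image) if tissue_fraction(p) >= min_tissue]
    return [a for p in kept for a in augment(p)]

# Synthetic stand-in for one 1128x720 TMA image (height 720, width 1128).
tma = np.zeros((720, 1128, 3), dtype=np.uint8)
print(len(extract_patches(tma)), len(preprocess(tma)))
```

With a stride of 112 pixels, a 1128x720 image gives 5 rows by 9 columns of patches, so 45 patches before filtering.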
The H&E colorectal cancer dataset can be found here. The dataset from the National Center for Tumor Diseases (NCT, Germany) [3] provides tissue images of 224×224 resolution, each with an associated tissue type label: adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa, cancer-associated stroma, and colorectal adenocarcinoma epithelium (tumor). The dataset is divided into a training set of 100K tissue tiles from 86 patients and a test set of 7K tissue tiles from 50 patients, with no overlapping patients between the train and test sets.
The H&E lung cancer dataset can be found at The Cancer Genome Atlas (TCGA). It contains samples with adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), and normal tissue, comprising 1807 Whole Slide Images (WSIs) from 1184 patients. We use the pipeline provided in Coudray et al. [4], dividing each WSI into patches of 224x224, filtering out images with less than 50% tissue in total area, and applying stain normalization [5]. In addition, we label each slide as tumor or non-tumor depending on the presence of lung cancer in the tissue. Finally, we split the dataset into a training set of 916K tissue patches from 666 patients and a test set of 569K tissue patches from 518 patients, with no overlapping patients between the two sets. We use this dataset to apply multiple instance learning (MIL) over latent representations, testing how well they predict the presence of tumor in the WSI.
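To make the MIL step more concrete, here is a minimal sketch of attention-based pooling over the latent representations of one slide's patches, using a commonly used formulation (softmax attention over a tanh projection). The source does not specify the MIL architecture, so the layer sizes, the random parameters standing in for learned weights, and the linear classifier `u` are all assumptions for illustration.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Attention-based MIL pooling over a bag of instance vectors.

    H: (n_instances, z_dim) latent representations of one WSI's patches.
    V: (hidden, z_dim) and w: (hidden,) would be learned; they are random here.
    Returns the bag embedding and the per-instance attention weights.
    """
    scores = np.tanh(H @ V.T) @ w            # one attention score per patch
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    bag = weights @ H                        # attention-weighted average, shape (z_dim,)
    return bag, weights

rng = np.random.RandomState(0)
z_dim, hidden, n_patches = 200, 64, 500      # hypothetical sizes
H = rng.normal(size=(n_patches, z_dim))      # stand-in for projected patch latents
V = rng.normal(scale=0.1, size=(hidden, z_dim))
w = rng.normal(scale=0.1, size=hidden)
u = rng.normal(scale=0.1, size=z_dim)        # hypothetical slide-level classifier

bag, weights = attention_mil_pool(H, V, w)
p_tumor = 1.0 / (1.0 + np.exp(-(bag @ u)))   # slide-level tumor probability
top_patches = np.argsort(weights)[::-1][:5]  # patches the attention focuses on most
```

The attention weights sum to one across the bag, so the highest-weighted patches indicate where the model focuses when predicting the slide-level outcome.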
[1] Beck, A.H. and Sangoi, A.R. and Leung, S. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Science translational medicine, 2018.
[2] Robert J. Marinelli, Kelli Montgomery, Chih Long Liu, Nigam H. Shah, Wijan Prapong, Michael Nitzberg, Zachariah K. Zachariah, Gavin J. Sherlock, Yasodha Natkunam, Robert B. West, Matt van de Rijn, Patrick O. Brown, and Catherine A. Ball. The Stanford Tissue Microarray Database. Nucleic Acids Res 2008 36(Database issue): D871-7. Epub 2007.
[3] Kather, J.N., Halama, N., Marx, A.: 100,000 histological images of human colorectal cancer and healthy tissue, 2018.
[4] Coudray, N., Ocampo, P.S., Sakellaropoulos, T., Narula, N., Snuderl, M., Fenyö, D., Moreira, A.L., Razavian, N., Tsirigos, A.: Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature Medicine, 2018.
[5] Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications, 2001.
h5py 2.9.0
numpy 1.16.1
pandas 0.24.1
scikit-image 0.14.2
scikit-learn 0.20.2
scipy 1.2.0
seaborn 0.9.0
sklearn 0.0
tensorboard 1.12.2
tensorflow 1.12.0
tensorflow-probability 0.5.0
python 3.6.7
You can find pre-trained weights for the model trained on breast cancer here and for the model trained on colorectal cancer here.
You can find a pre-processed HDF5 file with patches of 224x224x3 resolution of the H&E breast cancer dataset here. Place 'vgh_nki' under the 'datasets' folder in the main 'Adversarial-learning-of-cancer-tissue-representations' path.
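Before training, it can help to check what the downloaded HDF5 file contains. The sketch below uses h5py to list every dataset with its shape and dtype. The dataset names inside the file are not documented here, so the helper (`describe_hdf5`, a hypothetical name) discovers them rather than assuming any particular key.

```python
import sys
import h5py

def describe_hdf5(path):
    """Return {dataset_name: (shape, dtype)} for every dataset in an HDF5 file."""
    summary = {}

    def visit(name, obj):
        # visititems walks groups and datasets; only datasets are recorded.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return summary

if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the downloaded file as the first argument.
    for name, (shape, dtype) in describe_hdf5(sys.argv[1]).items():
        print(name, shape, dtype)
```

For the patches described above, the image dataset would be expected to have a trailing shape of 224x224x3.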
Each model was trained on an NVIDIA Titan RTX 24 GB for 45 epochs, approximately 72 hours.
usage: run_pathgan_encoder.py [-h] [--model MODEL] [--img_size IMG_SIZE]
                              [--img_ch IMG_CH] [--dataset DATASET]
                              [--marker MARKER] [--z_dim Z_DIM]
                              [--epochs EPOCHS] [--batch_size BATCH_SIZE]
                              [--check_every CHECK_EVERY] [--restore]
                              [--report] [--main_path MAIN_PATH]
                              [--dbs_path DBS_PATH]

PathologyGAN Encoder trainer.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Model name.
  --img_size IMG_SIZE   Image size for the model.
  --img_ch IMG_CH       Number of channels for the model.
  --dataset DATASET     Dataset to use.
  --marker MARKER       Marker of dataset to use.
  --z_dim Z_DIM         Latent space size.
  --epochs EPOCHS       Number epochs to run: default is 45 epochs.
  --batch_size BATCH_SIZE
                        Batch size, default size is 64.
  --check_every CHECK_EVERY
                        Save checkpoint and generate samples every X epochs.
  --restore             Restore previous run and continue.
  --report              Report latent space figures.
  --main_path MAIN_PATH
                        Path for the output run.
  --dbs_path DBS_PATH   Directory with DBs to use.
- Command example:
python3 run_pathgan_encoder.py
Once you have a trained model, you can project images onto the latent space. The vector representations are written to an H5 file in the 'results' folder.
usage: project_real_tissue_latent_space.py [-h] --checkpoint CHECKPOINT
                                           --real_hdf5 REAL_HDF5
                                           [--batch_size BATCH_SIZE]
                                           [--z_dim Z_DIM] [--model MODEL]
                                           [--img_size IMG_SIZE]
                                           [--img_ch IMG_CH]
                                           [--dataset DATASET]
                                           [--marker MARKER]
                                           [--dbs_path DBS_PATH]
                                           [--main_path MAIN_PATH]
                                           [--num_clusters NUM_CLUSTERS]
                                           [--clust_percent CLUST_PERCENT]
                                           [--features] [--save_img]

Projection of tissue images onto the GAN's latent space.

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to pre-trained weights (.ckt) of PathologyGAN.
  --real_hdf5 REAL_HDF5
                        Path for real image to encode.
  --batch_size BATCH_SIZE
                        Batch size.
  --z_dim Z_DIM         Latent space size.
  --model MODEL         Model name.
  --img_size IMG_SIZE   Image size for the model.
  --img_ch IMG_CH       Image channels for the model.
  --dataset DATASET     Dataset to use.
  --marker MARKER       Marker of dataset to use.
  --dbs_path DBS_PATH   Directory with DBs to use.
  --main_path MAIN_PATH
                        Path for the output run.
  --features            Flag to run features over the images.
  --save_img            Save reconstructed images in the H5 file.
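
If you drive the projection script from Python, for example to loop over several checkpoints, a small helper can assemble the argument list from the flags shown in the usage above. This is only a sketch: `projection_command` is a hypothetical helper, the two paths are placeholders, and passing 'vgh_nki' as the `--dataset` value is an assumption based on the folder name mentioned earlier.

```python
import shlex

def projection_command(checkpoint, real_hdf5, features=False, save_img=False, **options):
    """Assemble an argument list for project_real_tissue_latent_space.py."""
    # Only value-taking flags that appear in the script's usage message are accepted.
    allowed = {"batch_size", "z_dim", "model", "img_size", "img_ch", "dataset",
               "marker", "dbs_path", "main_path", "num_clusters", "clust_percent"}
    unknown = set(options) - allowed
    if unknown:
        raise ValueError("Unknown options: %s" % sorted(unknown))
    command = ["python3", "project_real_tissue_latent_space.py",
               "--checkpoint", str(checkpoint), "--real_hdf5", str(real_hdf5)]
    for name in sorted(options):
        command += ["--" + name, str(options[name])]
    if features:
        command.append("--features")
    if save_img:
        command.append("--save_img")
    return command

# Placeholder paths; replace them with your own checkpoint and HDF5 file.
cmd = projection_command("path/to/checkpoint", "path/to/images.h5",
                         dataset="vgh_nki", save_img=True)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.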