Welcome to the hist2RNA documentation! This guide will walk you through the process of using hist2RNA to predict gene expression from breast cancer histopathology images. Please follow the instructions below to get started.
- Requirements
- Installation
- Preparing the Data
- Training the Model
- Evaluating the Model
- Predicting Gene Expression
- Advanced Usage
- Troubleshooting
- Frequently Asked Questions (FAQs)
- References
Before you begin, ensure that your system meets the following requirements:
- Python 3.9.2
- PyTorch 2.0
To install hist2RNA, follow these steps:
- Clone the repository:
  git clone https://github.com/yourusername/hist2RNA.git
- Change directory to the cloned repository:
  cd hist2RNA
- Install the required packages:
  pip install -r requirements.txt
Before training the model, you'll need to prepare your dataset. hist2RNA uses two inputs: histopathology image patches grouped into one folder per patient, and a gene expression file that supplies the labels for each patient.
The hist2RNA project expects the dataset to be organized in a specific structure to ensure proper retrieval of images for training and validation. The dataset should be organized into separate folders for each patient, with each patient folder containing 1000+ patches of histopathology images.
Here is an example of the expected directory structure:
dataset/
│
├── patient_01/
│ ├── patch_0001.png
│ ├── patch_0002.png
│ ├── ...
│ └── patch_1000.png
│
├── patient_02/
│ ├── patch_0001.png
│ ├── patch_0002.png
│ ├── ...
│ └── patch_1000.png
│
└── ...
Make sure to organize your dataset according to this structure before running the training script. The training script will process the data accordingly and retrieve the images based on this organization.
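Before training, it can help to confirm that the layout matches the structure above. The following is a minimal sketch, not part of hist2RNA: the function names, the `*.png` pattern, and the `min_patches=1000` threshold are assumptions taken from the example tree.

```python
from pathlib import Path


def count_patches(dataset_dir, pattern="*.png"):
    """Return {patient_folder_name: number_of_patch_files} for each subfolder."""
    root = Path(dataset_dir)
    counts = {}
    for patient_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[patient_dir.name] = sum(1 for _ in patient_dir.glob(pattern))
    return counts


def patients_below_minimum(counts, min_patches=1000):
    """List patients whose patch count is under the assumed minimum."""
    return [patient for patient, n in counts.items() if n < min_patches]


if __name__ == "__main__":
    dataset = Path("dataset")  # assumed location, as in the example tree
    if dataset.is_dir():
        counts = count_patches(dataset)
        print(f"{len(counts)} patient folders found")
        for patient in patients_below_minimum(counts):
            print(f"warning: {patient} has only {counts[patient]} patches")
```

Any patient reported here either needs more patches or a lower threshold, depending on how your patches were extracted.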
The hist2RNA project requires gene expression data to be provided as labels for each patient during training. Each patient should have 138 gene expression values corresponding to their histopathology images.
The gene expression data should be organized in a CSV (Comma Separated Values) file with the following structure:
patient_id,gene_1,gene_2,gene_3,...,gene_138
patient_01, 0.23, 0.56, 0.78,..., 1.32
patient_02, 0.34, 0.67, 0.82,..., 1.45
patient_03, 0.28, 0.54, 0.75,..., 1.28
...
The first row of the CSV file should contain the column names, with the first column being patient_id and the subsequent columns being the gene names (gene_1, gene_2, ..., gene_138).
Each subsequent row should contain the patient ID and the gene expression values for each of the 138 genes, separated by commas.
Ensure that your gene expression data is formatted according to this structure before running the training script. The training script will read the gene expression data and associate it with the corresponding patient's histopathology images during training.
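The format described above can be checked before training. This is a minimal sketch using only the standard library; it does not show how hist2RNA itself reads the file, and the function names are hypothetical.

```python
import csv


def load_expression_csv(csv_path, n_genes=138):
    """Read the label file into {patient_id: [float, ...]} and check its shape."""
    labels = {}
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle, skipinitialspace=True)
        header = next(reader)
        if header[0].strip() != "patient_id":
            raise ValueError("first column must be named 'patient_id'")
        if len(header) - 1 != n_genes:
            raise ValueError(
                f"expected {n_genes} gene columns, found {len(header) - 1}"
            )
        for row in reader:
            if not row:
                continue  # skip blank lines
            patient_id = row[0].strip()
            values = [float(v) for v in row[1:]]
            if len(values) != n_genes:
                raise ValueError(
                    f"{patient_id}: expected {n_genes} values, found {len(values)}"
                )
            labels[patient_id] = values
    return labels


def missing_labels(patient_folders, labels):
    """Patient folder names that have no row in the label file."""
    return sorted(set(patient_folders) - set(labels))
```

Running `missing_labels` on the patient folder names and the loaded labels shows which patients would have images but no expression values.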
Make sure to keep color_normalizer.py and data_load.py in the same folder as training_main.py, which imports them:
from color_normalizer import MacenkoColorNormalization
from data_load import PatientDataset
To train the hist2RNA model, use the training_main.py script as follows:
python training_main.py --slides_dir ./data/slides/ --epochs 50 --batch_size 12 --lr 0.001
This command trains the model on the data in the ./data/slides/ folder for 50 epochs, with a batch size of 12 and a learning rate of 0.001.
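To make the meaning of each flag concrete, here is a sketch of an argument parser that accepts the same command line. The flag names come from the command above; the types, defaults, and help strings are assumptions, and the real parser in training_main.py may differ.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the flags of the training command."""
    parser = argparse.ArgumentParser(description="Train hist2RNA (illustrative)")
    parser.add_argument("--slides_dir", type=str, required=True,
                        help="folder containing one subfolder of patches per patient")
    parser.add_argument("--epochs", type=int, default=50,
                        help="number of passes over the training data")
    parser.add_argument("--batch_size", type=int, default=12,
                        help="number of samples per optimization step")
    parser.add_argument("--lr", type=float, default=0.001,
                        help="learning rate of the optimizer")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```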
To test the performance of the trained model, use the test_main.py script:
python test_main.py --slides_dir ./data/slides/ --test_patient_id ./patient_details/test_patient_id.txt --checkpoint_file ./models/hist2RNA_model.pth
This command evaluates the trained model saved in ./models/hist2RNA_model.pth on the data in the ./data/slides/ folder, for the patients listed in ./patient_details/test_patient_id.txt.
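The evaluation command expects a file of test patient IDs. The sketch below shows one way to hold out patients and write such a file. It assumes one patient ID per line, which this guide does not confirm, and the function names, split fraction, and seed are hypothetical.

```python
import random
from pathlib import Path


def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Shuffle patient IDs reproducibly and return (train_ids, test_ids)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return sorted(ids[n_test:]), sorted(ids[:n_test])


def write_id_file(path, patient_ids):
    """Write one patient ID per line (assumed file format)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(patient_ids) + "\n")


def read_id_file(path):
    """Read the IDs back, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Splitting by patient rather than by patch keeps all patches of one patient on the same side of the split, which avoids leaking test patients into training.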
To generate the box plot and the AUC-RCH curve, first make sure that test_main.py has written its results to the following files:
FILENAME_ACROSS_GENE = './save_result/test_result_across_gene.csv'
FILENAME_ACROSS_PATIENT = './save_result/test_result_across_patient.csv'
Then run:
python generate_box_plot.py
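A quick check that both result files exist can save a failed plotting run. This is a small sketch: the two paths are the ones listed above, while the helper name is hypothetical.

```python
from pathlib import Path

FILENAME_ACROSS_GENE = "./save_result/test_result_across_gene.csv"
FILENAME_ACROSS_PATIENT = "./save_result/test_result_across_patient.csv"


def missing_result_files(paths=(FILENAME_ACROSS_GENE, FILENAME_ACROSS_PATIENT)):
    """Return the expected result files that are absent or empty."""
    missing = []
    for path in paths:
        candidate = Path(path)
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_result_files():
        print(f"missing or empty: {path} (run test_main.py first)")
```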
In this section, we will explore some advanced usage scenarios and options for the hist2RNA project.
If you want to customize the hist2RNA model architecture, you can modify the model.py file. This file contains the model definition and allows you to experiment with different layers, activation functions, and other hyperparameters.
To improve the performance of the model, you can apply data augmentation techniques. To do this, modify the data_load.py script to include data augmentation options when loading the training data. For example:
# Apply data augmentation
self.preprocess = transforms.Compose([
    transforms.RandomRotation(20),  # rotation_range=20
    transforms.RandomResizedCrop(224, scale=(0.85, 1.0), ratio=(0.75, 1.3333333333333333)),  # zoom_range=0.15 (approximation)
    transforms.RandomHorizontalFlip(),  # horizontal_flip=True
    transforms.RandomAffine(degrees=0, translate=(0.2, 0.2), shear=15),  # width_shift_range, height_shift_range, shear_range
    self.color_norm,
    self.model_transform,
    # If you want normalization, add it here (e.g., transforms.Normalize(mean, std))
])
To leverage the power of pre-trained models, you can use transfer learning. This approach uses the weights from a pre-trained model as the starting point for training your own model, which can improve performance, especially when the dataset is limited. To implement transfer learning, modify the main.py or feature_extraction_step_1.py file to include a pre-trained model (e.g., VGG16, ResNet50, or ViT) as the base of your model architecture.
PyTorch provides the torch.utils.tensorboard module to integrate with TensorBoard. First, install TensorBoard:
pip install tensorboard
Then add the following lines to the train.py script:
from torch.utils.tensorboard import SummaryWriter
import datetime
# Set up the TensorBoard writer:
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
writer = SummaryWriter(log_dir)
# Log scalars (like training loss, validation loss, etc.) during your training loop. For example, after each epoch:
for epoch in range(num_epochs):
    # Training code here...
    train_loss = ...
    writer.add_scalar('Train/Loss', train_loss, epoch)
    # Validation code here...
    val_loss = ...
    writer.add_scalar('Validation/Loss', val_loss, epoch)
If you want to visualize more than just scalars, like model weights, gradients, or even images, you can do so with methods like add_histogram, add_image, etc. For example, to log model weights:
# Run this inside the epoch loop so that `epoch` is defined
for name, param in model.named_parameters():
    writer.add_histogram(name, param.detach().cpu().numpy(), epoch)
Close the writer at the end of training:
writer.close()
In your terminal or command prompt, navigate to the directory containing your script and run:
tensorboard --logdir logs/fit
To fine-tune the hyperparameters of your model, such as the learning rate, batch size, or number of epochs, you can modify the relevant arguments in the main.py script. Experimenting with different hyperparameters can help you optimize the performance of your model.
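One simple way to explore hyperparameters is a small grid search over command-line values. The sketch below only builds the command strings. It assumes the flags of the training_main.py command shown in the training section, and `build_commands` is a hypothetical helper.

```python
import itertools
import shlex


def build_commands(slides_dir, learning_rates, batch_sizes, epoch_counts):
    """Return one training command per combination of hyperparameters."""
    commands = []
    grid = itertools.product(learning_rates, batch_sizes, epoch_counts)
    for lr, batch_size, epochs in grid:
        argv = [
            "python", "training_main.py",
            "--slides_dir", slides_dir,
            "--epochs", str(epochs),
            "--batch_size", str(batch_size),
            "--lr", str(lr),
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in build_commands("./data/slides/", [0.001, 0.0001], [12, 24], [50]):
        print(command)
```

Each printed line can be run by hand or passed to a job scheduler. Compare the runs on a validation set rather than on the test patients.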
In this section, we provide answers to some frequently asked questions about the hist2RNA project.
Q: How can I improve the model's performance?
A: To improve the model's performance, you can try the following approaches:
- Increase the size or diversity of the training dataset.
- Apply data augmentation techniques (see the Advanced Usage section).
- Use transfer learning with pre-trained models (see the Advanced Usage section).
- Fine-tune the model's hyperparameters, such as the learning rate, batch size, or number of epochs (see the Advanced Usage section).
- Modify the model architecture to include additional or different layers (see the Advanced Usage section).
Q: Can I use hist2RNA with other types of cancer?
A: Yes, you can adapt hist2RNA to work with other types of cancer by changing the dataset and adjusting the model architecture as needed. However, the current implementation is tailored specifically for breast cancer histopathology images, and additional modifications might be necessary for optimal performance with other types of cancer.
Q: Can I use hist2RNA with other imaging modalities?
A: While hist2RNA is designed for histopathology images, it is possible to adapt the model for other imaging modalities. You would need to preprocess the data to ensure compatibility with the model and make any necessary adjustments to the model architecture.
If you still encounter any issues while using hist2RNA, please refer to the README.md file, check the existing issues, or create a new issue with a detailed description of the problem.
Below are some key references and resources for the hist2RNA project:
- PyTorch: https://pytorch.org/
- TCGA data portal: https://portal.gdc.cancer.gov/
- cBioPortal: http://www.cbioportal.org/
- Andrew Janowczyk's guide to downloading TCGA digital pathology images: http://www.andrewjanowczyk.com/download-tcga-digital-pathology-images-ffpe/