index.Rmd

--- 
title: "Tutorial of RNA-seq tumor immunity analysis"
author: "Yang Liu, Jennifer Altreuter, Yang Lin"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
#documentclass: book
#bibliography: [book.bib, packages.bib]
#biblio-style: apalike
link-citations: yes
description: "This is a minimal example of using the bookdown package to write a book. The output format for this example is bookdown::gitbook."
---

#  Introduction {#intro}

This tutorial describes how to perform integrative computational analysis of **tumor immunity** using bulk RNA-sequencing (RNA-seq) data. We will focus on inferring immune infiltration levels, immune repertoire features, immune response and HLA type from a gene expression profile.\

To support the uniform analysis of bulk RNA-sequencing data, we developed a RNA-seq immune analysis pipeline named **RIMA** that is available at https://github.com/liulab-dfci/RIMA_pipeline. \

Tumor RNA-seq has become an important technique for molecular profiling and
immune characterization of tumors. RNA-seq Immune Analysis (RIMA) performs
integrative computational modeling of the tumor microenvironment from bulk tumor
RNA-seq data, which has the potential to offer essential insights to cancer
immunology and immune-oncology studies.\

```{r fig.align='center', echo=FALSE, fig.cap='Flowchat of RIMA pipeline'}
knitr::include_graphics('images/RIMA.png', dpi = NA)
```

 The pre-processing module includes four main procedures:\
 <ul>
 <li>**Read mapping**</li>
 <li>**Quality control**</li>
 <li>**Gene quantification**</li>
 <li>**Batch effect removal**</li>
 </ul>
 The downstream analysis includes seven modules:
 <ul>
 <li>**Differential expression analysis**</li>
 <li>**Immune repertoire inference**</li>
 <li>**Immune infiltration estimation**</li>
 <li>**Immunotherapy response prediction**</li>
 <li>**Gene fusion**</li>
 <li>**Microbiome characterization**</li>
 <li>**Neoantigen detection**</li>
 </ul>
 
**Available Tools Checklist** 


| **Methods** | **Description** | 
| :---: | :---: | :---: |
|  | **---PREPROCESSING---**| 
| STAR | Spliced Transcript Alignment to a Reference | 
| Salmon | Gene Quantification | 
| RSeQC | High Throughput Sequence Data Evaluation | 
| batch_removal| Remove Batch Effects Using Limma | 
|  | **---DIFFERENTIAL EXPRESSION---** |  
| DESeq2 | Gene Differential Expression Analysis | 
| GSEA | Gene Set Enrichment Analysis | 
| ssGSEA | Single-sample GSEA | 
|  | **---IMMUNE REPERTOIRE---** |  
| TRUST4 | TCR and BCR Sequence Analysis | 
|  | **---IMMUNE INFILTRATION---** |  
| ImmuneDeconv | Cell Components Estimation | 
|  | **---IMMUNE RESPONSE---** |  
| MSIsensor2 | Microsatellite Instability (MSI) Detection | 
| TIDEpy | T cell dysfunction and exclusion prediction | 
|  | **---FUSION---** |  
| STAR-Fusion | Identify the fusion gene pairs | 
|  | **---MICROBIOME---** |  
| Centrifuge | Bacterial Abundance Detection |
|  | **---NEO-ANTIGEN---** |  |
| arcasHLA | HLA Class I and Class II Genotyping | 


# How to run RIMA

<h2>***“Please make sure you adhere to NIH and HIPAA security standards when uploading controlled access data to a cloud or high computing environment.”***</h2>

## Install RIMA and Set Up the Running Environment ##

To run RIMA, you will create a working directory. In the working directory, you will install the pipeline code and other required software.

**Running machine requirements**
<ul>
<li> Cores and memory --  We usually run RIMA on a machine with at least 64 GB of memory to ensure a smooth read alignment run using STAR. </li>
<li>Storage -- You will need to reserve enough storage for the following: \
    1. Reference files and pipeline code -- 65 GB \
    2. Raw data -- we recommend reserving five times the size of your data files.  For example, if your fastq files are 50GB each, then you should allot 250GB for each fastq file. Zipped fastq files are recommended. \
    3. Required software -- reserve ~20GB for installation of miniconda, mamba and snakemake.  
</li>
</ul>

**Prepare a working directory**

```
mkdir RIMA
```

**Get the pipeline code**

Download the RIMA_pipeline folder to your working directory. 

```
git clone  https://github.com/liulab-dfci/RIMA_pipeline.git
```

**Install miniconda**

Download and configure your conda environment. Follow the prompts on the installer screens.\  

#Note: You will need to install miniconda3 into the RIMA directory.  This means you need to override the default directory when miniconda3 is installed.\

#If you have conda installed elsewhere, you will still need a local copy in your RIMA directory in order for the environment installation shell program to work.  (see below)\

```
#go to your RIMA directory if you are not there already
cd RIMA

#obtain the Miniconda3 code and install in the RIMA directory

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh  
# !Don't forget to override the default directory for miniconda3!
# Install miniconda3 into your RIMA directory.

conda list  #to see whether you have successfully installed conda
```

**Set up the running environment**

The RIMA environment can be set up by running a shell command.  In the command, you need to specify the platform you are using -- Amazon Web Services (AWS) or Google Cloud Platform (GCP).  This process may take ten to fifteen minutes.   The user will need to answer two questions during the process -- one at the beginning of the program and one near the end -- to give permissions to install miniconda and snakemake.

#NOTE: before running the environment shell command, make sure to set the path for CONDA_ROOT
```
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH
```

Run the shell program:

```
#shell command for 
bash ./RIMA_environment.sh -p {platform --AWS, GCP}

#example for AWS
bash ./RIMA_environment.sh -p AWS
```

**Activate the RIMA environment**

Based on the conda environments information, the main environment "RIMA" can be activated as below.

```
#set the path if you this is a new login to your shell
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH

# activate the RIMA conda environment
source activate RIMA
```

## Download pre-built references

A pre-prepared RIMA reference folder can be downloaded using the code below. 

If you want to prepare a customized reference, you can follow [this tutorial](https://liulab-dfci.github.io/RIMA/customize-your-own-reference.html) to build your own reference.

The following link contains the **hg38** reference downloaded from <a href = 'https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files'>Genomic Data Commons</a> using version 22 index and annotation files.  
 
```
cd RIMA
wget http://cistrome.org/~lyang/ref.tar.gz  

# unzip the reference  
tar -zxvf ref.tar.gz

# remove the reference zip file to save some space (optional)
rm ref.tar.gz 
```


**Optional step** 


If you download your RIMA reference folder into a directory other than RIMA, you have to change the symbolic link for the ref_files under the rnaseq_pipeline folder:
For example: if you downloaded the RIMA reference folder into /mnt/tutorial/ref_files, use the ln command to change the symbolic link. 


```
cd RIMA

# remove the current link of ref_files
rm ref_files

# create a new symoblic link to your reference folder
ln -s /mnt/tutorial/ref_files
```

### Video tutorial
<iframe width="560" height="315" src="https://www.youtube.com/embed/XjggT6c0CW4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


## Prepare input files 

**Metasheet.csv**

Metasheet.csv is a comma-delimited file that resides in the RIMA_pipeline folder. The Metasheet.csv should contain phenotypic information about your samples that can be used for downstream analysis. Your metasheet should contain **Two Required Columns** (SampleName, Group) in comma-delimited format. 

For this tutorial, RIMA uses demo data from Zhao, J., Chen, A.X., Gartrell, R.D. et al. Immune and genomic correlates of response to anti-PD-1 immunotherapy in glioblastoma. Nat Med 25, 462–469 (2019). <a href= 'https://www.nature.com/articles/s41591-019-0349-y'>*https://doi.org/10.1038/s41591-019-0349-y* </a>. We selected 12 samples from this trial and included 6 treatment naive samples (3 responder and 3 non-responders) for cohort level comparison. 

<ul>
 <li>**SampleName:** Sequencing sample id.</li>
 <li>**PatName**: Patient id</li>
 <li>**Group**: Sample included for cohort level comparison. 'NA' indicates samples that should not be included in the comparison. In the example below, we compare responders(R) and non-responders(NR) using only treatment naive (pre) patients.   </li>
 </ul>
 
You can add columns to the metasheet in order to compare other phenotypes e.g. columns for Timing Age, Sex etc.

```
SampleName,PatName,Group,Age,Tissue,Gender,Timing,Responder,TumorLocation,OngoingTreatment,PFS,Survival,OS,SampleId,syn_batch
SRR8281218,P20,NR,63,Brain,male,Pre,NR,temporal,no,110,1,278,4790-NL-AS,1
SRR8281219,P20,NA,63,Brain,male,Post,NR,temporal,no,110,1,278,4975-NL-AS,1
SRR8281226,P53,NR,70,Brain,male,Pre,NR,frontal,Yes,83,1,337,3981-NL-AS,2
SRR8281236,P53,NA,70,Brain,male,Post,NR,frontal,Yes,83,1,337,4760-D1,2
SRR8281233,P56,NR,54,Brain,female,Pre,NR,Parietal,no,83,1,83,4341-D3,3
SRR8281230,P56,NA,54,Brain,female,Post,NR,Parietal,no,83,1,83,4680-A1,3
SRR8281244,P100,NA,31,Brain,male,Post,R,Frontal-crossesmidline,yes,151,0,615,4956-NL-AS,2
SRR8281245,P100,R,31,Brain,male,Pre,R,Frontal-crossesmidline,yes,151,0,615,4595-G1,2
SRR8281243,P101,R,32,Brain,male,Pre,R,frontal,no,0,1,414,3542-NL-AS,2
SRR8281238,P102,NA,64,Brain,female,Post,R,Temporal-anterior,yes,519,0,539,5094-NL-AS,1
SRR8281251,P101,NA,32,Brain,male,Post,R,frontal,no,0,1,414,4943-A1,2
SRR8281250,P102,R,64,Brain,female,Pre,R,Temporal-anterior,yes,519,0,539,4443-NL-AS,1
```

**Config.yaml**

In the RIMA_pipeline folder, we have prepared a config.yaml as a template to run the pipeline. The config file is divided into three sections: 
<ul>
 <li>  Fixed and user-defined parameters. </li>
 <li>  Cohort level analysis parameters. </li>
 <li>  Samples list. </li>
</ul>
You should provide these parameters with column names that match the columns in **metasheet.csv**.

Below is an example of a config.yaml file. You can set the patient name as the sample name. Note: Please make sure that sample names match the metasheet. Currently, only fastq files are accepted as input (including fastq.gz).

```
#########Fixed and user-defined parameters################
metasheet: metasheet.csv  # Meta info 
ref: ref.yaml             # Reference config 
assembly: hg38
cancer_type: GBM          #TCGA cancer type abbreviations
rseqc_ref: house_keeping  #Option: 'house_keeping' or 'false'. 
                          #By default, a subset of housekeeping genes is used by RSeQC to assess alignment quality.  
                          #This reduces the amount of time needed to run RSeQC.  
mate: [1,2]               #paired-end fastq format, we recommend naming paired-end reads with _1.fq.gz and _2.fq.gz


#########Cohort level analysis parameters################
design: Group             # Condition on which to do comparsion (as set up in metasheet.csv)
Treatment: R              # Treatment use in DESeq2, corresponding to positive log fold change
Control: NR               # Control use in DESeq2, corresponding to negative log fold change
batch: syb_batch          # Options: 'false' or a column name from the metasheet.csv.  
                          # If set to a column name in the metasheet.csv, the column name will be used for batch effect analysis (limma).  
                          # It will also be used as a covariate for differential analysis (DESeq2) to account for batch effect.  

pre_treated: false        # Option: true or false. 
                          # If set to false, patients are treatment naive.  
                          # If set to true, patients have received some form of therapy prior to the current study.


#########Samples list##############
samples:
  SRR8281218:
    - /mnt/zhao_trial/data/SRR8281218_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281218_2.fastq.gz
  SRR8281219:
    - /mnt/zhao_trial/data/SRR8281219_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281219_2.fastq.gz
  SRR8281226:
    - /mnt/zhao_trial/data/SRR8281226_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281226_2.fastq.gz
  SRR8281236:
    - /mnt/zhao_trial/data/SRR8281236_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281236_2.fastq.gz
  SRR8281230:
    - /mnt/zhao_trial/data/SRR8281230_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281230_2.fastq.gz
  SRR8281233:
    - /mnt/zhao_trial/data/SRR8281233_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281233_2.fastq.gz
  SRR8281244:
    - /mnt/zhao_trial/data/SRR8281244_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281244_2.fastq.gz
  SRR8281245:
    - /mnt/zhao_trial/data/SRR8281245_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281245_2.fastq.gz
  SRR8281243:
    - /mnt/zhao_trial/data/SRR8281243_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281243_2.fastq.gz
  SRR8281251:
    - /mnt/zhao_trial/data/SRR8281251_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281251_2.fastq.gz
  SRR8281238:
    - /mnt/zhao_trial/data/SRR8281238_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281238_2.fastq.gz
  SRR8281250:
    - /mnt/zhao_trial/data/SRR8281250_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281250_2.fastq.gz
```

**execution.yaml**

Use **execution.yaml** to control which modules to run in RIMA. The preprocess module outputs are required for optional downstream analysis modules. Details of each module will be introduced in the next chapters.

```
##Note: The preprocess individual and cohort modules are necessary to obtain alignment and quality results. 
##Run the remaining modules only after these two modules.
preprocess_individual: true
preprocess_cohort: true

##Optional modules
##Note: The below modules are specialized modules, each dealing with specific targets. 
##Make sure to run both the individual and cohort parts of each module 
##to get all results.

##Individual runs
mutation_individual: false
immune_response_individual: false
immune_repertoire_individual: false
microbiome_individual: false
neoantigen_individual: false


##Cohort runs
differential_expression_cohort: false
immune_infiltration_cohort: false
mutation_cohort: false
immune_response_cohort: false
immune_repertoire_cohort: false
microbiome_cohort: false
neoantigen_cohort: false

```

**ref.yaml**
The ref.yaml file provides the paths for all reference files.  If you downloaded the pre-built reference files, you should not need to change the ref.yaml file.

```
#REF.yaml - File to contain paths to static reference files that RNA-Seq requires
#           for its analysis.
# NOTE: these are already pre-filled for hg38 
# NOTE: organized by assemblies
# 
# !VERY IMPORTANT: you can OVERRIDE any of these values in config.yaml!

hg38:
###annotation path
  fasta_path: ./ref_files/GRCh38.d1.vd1.fa 
  gtf_path: ./ref_files/gencode.annotation.gtf
  bed_path: ./ref_files/RSeQC_bed/refseqGenes.bed
  housekeeping_bed_path: ./ref_files/RSeQC_bed/housekeeping_refseqGenes.bed
  trust4_reper_path: ./ref_files/TRUST4/hg38_bcrtcr.fa
  trust4_IMGT_path: ./ref_files/TRUST4/human_IMGT+C.fa

###index path
  star_index: ./ref_files/star_index
  star_fusion_index: ./ref_files/fusion_index/ctat_genome_lib_build_dir
  salmon_index:  ./ref_files/salmon_index/transcripts.fa
  msisensor_index: ./ref_files/msisensor_index/hg38/microsatellite.list
  centrifuge_index: ./ref_files/centrifuge_index/p_compressed+h+v
  annotation_pyprada: ./ref_files/pyPRADA/gencode.canonical.gene.exons.tab.txt

###tool path
  msisensor2_path: ./ref_files/msisensor2
  trust4_path: ./ref_files/TRUST4
  arcasHLA_path: ./ref_files/arcasHLA
  prada_path: ./ref_files/pyPRADA/pyPRADA_1.2
```
### Video tutorial
<iframe width="560" height="315" src="https://www.youtube.com/embed/o5QA51nZVtI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


## Running RIMA 

**Check the path of your conda env**

```
conda env list 
# conda environments:
#
base                  *  /home/ubuntu/miniconda3
centrifuge_env           /home/ubuntu/miniconda3/envs/centrifuge_env
fusion_env               /home/ubuntu/miniconda3/envs/fusion_env
rnaseq                   /home/ubuntu/miniconda3/envs/rnaseq
rseqc_env                /home/ubuntu/miniconda3/envs/rseqc_env
stat_perl_r              /home/ubuntu/miniconda3/envs/stat_perl_r
```

**Check the pipeline with a dry run to ensure correct script and data usage.**

```
snakemake -s RIMA.snakefile -np 
```

**Submit the job.**

Alignment and some of the other modules of RIMA will take several hours to run.  It is recommended that you run RIMA in the background using a command such as nohup as below.

```
nohup time snakemake -p -s RIMA.snakefile -j 4 > RIMA.out &
```

note: Argument -j sets the cores for parallel runs (e.g. '-j 4' can run 4 jobs in parallel at the same time.). Argument -p prints the command in each rule. Note: Here, output log records the run information.  A user may run one module at a time to obtain a record of each module's output log.


## Output
Output folders are generated in a folder called "analysis". The output structure of each module will be introduced in the next chapters.


```{r, include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
  .packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
```