-
Notifications
You must be signed in to change notification settings - Fork 5
/
08-HLA.Rmd
110 lines (80 loc) · 13.4 KB
/
08-HLA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
# HLA Typing & Neoantigens {#HLA}
Tumor mutations can result in altered proteins, which might act as neoantigens and potentially elicit an immune response. Algorithms, such as those available in <a href="https://github.com/griffithlab/pVACtools">pVAC-Tools</a> can be used to predict neoantigen peptides.
In order to elicit an adaptive immune response, a peptide must be presented on a cell's surface bound to major histocompatibility (MHC) protein complexes. The human leukocyte antigen (HLA) complex on chromosome 6 encodes most of the proteins that make up the human MHCs. The HLA complex is highly polymorphic and encodes MHCs that have different propensities to present different peptides. Two major classes of human MHC, class I and class II, (encoded by HLA-I and HLA-II alleles, respectively) are involved in antigen presentation. MHC class I protein complexes are expressed on all normal cells and present internally degraded proteins, while MHC class II protein complexes are normally expressed on professional antigen-presenting cells and present processed external antigens. The antigen binding portions of the MHC class I protein complexes are encoded by the HLA-I genes HLA-A, HLA-B and HLA-C. The antigen binding portions of the MHC class II proteins are encoded by alpha (A) and beta (B) genes in the DP, DQ and DR regions of the HLA complexes. These HLA-II genes include HLA-DQA1, HLA-DQB1 and HLA-DRB1.
Neoantigen prediction algorithms usually integrate peptide/MHC binding predictions using tools such as <a href="https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1">NetMHCpan</a>, <a href= "https://github.com/openvax/mhcflurry">MHCflurry</a>, and <a href="https://pubmed.ncbi.nlm.nih.gov/17608956/">SMMalign</a>. Knowing the HLA type is necessary to identify potential neoantigens for targeted immunotherapy. Current alignment-based HLA typing methods require DNA or RNA sequencing inputs and predict HLA-I class only or both of HLA-I and HLA-II classes. For example, <a href="https://github.com/RabadanLab/arcasHLA">arcasHLA</a>, <a href="https://github.com/ExpressionAnalysis/HLAProfiler">HLAProfiler</a>, and <a href="https://github.com/TRON-Bioinformatics/seq2HLA">seq2HLA</a> have been developed to perform high-solution HLA typing from RNA-seq data. <a href="https://github.com/FRED-2/OptiType">OptiType</a> and <a href="https://pubmed.ncbi.nlm.nih.gov/29858810/">PHLAT</a> are other HLA identification tools for RNA, whole-exome, and whole-genome sequencing data. HLA reference sequences can be obtained from the <a href="http://www.imgt.org/">ImMunoGeneTics (IMGT) database</a>.
## Identify HLA type
HLA Typing is a part of the neoantigen prediction module of RIMA.
RIMA uses <a href="https://github.com/RabadanLab/arcasHLA">arcasHLA</a> to predict HLA types for both MHC Class I & Class II from the bulk RNA-seq data. The sorted alignment BAM files generated by STAR are used for input to arcasHLA. Here we use one sample from the Zhao trial as an example. We first extract the HLA reads from the alignment file:
```
### Extract fastq reads
arcasHLA extract analysis/STAR/SRR8281218/SRR8281218.sorted.bam -t 16 -v --sample SRR8281218 -o analysis/neoantigen/SRR8281218
### Output from extraction
analysis/neoantigen/SRR8281218/SRR8281218.extracted.1.fq.gz
analysis/neoantigen/SRR8281218/SRR8281218.extracted.2.fq.gz
```
Then we identify the HLA alleles using the extracted reads:
```
arcasHLA genotype analysis/neoantigen/SRR8281218/SRR8281218.extracted.1.fq.gz analysis/neoantigen/SRR8281218/SRR8281218.extracted.2.fq.gz -g A,B,C,DQA1,DQB1,DRB1 -t 16 -v -o analysis/neoantigen/SRR8281218
###Output from extraction
analysis/neoantigen/SRR8281218/SRR8281218.genotype.json
cat analysis/neoantigen/SRR8281218/SRR8281218.genotype.json
###
{"A": ["A*26:01:01", "A*03:01:01"], "B": ["B*35:01:01", "B*07:02:01"], "C": ["C*07:02:01", "C*04:01:01"], "DQA1": ["DQA1*02:01:01", "DQA1*03:01:01"], "DQB1": ["DQB1*03:02:01"], "DRB1": ["DRB1*04:02:01", "DRB1*07:01:01"]}
###
```
**Merge individual HLAs**
RIMA also merges the individual HLA results from arcasHLA into a summary file named 'genotypes.tsv':
```
subject A1 A2 B1 B2 C1 C2 DQA11 DQA12 DQB11 DQB12 DRB11 DRB12
SRR8281238 A*01:01:01 A*02:01:01 B*35:01:01 B*08:01:01 C*07:01:01 C*04:01:01 DQA1*05:01:01 DQA1*01:01:01 DQB1*02:01:01 DQB1*05:01:01 DRB1*01:01:01 DRB1*03:01:01
SRR8281233 A*01:01:01 A*02:01:01 B*57:01:01 B*44:02:01 C*05:01:01 C*06:02:01 DQA1*01:02:01 DQA1*03:01:01 DQB1*06:02:01 DQB1*03:02:01 DRB1*04:01:01 DRB1*15:01:01
SRR8281236 A*33:01:01 A*24:02:01 B*14:02:01 B*15:01:01 C*06:02:01 C*08:02:01 DQA1*01:02:02 DQA1*03:01:01 DQB1*03:02:01 DQB1*05:02:01 DRB1*04:03:01 DRB1*16:02:01
SRR8281243 A*01:01:01 A*24:02:01 B*35:02:01 B*41:01:01 C*04:01:01 C*17:01:01 DQA1*01:02:01 DQA1*01:05:01 DQB1*05:01:01 DQB1*06:09:01 DRB1*10:01:01 DRB1*13:02:01
SRR8281251 A*24:02:01 A*01:01:01 B*35:02:01 B*41:01:01 C*04:01:01 C*17:01:01 DQA1*01:02:01 DQA1*01:05:01 DQB1*05:01:01 DQB1*06:09:01 DRB1*10:01:01 DRB1*13:02:01
SRR8281230 A*01:01:01 A*02:01:01 B*57:01:01 B*44:02:01 C*05:01:01 C*06:02:01 DQA1*03:01:01 DQA1*01:02:01 DQB1*03:02:01 DQB1*06:02:01 DRB1*04:01:01 DRB1*15:01:01
SRR8281250 A*01:01:01 A*02:01:01 B*35:01:01 B*08:01:01 C*07:01:01 C*04:01:01 DQA1*05:01:01 DQA1*01:01:01 DQB1*02:01:01 DQB1*05:01:01 DRB1*01:01:01 DRB1*03:01:01
SRR8281244 A*25:01:01 A*02:01:01 B*18:01:01 B*08:01:01 C*07:02:01 C*12:03:01 DQA1*01:02:01 DQA1*01:02:01 DQB1*06:02:01 DQB1*06:02:01 DRB1*15:01:01 DRB1*15:01:01
SRR8281218 A*26:01:01 A*03:01:01 B*35:01:01 B*07:02:01 C*07:02:01 C*04:01:01 DQA1*02:01:01 DQA1*03:01:01 DQB1*03:02:01 DQB1*03:02:01 DRB1*04:02:01 DRB1*07:01:01
```
## HLA Oncoplot
To better visualize and compare the HLA distribution in different samples, we use arcasHLA to convert HLA results to P-group format. The P-group format shows the alleles which share the same amino acid sequence in the antigen-binding region. (eg. A*01:01P, 'P' indicates p-group nomenclature)
```
###
#HLA summary file: genotypes.tsv
arcasHLA convert -r p-group genotypes.tsv -o genotypes.p-group.tsv
###
```
```
subject A1 A2 B1 B2 C1 C2 DQA11 DQA12 DQB11 DQB12 DRB11 DRB12
SRR8281233 A*01:01P A*02:01P B*57:01P B*44:02P C*05:01P C*06:02P DQA1*01:02P DQA1*03:01P DQB1*06:02P DQB1*03:02P DRB1*04:01P DRB1*15:01P
SRR8281236 A*33:01P A*24:02P B*14:02P B*15:01P C*06:02P C*08:02P DQA1*01:02P DQA1*03:01P DQB1*03:02P DQB1*05:02P DRB1*04:03P DRB1*16:02P
SRR8281243 A*01:01P A*24:02P B*35:02P B*41:01P C*04:01P C*17:01P DQA1*01:02P DQA1*01:01P DQB1*05:01P DQB1*06:09P DRB1*10:01P DRB1*13:02P
SRR8281244 A*25:01P A*02:01P B*18:01P B*08:01P C*07:02P C*12:03P DQA1*01:02P DQA1*01:02P DQB1*06:02P DQB1*06:02P DRB1*15:01P DRB1*15:01P
SRR8281250 A*01:01P A*02:01P B*35:01P B*08:01P C*07:01P C*04:01P DQA1*05:01P DQA1*01:01P DQB1*02:01P DQB1*05:01P DRB1*01:01P DRB1*03:01P
SRR8281219 A*03:01P A*26:01P B*35:01P B*07:02P C*07:02P C*04:01P DQA1*02:01P DQA1*03:01P DQB1*03:02P DQB1*03:03P DRB1*04:02P DRB1*07:01P
SRR8281245 A*25:01P A*02:01P B*18:01P B*08:01P C*07:02P C*12:03P DQA1*01:02P DQA1*01:02P DQB1*06:02P DQB1*06:02P DRB1*15:01P DRB1*15:01P
SRR8281230 A*01:01P A*02:01P B*57:01P B*44:02P C*05:01P C*06:02P DQA1*03:01P DQA1*01:02P DQB1*06:02P DQB1*03:02P DRB1*04:01P DRB1*15:01P
SRR8281251 A*24:02P A*01:01P B*35:02P B*41:01P C*04:01P C*17:01P DQA1*01:02P DQA1*01:01P DQB1*05:01P DQB1*06:09P DRB1*10:01P DRB1*13:02P
SRR8281218 A*26:01P A*03:01P B*35:01P B*07:02P C*07:02P C*04:01P DQA1*02:01P DQA1*03:01P DQB1*03:02P DQB1*03:02P DRB1*04:02P DRB1*07:01P
SRR8281238 A*01:01P A*02:01P B*35:01P B*08:01P C*07:01P C*04:01P DQA1*05:01P DQA1*01:01P DQB1*02:01P DQB1*05:01P DRB1*01:01P DRB1*03:01P
SRR8281226 A*33:01P A*24:02P B*14:02P B*15:01P C*06:02P C*08:02P DQA1*03:01P DQA1*01:02P DQB1*05:02P DQB1*03:02P DRB1*04:03P DRB1*16:02P
```
RIMA generates an HLA oncoplot which shows the most frequent HLA alleles across patient cohorts and the expression level of HLA genes (HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, and HLA-DRB1). The percentages on the left of the heatmap represent the frequency of the allele among samples. The colors in the heatmap indicate which allele (e.g. A = HLA-A) and whether the allele was found in the first (A1), the second (A2) or both copies of the patient's gene. The colored bars to the right indicate the absolute number of alleles found. The x axis represents patient samples. The samples are sorted by group and then by the expression of HLA genes within the group. The HLA expression shown on the top represents the mean TPM of HLA genes in each sample.
```{r}
#load HLA p-group summary file
hla <- read.table("genotypes.p-group.tsv", header = TRUE, sep = "\t")
#load metasheet
meta <- read.table("metasheet.csv", sep = ",", header = TRUE, row.names = 1)
#load gene expression matrix
exprn <- read.table("tpm_convertID.txt", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
source("hla_oncoplot.R")
#group indicates which columns are used for hla comparison
index <- c("Responder", "Gender")
p <- hla_oncoplot(hla, exprn, meta, groups = index)
print(p)
```
## Neoantigen Identification
To identify neoantigens, we will need the somatic mutation and HLA allele information introduced in previous sub-chapters. The standard approach for somatic mutation detection is the analysis of whole-exome (WES) or whole-genome sequencing data from tumor-normal matched pairs. Computational algorithms are used to call mutations from the resulting data. For example, MuTect is based on a Bayesian classifier, MuSE is based on a Bayesian Markov model, SomaticSniper is based on a Bayesian comparison of genotype likelihoods, and Varscan2 is based on a heuristic algorithm classifying somatic status. Among these approaches, Varscan2 can be applied to RNA-seq data with good coverage, although the mutation calls are still expected to be noisier than those from DNA sequencing. After somatic mutation detection, VEP can be used to annotate mutations on genes, transcripts, and regulatory regions. Sinee it's not common to have deep-sequenced tumor-normal paired samples, current RIMA is not supporting the function of calling somatic mutation from RNA-seq data. In this chapter, we are showing the example of identifying expressed neoantigens integrating both RNA-seq and WES data.
Here we use one matched tumor-normal sample for both RNA and WES of the same patient from the Zhao trial as an example. We use RIMA preprocessing module to get the Salmon gene quantification [SRR7588416.format.quant.sf](./neoantigen/SRR7588416/SRR7588416.format.quant.sf). We use the WES pipeline to process the tumor fastq reads (SRR7588416) and matched normal fastqs (SRR7588421): Somatic Mutation is being called using <a href = 'https://support.sentieon.com/manual/TNscope_usage/tnscope/'>Sentieon Somatic Caller </a> to output a VCF file. Next, we annotate the VCF file with <a href='https://useast.ensembl.org/info/docs/tools/vep/index.html'>Ensembl Variant Effect Predictor (VEP) </a>. [SRR7588416_tnscope.neoantigen.vep.vcf](./neoantigen/SRR7588416_tnscope.neoantigen.vep.vcf) is the example of VEP annotated VCF output. The <a href='https://vatools.readthedocs.io/en/latest/vcf_expression_annotator.html'>'vcf-expression-annotator'</a> tool from <a href="https://github.com/griffithlab/pVACtools">pVAC-Tools</a> can be used to add the expression information from the outputs obtained from Salmon. This will add "TX" (i.e Transcript Expression) tag to the 'FORMAT' column of the VCF output [SRR7588416_tnscope.neoantigen.vep.rna.vcf](./neoantigen/SRR7588416_tnscope.neoantigen.vep.rna.vcf).
<a href="https://github.com/FRED-2/OptiType">Optitype</a> and <a href="https://github.com/humanlongevity/HLA">xHLA</a> are used to get the MHC Class I&II allels results. [SRR7588416_result_optitype.tsv](./neoantigen/SRR7588416_result_optitype.tsv) and [SRR7588416_result_xHLA.json](./neoantigen/report-SRR7588416-hla.json) stores HLA alleles results respectively.
After generating the somatic hypermutation with expression information, and MHC class I&II alleles, we used <a href='https://pvactools.readthedocs.io/en/latest/pvacseq/getting_started.html#running-pvacseq-using-docker'>'pvacseq run'</a> command to estimate the peptide binding affinity to predicted HLAs. [SRR7588416.all_epitopes.tsv](./neoantigen/SRR7588416/combined/SRR7588416.all_epitopes.tsv) is th example results of MCH Class I&II peptides with length ranges from 8 to 18 predicted by NetMHCpan and NetMHCIIpan. [SRR7588416.all_epitopes.aggregated.tsv](./neoantigen/SRR7588416/combined/SRR7588416.all_epitopes.aggregated.tsv) is the aggreagated results of the best-scoring (lowest binding affinity) epitope. Detailed information of columns in each output file can be found [here](https://pvactools.readthedocs.io/en/latest/pvacseq/output_files.html).