forked from d3b-center/OpenPedCan-analysis
-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy path05-HGG-molecular-subtyping-fusion.Rmd
executable file
·135 lines (94 loc) · 4.88 KB
/
05-HGG-molecular-subtyping-fusion.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: "High-Grade Glioma Molecular Subtyping - Fusions"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: Chante Bethell and Jaclyn Taroni for ALSF CCDL
date: 2020
---
This notebook prepares putative oncogenic data for the purpose of subtyping HGG samples
([`AlexsLemonade/OpenPBTA-analysis#249`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249)).
## Usage
This notebook is intended to be run via the command line from the top directory
of the repository as follows:
```
Rscript -e "rmarkdown::render('analyses/molecular-subtyping-HGG/05-HGG-molecular-subtyping-fusion.Rmd', clean = TRUE)"
```
## Set up
### Libraries and Functions
```{r}
library(tidyverse)
```
### Directories
```{r}
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
# File path to results directory
input_dir <-
file.path(root_dir, "analyses", "molecular-subtyping-HGG", "hgg-subset")
# File path to data directory
data_dir <- file.path(root_dir, "data")
# File path to results directory
results_dir <-
file.path(root_dir, "analyses", "molecular-subtyping-HGG", "results")
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
```
### Read in Files
The fusion file was from `fusion-summary` module, which included the types of gene fusion and Biospecimen_id for LGG/HGG.
Crossing fusion file with hgg_metadata file to select HGG cohort.
```{r}
hgg_meta <- read_tsv(file.path(input_dir, "hgg_metadata.tsv"))
fusion_hgg <- read_tsv(file.path(data_dir, "fusion_summary_lgg_hgg_foi.tsv")) %>%
filter(Kids_First_Biospecimen_ID %in% hgg_meta$Kids_First_Biospecimen_ID)
```
### Output File
```{r}
output_file <- file.path(results_dir, "HGG_cleaned_fusion.tsv")
```
## What fusions are we interested in?
There are two genes of interest according to [`AlexsLemonade/OpenPBTA-analysis#249`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249).
Specifically:
* _FGFR1_ fusions should be mutually exclusive of H3 K28 mutations
* _NTRK_ fusions are co-occurring with H3 G35 mutations
In the original issue, there is no mention of specific fusion partners or orientations.
Thus, we will look at _any_ instances of fusions that contain _FGFR1_ or _NTRK_.
## New subtype of HGG
There is a new 2021 entity within HGGs called "Infant-type hemispheric glioma" (IHG). This HGG is cerebral (hemispheric), arises in early childhood, and is characterized by RTK (receptor tyrosine kinase) alterations, typically fusions, in the NTRK family or in ROS1, ALK, or MET.
The subtypes are:
IHG, NTRK-altered
IHG, ROS1-altered
IHG, ALK-altered
IHG, MET-altered
The gene fusion with _ROS1_, _ALK_ and _MET_ are added into the output
## Filter and Wrangle Fusion Data
```{r}
RTK_list <- c("FGFR1", "NTRK", "MET", "ROS1", "ALK")
summary_df <- fusion_hgg %>%
mutate(HGG_Fusion_evidence = apply(.[,colnames(.)[grepl(paste(RTK_list, collapse = "|"), colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
mutate(HGG_Fusion_counts = apply(.[,colnames(.)[grepl(paste(RTK_list, collapse = "|"), colnames(.))]], 1, FUN = function(x) {length(names(x)[x == 1])})) %>%
mutate(NTRK_fusions = apply(.[,colnames(.)[grepl("NTRK", colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
mutate(MET_fusions = apply(.[,colnames(.)[grepl("MET", colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
mutate(ROS1_fusions = apply(.[,colnames(.)[grepl("ROS1", colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
mutate(ALK_fusions = apply(.[,colnames(.)[grepl("ALK", colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
mutate(FGFR1_fusions = apply(.[,colnames(.)[grepl("FGFR1", colnames(.))]], 1, FUN = function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
select(c("Kids_First_Biospecimen_ID", "NTRK_fusions", "MET_fusions", "ROS1_fusions", "ALK_fusions", "FGFR1_fusions", "HGG_Fusion_evidence", "HGG_Fusion_counts")) %>%
dplyr::arrange(Kids_First_Biospecimen_ID)
summary_df[summary_df == ""] = "None"
head(summary_df[order(summary_df$FGFR1_fusions),])
```
There is a single _FGFR1_ fusion present in the filtered and prioritized fusion table for samples that were included in the HGG subset file.
Recall that samples in the HGG subset file either had a defining lesion (H3 K28 or H3 G35 mutation) _or_ were labeled `HGAT` in the `short_histology` column of `pbta-histologies-base.tsv`.
_FGFR1_ fusions are _mutually exclusive_ so this result is not necessarily unexpected.
_NTRK_ refers to [a _family_ of receptor kinases](https://www.biooncology.com/pathways/cancer-tumor-targets/ntrk/ntrk-oncogenesis.html) and not an individual gene symbol.
To retain information about which gene we're talking about we will include the full fusion name in the table when we present this information.
Write file.
```{r}
summary_df %>%
write_tsv(output_file)
```
## Session Info
```{r}
sessionInfo()
```