analyses/molecular-subtyping-HGG/05-HGG-molecular-subtyping-fusion.Rmd

---
title: "High-Grade Glioma Molecular Subtyping - Fusions"
output: 
  html_notebook:
    toc: TRUE
    toc_float: TRUE
author: Chante Bethell and Jaclyn Taroni for ALSF CCDL
date: 2020
---

This notebook prepares putative oncogenic data for the purpose of subtyping HGG samples 
([`AlexsLemonade/OpenPBTA-analysis#249`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249)).

## Usage

This notebook is intended to be run via the command line from the top directory
of the repository as follows:

```
Rscript -e "rmarkdown::render('analyses/molecular-subtyping-HGG/05-HGG-molecular-subtyping-fusion.Rmd', clean = TRUE)"
```

## Set up

### Libraries and Functions

```{r}
library(tidyverse)
```

### Directories

```{r}
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
# File path to results directory
input_dir <-
  file.path(root_dir, "analyses", "molecular-subtyping-HGG", "hgg-subset")
# File path to data directory
data_dir <- file.path(root_dir, "data")
# File path to results directory
results_dir <-
  file.path(root_dir, "analyses", "molecular-subtyping-HGG", "results")
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}
```

### Read in Files

The fusion file was  from `fusion-summary` module, which included the types of gene fusion and Biospecimen_id for LGG/HGG.

Crossing fusion file with hgg_metadata file to select HGG cohort.

```{r}
hgg_meta <- read_tsv(file.path(input_dir, "hgg_metadata.tsv"))

fusion_hgg <- read_tsv(file.path(data_dir, "fusion_summary_lgg_hgg_foi.tsv")) %>%
  filter(Kids_First_Biospecimen_ID %in% hgg_meta$Kids_First_Biospecimen_ID)

```

### Output File

```{r}
output_file <- file.path(results_dir, "HGG_cleaned_fusion.tsv")
```

## What fusions are we interested in?

There are two genes of interest according to [`AlexsLemonade/OpenPBTA-analysis#249`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249).
Specifically:

* _FGFR1_ fusions should be mutually exclusive of H3 K28 mutations 
* _NTRK_ fusions are co-occurring with H3 G35 mutations

In the original issue, there is no mention of specific fusion partners or orientations.
Thus, we will look at _any_ instances of fusions that contain _FGFR1_ or _NTRK_.

## New subtype of HGG

There is a new 2021 entity within HGGs called "Infant-type hemispheric glioma" (IHG). This HGG is cerebral (hemispheric), arises in early childhood, and is characterized by RTK (receptor tyrosine kinase) alterations, typically fusions, in the NTRK family or in ROS1, ALK, or MET.

The subtypes are:

IHG, NTRK-altered
IHG, ROS1-altered
IHG, ALK-altered
IHG, MET-altered

The gene fusion with _ROS1_, _ALK_ and _MET_ are added into the output


## Filter and Wrangle Fusion Data

```{r}

RTK_list <- c("FGFR1", "NTRK", "MET", "ROS1", "ALK")

summary_df <- fusion_hgg %>%
  mutate(HGG_Fusion_evidence = apply(.[,colnames(.)[grepl(paste(RTK_list, collapse = "|"), colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>%
  mutate(HGG_Fusion_counts = apply(.[,colnames(.)[grepl(paste(RTK_list, collapse = "|"), colnames(.))]], 1, FUN =  function(x) {length(names(x)[x == 1])})) %>%
  mutate(NTRK_fusions = apply(.[,colnames(.)[grepl("NTRK", colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>% 
  mutate(MET_fusions = apply(.[,colnames(.)[grepl("MET", colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>% 
  mutate(ROS1_fusions = apply(.[,colnames(.)[grepl("ROS1", colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>% 
  mutate(ALK_fusions = apply(.[,colnames(.)[grepl("ALK", colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>% 
  mutate(FGFR1_fusions = apply(.[,colnames(.)[grepl("FGFR1", colnames(.))]], 1, FUN =  function(x) {paste(names(x)[x == 1],collapse="/")})) %>% 
  select(c("Kids_First_Biospecimen_ID", "NTRK_fusions", "MET_fusions", "ROS1_fusions", "ALK_fusions", "FGFR1_fusions", "HGG_Fusion_evidence", "HGG_Fusion_counts")) %>% 
  dplyr::arrange(Kids_First_Biospecimen_ID)

summary_df[summary_df == ""] = "None"

head(summary_df[order(summary_df$FGFR1_fusions),])
```

There is a single _FGFR1_ fusion present in the filtered and prioritized fusion table for samples that were included in the HGG subset file.
Recall that samples in the HGG subset file either had a defining lesion (H3 K28 or H3 G35 mutation) _or_ were labeled `HGAT` in the `short_histology` column of `pbta-histologies-base.tsv`.

_FGFR1_ fusions are _mutually exclusive_ so this result is not necessarily unexpected.

_NTRK_ refers to [a _family_ of receptor kinases](https://www.biooncology.com/pathways/cancer-tumor-targets/ntrk/ntrk-oncogenesis.html) and not an individual gene symbol.
To retain information about which gene we're talking about we will include the full fusion name in the table when we present this information.


Write file.

```{r}
summary_df %>% 
  write_tsv(output_file)
```

## Session Info

```{r}
sessionInfo()
```