AlexsLemonade#959 HGG update hotspots (AlexsLemonade#1077)

* add hotspots maf
* add hotspots and run 02 from focal-cn-file-prep
* add hotspots + consensus maf
* Update analyses/tp53_nf1_score/05-tp53-altered-annotation.Rmd
  Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
* Update analyses/tp53_nf1_score/05-tp53-altered-annotation.Rmd
  Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
* Update analyses/tp53_nf1_score/05-tp53-altered-annotation.Rmd
  Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
* updating cnv loss filter
* updating doc in cnv loss nb
* adding 02 html from focal-cn
* add hotspots+consensus maf
* rerun
* rerun add back lgat filter
* update README
* Update analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd
  Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
* update fread to select from vector
* only keep TERT promoter muts
* Update README.md

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
fanshijianpharmacy · May 28, 2021 · c074457
1 parent 3dc359e

Showing 14 changed files with 301 additions and 229 deletions.
diff --git a/analyses/README.md b/analyses/README.md
@@ -32,7 +32,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`molecular-subtyping-CRANIO`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-CRANIO) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `pbta-snv-scavenged-hotspots.maf.tsv.gz`| Molecular subtyping of craniopharyngiomas samples [#810](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/810) | `results/CRANIO_molecular_subtype.tsv`
 | [`molecular-subtyping-EPN`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EPN) | `pbta-histologies-base.tsv` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv` <br> `analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv` <br> `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` | *In progress*; molecular subtyping of ependymoma tumors | `results/EPN_all_data_withsubgroup.tsv`
 | [`molecular-subtyping-EWS`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EWS) | `pbta-histologies-base.tsv` <br> `analyses/fusion-summary/results/fusion_summary_ewings_foi.tsv`| Reclassification of tumors based on the presence of defining fusions for Ewing Sarcoma per [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) | `results/EWS_samples.tsv`
-| [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `analyses/focal-cn-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `analyses/fusion_filyering/results/pbta-fusion-putative-oncogenic.tsv` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | `results/HGG_molecular_subtype.tsv`
+| [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `pbta-snv-scavenged-hotspots.maf.tsv.gz` <br> `analyses/focal-cn-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `analyses/fusion_filyering/results/pbta-fusion-putative-oncogenic.tsv` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | `results/HGG_molecular_subtype.tsv`
 | [`molecular-subtyping-LGAT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-LGAT)| `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `pbta-snv-scavenged-hotspots.maf.tsv.gz` <br> `analyses/fusion_filtering/results/pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-recurrently-fused-genes-bysample.tsv`| Molecular subtyping of Low-grade astrocytic tumor samples [#631](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/631) | `results/lgat_subtyping.tsv`
 | [`molecular-subtyping-MB`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-MB) | `pbta-histologies-base.tsv` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Molecular classification of Medulloblastoma subtypes (part of [#731](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/731)) | `results/MB_molecular_subtype.tsv` <br> `results/MB_batchcorrected_molecular_subtype.tsv` <br> for uncorrected and batch-corrected input matrix
 | [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | *Deprecated*; Identify the SHH-classified medulloblastoma samples that have TP53 mutations [#247](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/247) | N/A

diff --git a/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd b/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd
@@ -81,21 +81,48 @@ lgat_specimens <- metadata %>%
   pull(Kids_First_Biospecimen_ID)
 ```
 
-# Read in snv consensus mutation data, filtering out LGAT
+# Read in snv consensus and hotspots mutation data, filtering out LGAT
+ - We use `pbta-snv-consensus-mutation.maf.tsv.gz` from the `snv-callers` module, which gathers calls that are present in all 3 callers (strelka2, mutect2, and lancet).
+ - In addition, we use `pbta-snv-scavenged-hotspots.maf.tsv.gz` from the `hotspots-detection` module to gather calls that overlap MSKCC hotspots and are found in any caller. Sites called as variant only by vardict are removed, because many calls are unique to vardict and we consider them false positives, as discussed [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers#snv-caller-comparison-analysis).
+
 ```{r}
-snv_df <-
-  data.table::fread(file.path(root_dir,
-                              "data",
-                              "pbta-snv-consensus-mutation.maf.tsv.gz")) %>%
+
+# columns to keep: genomic position, strand, variant classification, impact,
+# tumor sample barcode, gene, short protein annotation and exon number
+keep_cols <- c("Chromosome",
+               "Start_Position",
+               "End_Position",
+               "Strand",
+               "Variant_Classification",
+               "IMPACT",
+               "Tumor_Sample_Barcode",
+               "Hugo_Symbol",
+               "HGVSp_Short",
+               "Exon_Number")
+
+snv_consensus_maf <- data.table::fread(
+  file.path(root_dir, "data", "pbta-snv-consensus-mutation.maf.tsv.gz"),
+  select = keep_cols,
+  data.table = FALSE)
+## Read in snv hotspot mutation data
+snv_hotspot_maf <- data.table::fread(
+  file.path(root_dir, "analyses", "hotspots-detection", "results", "pbta-snv-scavenged-hotspots.maf.tsv.gz"),
+  select = keep_cols,
+  data.table = FALSE) %>%
+  select(colnames(snv_consensus_maf))
+
+snv_consensus_hotspot_maf <- snv_consensus_maf %>%
+  bind_rows(snv_hotspot_maf) %>%
+  unique() %>%
   filter(!Tumor_Sample_Barcode %in% lgat_specimens) 
+
 ```
 
 
 ## SNV consensus mutation data - defining lesions
 
 ```{r}
 # Filter the snv consensus mutation data for the target lesions
-snv_lesions_df <- snv_df %>%
+snv_lesions_df <- snv_consensus_hotspot_maf %>%
   dplyr::filter(Hugo_Symbol %in% c("H3F3A", "HIST1H3B",
                                    "HIST1H3C", "HIST2H3C") &
                   HGVSp_Short %in% c("p.K28M", "p.G35R",
@@ -130,7 +157,7 @@ snv_lesions_df <- snv_df %>%
 snv_lesions_df <- snv_lesions_df %>%
   dplyr::bind_rows(
     data.frame(
-      Tumor_Sample_Barcode = setdiff(unique(snv_df$Tumor_Sample_Barcode),
+      Tumor_Sample_Barcode = setdiff(unique(snv_consensus_hotspot_maf$Tumor_Sample_Barcode),
                                      snv_lesions_df$Tumor_Sample_Barcode)
     )
   ) %>%
@@ -181,4 +208,4 @@ readr::write_tsv(snv_lesions_df,
 ```{r}
 # Print the session information
 sessionInfo()
-```
+```
diff --git a/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.nb.html b/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.nb.html
diff --git a/analyses/molecular-subtyping-HGG/02-HGG-molecular-subtyping-subset-files.R b/analyses/molecular-subtyping-HGG/02-HGG-molecular-subtyping-subset-files.R
@@ -92,19 +92,32 @@ gistic_df <- data.table::fread(file.path(root_dir,
 
 
 # Read in snv consensus mutation data
-snv_maf_df <-
-  data.table::fread(file.path(root_dir,
-                              "data",
-                              "pbta-snv-consensus-mutation.maf.tsv.gz"),
-                    select = c("Chromosome",
-                               "Start_Position",
-                               "End_Position",
-                               "Strand",
-                               "Variant_Classification",
-                               "Tumor_Sample_Barcode",
-                               "Hugo_Symbol",
-                               "HGVSp_Short"),
-                    data.table = FALSE)
+# columns to keep: genomic position, strand, variant classification, impact,
+# tumor sample barcode, gene, short protein annotation and exon number
+keep_cols <- c("Chromosome",
+               "Start_Position",
+               "End_Position",
+               "Strand",
+               "Variant_Classification",
+               "IMPACT",
+               "Tumor_Sample_Barcode",
+               "Hugo_Symbol",
+               "HGVSp_Short",
+               "Exon_Number")
+
+snv_consensus_maf <- data.table::fread(
+  file.path(root_dir, "data", "pbta-snv-consensus-mutation.maf.tsv.gz"),
+  select = keep_cols,
+  data.table = FALSE)
+## Read in snv hotspot mutation data
+snv_hotspot_maf <- data.table::fread(
+  file.path(root_dir, "analyses", "hotspots-detection", "results", "pbta-snv-scavenged-hotspots.maf.tsv.gz"),
+  select = keep_cols,
+  data.table = FALSE) %>%
+  select(colnames(snv_consensus_maf))
+
+snv_consensus_hotspot_maf <- snv_consensus_maf %>%
+  bind_rows(snv_hotspot_maf) %>%
+  unique()
 
 # Read in output file from `01-HGG-molecular-subtyping-defining-lesions.Rmd`
 hgg_lesions_df <- read_tsv(
@@ -257,12 +270,12 @@ write_tsv(gistic_df,
 
 #### Filter SNV consensus maf data ---------------------------------------------
 
-snv_maf_df <- snv_maf_df %>%
+snv_consensus_hotspot_maf <- snv_consensus_hotspot_maf %>%
   left_join(select_metadata,
             by = c("Tumor_Sample_Barcode" = "Kids_First_Biospecimen_ID")) %>%
   filter(Tumor_Sample_Barcode %in% hgg_metadata_df$Kids_First_Biospecimen_ID) %>%
   arrange(Kids_First_Participant_ID, sample_id)
 
 # Write to file
-write_tsv(snv_maf_df,
+write_tsv(snv_consensus_hotspot_maf,
           file.path(subset_dir, "hgg_snv_maf.tsv.gz"))
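
The two files above share one pattern: read the consensus and hotspot MAFs with the same columns, stack them, and drop exact duplicate rows so a call present in both files is counted once. A minimal sketch of that pattern on toy data follows. It uses base R (`rbind()` and `unique()`) rather than the `dplyr::bind_rows()` call in the commit, and every row, barcode, and the exclusion list are invented for illustration.

```r
# Toy stand-ins for the two MAF tables; all rows are made up for illustration.
consensus_maf <- data.frame(
  Tumor_Sample_Barcode = c("BS_A", "BS_B"),
  Hugo_Symbol          = c("H3F3A", "TP53"),
  HGVSp_Short          = c("p.K28M", "p.R175H"),
  stringsAsFactors     = FALSE
)
hotspot_maf <- data.frame(
  Tumor_Sample_Barcode = c("BS_A", "BS_C"),
  Hugo_Symbol          = c("H3F3A", "HIST1H3B"),
  HGVSp_Short          = c("p.K28M", "p.K28M"),
  stringsAsFactors     = FALSE
)

# Put the hotspot columns in the consensus column order, stack the tables,
# then drop rows that are exact duplicates across all columns.
combined_maf <- unique(
  rbind(consensus_maf, hotspot_maf[, colnames(consensus_maf)])
)

# Exclude specimens on a hypothetical exclusion list (analogous to lgat_specimens).
excluded_specimens <- c("BS_B")
combined_maf <- combined_maf[
  !combined_maf$Tumor_Sample_Barcode %in% excluded_specimens,
]

print(combined_maf)
```

Deduplicating on full rows only removes a hotspot call when every selected column matches the consensus call, which is why both tables are read with the same `keep_cols` vector.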
diff --git a/analyses/molecular-subtyping-HGG/04-HGG-molecular-subtyping-mutation.Rmd b/analyses/molecular-subtyping-HGG/04-HGG-molecular-subtyping-mutation.Rmd
@@ -175,7 +175,9 @@ We likely want to be permissive with the _TERT_ mutations in terms of *regions.*
 ```{r}
 tert_snv_df <- filtered_snv_df %>%
   dplyr::filter(Hugo_Symbol == "TERT",
-                Variant_Classification != "Silent")
+                Variant_Classification == "5'Flank",
+                Start_Position %in% c("1295113","1295135"),
+                End_Position %in% c("1295113","1295135"))
 
 tert_snv_df
 ```
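
The tightened _TERT_ filter keeps only `5'Flank` calls whose start and end positions both fall on one of two promoter coordinates. A minimal base R sketch of that logic on an invented table is shown below. The coordinates are copied from the diff; the toy rows are assumptions, and the positions are written as numbers here on the assumption that the position columns are read as numeric (the diff passes them as strings, which `%in%` coerces).

```r
# Promoter coordinates copied from the filter in the diff.
tert_promoter_positions <- c(1295113, 1295135)

# Toy SNV table; all rows are invented for illustration.
toy_snv_df <- data.frame(
  Hugo_Symbol            = c("TERT", "TERT", "TERT", "TP53"),
  Variant_Classification = c("5'Flank", "5'Flank",
                             "Missense_Mutation", "5'Flank"),
  Start_Position         = c(1295113, 1295000, 1295135, 1295113),
  End_Position           = c(1295113, 1295000, 1295135, 1295113),
  stringsAsFactors       = FALSE
)

# Keep a row only if all four conditions hold.
keep <- toy_snv_df$Hugo_Symbol == "TERT" &
  toy_snv_df$Variant_Classification == "5'Flank" &
  toy_snv_df$Start_Position %in% tert_promoter_positions &
  toy_snv_df$End_Position %in% tert_promoter_positions

tert_snv_df <- toy_snv_df[keep, ]
print(tert_snv_df)
```

In this toy table only the first row passes: the second is a `5'Flank` call at a different position, the third sits on a listed position but is not a `5'Flank` call, and the fourth is not in _TERT_.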