class 24 edits and ps

rnabioco · Oct 5, 2023 · faa6463
1 parent d2a572f
Showing 26 changed files with 1,183 additions and 1,076 deletions.
diff --git a/_freeze/exercises/ex-24/execute-results/html.json b/_freeze/exercises/ex-24/execute-results/html.json
diff --git a/_freeze/problem-sets/ps-23/execute-results/html.json b/_freeze/problem-sets/ps-23/execute-results/html.json
@@ -1,8 +1,10 @@
 {
-  "hash": "4cc2f2fa34486778caabe16164fe69fe",
+  "hash": "04723a7f5892e96e8b0a86b4295f76a6",
   "result": {
-    "markdown": "---\ntitle: \"RNA Block - Problem Set 23\"\n---\n\n\n## Problem Set\n\nTotal points: 20. First problem is worth 10 points, second and third problems are worth 5 points.\n\n## Load libraries\n\nStart by loading libraries you need analysis in the code chunk below.\n\n\n\n\n\nWe have an experiment where we can take neuronal cells and mechanically separate them into soma and neurite fractions. By sequencing RNA from both of these fractions and comparing the relative abundances of RNAs, we can get a sense of how neurite-localized every RNA is. We can also combine this approach with knockouts of specific RBPs. If we have an RBP that we think is involved in this process, we can do this subcellular fractionation in sequencing in both WT and RBP-knockout (KO) cells. Transcripts that depend upon the RBP for transcript to the neurite should be less neurite-enriched in the KO samples than the WT samples.\n\nWe recently completed this process in mouse cells that lack the RBP TDP-43. We have RNA sequence data for 4 conditions: WT soma, WT neurite, KO soma, and KO neurite with 3 replicates of each condition. These samples have been quantified with `salmon`.\n\nRead in this data, collapse `salmon`'s transcript-level quantification to gene-level quantification with `tximport`. Then assess the quality of this data by performing hierarchical clustering of pairwise spearman correlation values and PCA analysis of TPM expression values.\n\nThe salmon data lives in `data/block-rna/salmon_tdp43`. 
In that directory, you will find one `salmon` output directory for each sample.\n\n## Q1: read in salmon data (10 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#There are some hints to help you get started\n\n#Use biomaRt to get a table of transcript/gene relationships\nmart <- biomaRt::useMart(\n  \"ENSEMBL_MART_ENSEMBL\",\n  dataset = \"??\",\n  host = \"www.ensembl.org\"\n)\n\nt2g <- biomaRt::getBM(attributes = c(\"ensembl_transcript_id\", \"ensembl_gene_id\", \"external_gene_name\"), mart = mart) |>\n  dplyr::select(??, ??)\n\n\n\n#Read in salmon quantification files\n\nmetadata <- data.frame(sample_id = list.files(here(\"??\")),\n                   salmon_dirs = list.files(here(\"??\"),recursive = T,pattern = \".gz$\", full.names = T)\n\n                   ) |> \n  separate(col = ??,\n           into = c(\"cell\",\"loc\",\"geno\",\"rep\"),\n           sep = \"??\",\n           remove = F)\n\nmetadata$rep <- gsub(pattern = \"Rep\", replacement = \"\", metadata$rep) \n\nrownames(metadata) <- metadata$sample_id\n\n\n#Get gene-level TPM values with tximport\n\nsalmdir <- metadata$??\nnames(salmdir) <- metadata$??\n\n\ntxi <- tximport(files = ??,\n  type = \"salmon\",\n  tx2gene = ??,\n  dropInfReps = TRUE,\n  countsFromAbundance = \"lengthScaledTPM\"\n)\n\n#Filter genes to remove those that are not expressed at at least 1 TPM in EVERY sample\ntpms <- txi$?? |>\n  as.data.frame() |>\n  rownames_to_column(var = \"ensembl_gene_id\")\n\n\ntpms.cutoff <-\n  mutate(tpms, nSamples = rowSums(tpms[, 2:??] > 1)) |>\n  filter(nSamples >= ??) 
|>\n  \n  dplyr::select(-nSamples)\n```\n:::\n\n\n\n## Q2: make correlation heatmap (5 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Use cor() to get a matrix of pairwise correlations between samples\ntpms.cor <- cor(??, method = \"??\")\n\n#Use pheatmap() to plot correlation matrix\n\npheatmap(\n  ??,\n  annotation_col = metadata[,??], # what would be interesting to add as colored categories\n  fontsize = 7,\n  show_colnames = FALSE\n)\n```\n:::\n\n\n![what your answer should look like](/img/block-rna/tdp43_heatmap.png){width=\"75%\"}\n\n> Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.\n\n## Q3: make PCA plot (5 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Start with the filtered TPM table from above\ntpms.cutoff.matrix <- tpms.cutoff |>\n  dplyr::select(-??) |>\n  as.??()\n\n\n#Use prcomp() to derive principle component coordinants of *LOGGED*  and *Scaled* TPM values\ntpms.cutoff.matrix <- log??(tpms.cutoff.matrix + ??)\n\n# scale\ntpms.cutoff.matrix <- ??(scale(??(tpms.cutoff.matrix)))\n\n\n# principle components\ntpms.pca <- prcomp(t(tpms.cutoff.matrix))\n\n#Add annotations of the cell compartment (soma / neurite) and TDP-43 status (WT / KO) of the samples\ntpms.pca.pc <- tpms.pca$x %>%\n  as.data.frame() %>%\n  rownames_to_column(var = \"sample_id\") %>% \n  left_join(., metadata[,c(1,??)], by = \"??\")\n\n## \n\ntpms.pca.summary <- summary(tpms.pca)$importance\npc1var <- round(tpms.pca.summary[2, 1] * 100, 1)\npc2var <- round(tpms.pca.summary[2, 2] * 100, 1)\n\n#Plot PCA data\n\nggplot(data = tpms.pca.pc,\n  aes(\n    x = PC1, y = PC2,\n    color = paste(??,??), label = sample_id\n  )\n) +\n  geom_point(size = 5) +\n  scale_color_brewer(palette = \"Set1\") +\n  theme_cowplot(16) +\n  labs(\n    x = paste(\"PC1,\", pc1var, \"% explained var.\"),\n    y = paste(\"PC2,\", pc2var, \"% explained var.\")\n  ) +\n  geom_text_repel()\n```\n:::\n\n\n![what your answer should look 
like](/img/block-rna/tdp43_pca.png){width=\"75%\"}\n\n> Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.\n",
-    "supporting": [],
+    "markdown": "---\ntitle: \"RNA Block - Problem Set 23\"\n---\n\n\n## Problem Set\n\nTotal points: 20. First problem is worth 10 points, second and third problems are worth 5 points.\n\n## Load libraries\n\nStart by loading libraries you need analysis in the code chunk below.\n\n\n\n\n\nWe have an experiment where we can take neuronal cells and mechanically separate them into soma and neurite fractions. By sequencing RNA from both of these fractions and comparing the relative abundances of RNAs, we can get a sense of how neurite-localized every RNA is. We can also combine this approach with knockouts of specific RBPs. If we have an RBP that we think is involved in this process, we can do this subcellular fractionation in sequencing in both WT and RBP-knockout (KO) cells. Transcripts that depend upon the RBP for transcript to the neurite should be less neurite-enriched in the KO samples than the WT samples.\n\nWe recently completed this process in mouse cells that lack the RBP TDP-43. We have RNA sequence data for 4 conditions: WT soma, WT neurite, KO soma, and KO neurite with 3 replicates of each condition. These samples have been quantified with `salmon`.\n\nRead in this data, collapse `salmon`'s transcript-level quantification to gene-level quantification with `tximport`. Then assess the quality of this data by performing hierarchical clustering of pairwise spearman correlation values and PCA analysis of TPM expression values.\n\nThe salmon data lives in `data/block-rna/salmon_tdp43`. 
In that directory, you will find one `salmon` output directory for each sample.\n\n## Q1: read in salmon data (10 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#There are some hints to help you get started\n\n#Use biomaRt to get a table of transcript/gene relationships\nmart <- biomaRt::useMart(\n  \"ENSEMBL_MART_ENSEMBL\",\n  dataset = \"??\",\n  host = \"www.ensembl.org\"\n)\n\nt2g <- biomaRt::getBM(attributes = c(\"ensembl_transcript_id\", \"ensembl_gene_id\", \"external_gene_name\"), mart = mart) |>\n  dplyr::select(??, ??)\n\n\n\n#Read in salmon quantification files\n\nmetadata <- data.frame(sample_id = list.files(here(\"??\")),\n                   salmon_dirs = list.files(here(\"??\"),recursive = T,pattern = \".gz$\", full.names = T)\n\n                   ) |> \n  separate(col = ??,\n           into = c(\"cell\",\"loc\",\"geno\",\"rep\"),\n           sep = \"??\",\n           remove = F)\n\nmetadata$rep <- gsub(pattern = \"Rep\", replacement = \"\", metadata$rep) \n\nrownames(metadata) <- metadata$sample_id\n\n\n#Get gene-level TPM values with tximport\n\nsalmdir <- metadata$??\nnames(salmdir) <- metadata$??\n\n\ntxi <- tximport(files = ??,\n  type = \"salmon\",\n  tx2gene = ??,\n  dropInfReps = TRUE,\n  countsFromAbundance = \"lengthScaledTPM\"\n)\n\n#Filter genes to remove those that are not expressed at at least 1 TPM in EVERY sample\ntpms <- txi$?? |>\n  as.data.frame() |>\n  rownames_to_column(var = \"ensembl_gene_id\")\n\n\ntpms.cutoff <-\n  mutate(tpms, nSamples = rowSums(tpms[, 2:??] > 1)) |>\n  filter(nSamples >= ??) 
|>\n  \n  dplyr::select(-nSamples)\n```\n:::\n\n\n\n## Q2: make correlation heatmap (5 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Use cor() to get a matrix of pairwise correlations between samples\ntpms.cor <- cor(??, method = \"??\")\n\n#Use pheatmap() to plot correlation matrix\n\npheatmap(\n  ??,\n  annotation_col = metadata[,??], # what would be interesting to add as colored categories\n  fontsize = 7,\n  show_colnames = FALSE\n)\n```\n:::\n\n\n![HINT: what your answer should look like](/img/block-rna/tdp43_heatmap.png){width=\"75%\"}\n\n> Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.\n\n## Q3: make PCA plot (5 pts)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Start with the filtered TPM table from above\ntpms.cutoff.matrix <- tpms.cutoff |>\n  dplyr::select(-??) |>\n  as.??()\n\n\n#Use prcomp() to derive principle component coordinants of *LOGGED*  and *Scaled* TPM values\ntpms.cutoff.matrix <- log??(tpms.cutoff.matrix + ??)\n\n# scale\ntpms.cutoff.matrix <- ??(scale(??(tpms.cutoff.matrix)))\n\n\n# principle components\ntpms.pca <- prcomp(t(tpms.cutoff.matrix))\n\n#Add annotations of the cell compartment (soma / neurite) and TDP-43 status (WT / KO) of the samples\ntpms.pca.pc <- tpms.pca$x %>%\n  as.data.frame() %>%\n  rownames_to_column(var = \"sample_id\") %>% \n  left_join(., metadata[,c(1,??)], by = \"??\")\n\n## \n\ntpms.pca.summary <- summary(tpms.pca)$importance\npc1var <- round(tpms.pca.summary[2, 1] * 100, 1)\npc2var <- round(tpms.pca.summary[2, 2] * 100, 1)\n\n#Plot PCA data\n\nggplot(data = tpms.pca.pc,\n  aes(\n    x = PC1, y = PC2,\n    color = paste(??,??), label = sample_id\n  )\n) +\n  geom_point(size = 5) +\n  scale_color_brewer(palette = \"Set1\") +\n  theme_cowplot(16) +\n  labs(\n    x = paste(\"PC1,\", pc1var, \"% explained var.\"),\n    y = paste(\"PC2,\", pc2var, \"% explained var.\")\n  ) +\n  geom_text_repel()\n```\n:::\n\n\n![HINT: what your answer should look 
like](/img/block-rna/tdp43_pca.png){width=\"75%\"}\n\n> Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.\n",
+    "supporting": [
+      "ps-23_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],