update info

rnabioco · Oct 1, 2023 · 91a5580 · 91a5580
1 parent f0a0617
commit 91a5580
Show file tree

Hide file tree

Showing 15 changed files with 59 additions and 27 deletions.
diff --git a/_freeze/exercises/ex-16/execute-results/html.json b/_freeze/exercises/ex-16/execute-results/html.json
@@ -1,8 +1,10 @@
 {
-  "hash": "9046622821e87511de2b486bc61ecb79",
+  "hash": "66a8a49a13c6ed3040807e83690a4fec",
   "result": {
-    "markdown": "---\ntitle: \"Mapping chromatin structure and transactions\"\nauthor: \"Your Name Here\"\n---\n\n\n## `bed_intersect()` example {.smaller}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(valr)\nlibrary(dplyr)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n\nAttaching package: 'dplyr'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe following objects are masked from 'package:stats':\n\n    filter, lag\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe following objects are masked from 'package:base':\n\n    intersect, setdiff, setequal, union\n```\n\n\n:::\n\n```{.r .cell-code}\nx <- tribble(\n  ~chrom, ~start, ~end,\n  \"chr1\", 25, 50,\n  \"chr1\", 100, 125\n)\n\ny <- tribble(\n  ~chrom, ~start, ~end,\n  \"chr1\", 30,     75\n)\n```\n:::\n\n\n## valr example\n\nYou can use `read_bed()` and related functions to load genome annotations and signals.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsnps <- read_bed(\n  valr_example(\"hg19.snps147.chr22.bed.gz\"),\n  n_fields = 6\n)\n\ngenes <- read_bed(\n  valr_example(\"genes.hg19.chr22.bed.gz\"),\n  n_fields = 6\n)\n```\n:::\n\n\n## What is in `snps` and `genes`? {.smaller}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Interval manipulation {.smaller}\n\nLet's find and characterize intergenic SNPs. We'll use the tools `bed_substract()` and `bed_closest()`. Take a look and their examples in the documentation to see what they do.\n\n\n::: {.cell}\n\n:::\n\n\n. . .\n\nTake a look at the `intergenic` and `nearby` objects in the console.\n\n## Interval manipulation {.smaller}\n\nNow that you have overlaps and distances between SNPs and genes, you can \ngo back to dplyr tools to generate reports.\n\n\n::: {.cell}\n\n:::\n\n\n## `bed_map()` example {.smaller}\n\nCopy / paste these into your console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- tribble(\n ~chrom, ~start, ~end,\n 'chr1', 100,    250,\n 'chr2', 250,    500\n)\n\ny <- tribble(\n ~chrom, ~start, ~end, ~value,\n 'chr1', 100,    250,  10,\n 'chr1', 150,    250,  20,\n 'chr2', 250,    500,  500\n)\n```\n:::\n\n\n## `bed_map()` example continued {.smaller}\n\nFirst examine the intersecting intervals.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n",
-    "supporting": [],
+    "markdown": "---\ntitle: \"Mapping chromatin structure and transactions\"\nauthor: \"Your Name Here\"\n---\n\n\n## `bed_intersect()` example {.smaller}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(valr)\nlibrary(dplyr)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'dplyr'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:stats':\n\n    filter, lag\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:base':\n\n    intersect, setdiff, setequal, union\n```\n:::\n\n```{.r .cell-code}\nx <- tribble(\n  ~chrom, ~start, ~end,\n  \"chr1\", 25, 50,\n  \"chr1\", 100, 125\n)\n\ny <- tribble(\n  ~chrom, ~start, ~end,\n  \"chr1\", 30,     75\n)\n```\n:::\n\n\n## valr example\n\nYou can use `read_bed()` and related functions to load genome annotations and signals.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsnps <- read_bed(\n  valr_example(\"hg19.snps147.chr22.bed.gz\"),\n  n_fields = 6\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: The `n_fields` argument of `read_bed()` is deprecated as of valr 0.6.9.\nℹ fields are now determined automatically from the file\n```\n:::\n\n```{.r .cell-code}\ngenes <- read_bed(\n  valr_example(\"genes.hg19.chr22.bed.gz\"),\n  n_fields = 6\n)\n```\n:::\n\n\n## What is in `snps` and `genes`? {.smaller}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Interval manipulation {.smaller}\n\nLet's find and characterize intergenic SNPs. We'll use the tools `bed_substract()` and `bed_closest()`. Take a look and their examples in the documentation to see what they do.\n\n\n::: {.cell}\n\n:::\n\n\n. . .\n\nTake a look at the `intergenic` and `nearby` objects in the console.\n\n## Interval manipulation {.smaller}\n\nNow that you have overlaps and distances between SNPs and genes, you can \ngo back to dplyr tools to generate reports.\n\n\n::: {.cell}\n\n:::\n\n\n## `bed_map()` example {.smaller}\n\nCopy / paste these into your console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- tribble(\n  ~chrom, ~start, ~end,\n  \"chr1\", 100,    250,\n  \"chr2\", 250,    500\n)\n\ny <- tribble(\n  ~chrom, ~start, ~end, ~value,\n  \"chr1\", 100,    250,  10,\n  \"chr1\", 150,    250,  20,\n  \"chr2\", 250,    500,  500\n)\n```\n:::\n\n\n## `bed_map()` example continued {.smaller}\n\nFirst examine the intersecting intervals.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n",
+    "supporting": [
+      "ex-16_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],

diff --git a/_freeze/exercises/ex-18/execute-results/html.json b/_freeze/exercises/ex-18/execute-results/html.json
@@ -1,8 +1,10 @@
 {
-  "hash": "17932c6a2ad5f5c80fc3a5f06e45d0b0",
+  "hash": "4e4d286ef43731e3e6a19f15dedea312",
   "result": {
     "markdown": "---\ntitle: \"Chromatin accessibility II\"\nauthor: \"Jay Hesselberth\"\nexecute: \n  eval: false\n---\n\n\n## Genomewide chromatin analysis with metaplots and heatmaps\n\nLast class we saw what the different methods to profile chromatin\naccessibility can tell us about general chromatin structure and possible regulation\nat specific regions in a small portion of a chromosome.\n\nWe also want to make sure these conclusions are valid throughout the genome, and today\nwe'll discuss two approaches for this: metaplots and heatmaps.\n\n# Metaplots\n\nWe will examine aggregated profiles (metaplots) of genomewide data (e.g., MNase-seq) relative to \nrecurrent features in a genome, like transcription start sites.\n\n### Load libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nlibrary(cowplot)\n\n# `glue` is a handy library for plot annotations\nlibrary(glue)\n\nlibrary(valr)\nlibrary(ComplexHeatmap)\n```\n:::\n\n\n### Load data\n\nWe will examine aggreprofiles of all our data sets relative to the\ntranscription start site (TSS), where all the action seems to be happening:\n\nFirst, we need to load relevant files:\n\n-   `yeast_tss_chrII.bed.gz` contains transcription start sites (TSS) for genes on yeast chromosome 2.\n-   `sacCer3.chrom.sizes` contains the sizes of all yeast chromosomes, needed for some of the calculations we'll do. We'll grab this from the UCSC download site.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyeast_tss <- read_???(\n  here(\"data/block-dna/yeast_tss_chrII.bed.gz\")\n)\n\ngenome <- read_???(\"https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes\")\n```\n:::\n\n\n### Load signals\n\nNext we'll load bigWigs for the ATAC and MNase experiments, containing either short or long fragments.\n\nRecall that the information encoded in short and long fragments should be reflected in our interpretations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nacc_tbl <-\n  tibble(\n    file_name = c(\n      \"yeast_mnase_lt50.bw\",\n      \"yeast_mnase_134_160.bw\",\n      \"yeast_atac_lt120.bw\",\n      \"yeast_atac_gt120.bw\"\n    ),\n    sample_type = c(\n      \"MNase_Short\", \"MNase_Long\",\n      \"ATAC_Short\", \"ATAC_Long\"\n    )\n  ) |>\n  mutate(\n    file_path = here(\"data/block-dna\", file_name),\n    big_wig = purrr::___(\n      file_path, ~ read_bigwig(.x)\n    )\n  ) |>\n  select(sample_type, big_wig)\n```\n:::\n\n\n## Setting up regions for a meta-plot\n\nWe need to set up some windows for analyzing signal relative to each TSS.\nThis is an important step that will ultimately impact our interpretations.\n\nIn genomic meta-plots, you first decide on a window size relevant to the\nsome points of refernce in the genome you are measuring. \n\nThese reference points could be:\n\n- transcription start or end sites\n- boundaries of exons and introns\n- enhancers\n- centromeres and telomeres\n\nThe window size should be relevant the reference points, such that small- or\nlarge-scale features are emphasized in the plot. Moreover, the window typically\nspans some distance both up- and downstream of the reference points.\n\nOnce the window size has been decided, the next step is to make \"sub-windows\"\naround a reference point. If gene features are involved, we also need to take\nstrand into account.\n\n(The state of genome annotation directly influences the quality of the meta-plot\nor heatmap.)\n\nFor small features like transcription factor binding sites (8-20 bp), you might\nset up smaller windows (maybe 1 bp) at a distance \\~20 bp up- and downstream of\na reference point.\n\nFor larger features like nucleosome positions or chromatin domains, you might\nset up larger windows (\\~200 bp) at distances up to \\~10 kbp up- and downstream\nof a set of reference points.\n\n## Metaplot workflow\n\n![Metaplot workflow overview](../img/block-dna/metaplot-workflow.png)\n\n## Chromatin accessibility around transcription start sites (TSSs)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregion_size_total <- 1500\nregion_size_half <- region_size_total / 2\nwin_size <- 10\n\n# need a function that generates a sequence of numbers\nwin_coords <- ___(\n  -region_size_half,\n  region_size_half,\n  win_size\n)\n```\n:::\n\n\nNext, we'll use two valr functions to expand the window of the reference\npoint (`bed_slop()`) and then break those windows into evenly spaced intervals \n(`bed_makewindows()`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyeast_tss |>\n  bed_???(genome, both = region_size_half) |>\n  bed_???(win_size = win_size)\n```\n:::\n\n\nAt this point, we also address the fact that the TSS data are stranded. Can\nsomeone describe what the issue is here, based on the figure above?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntss_win_tbl <-\n  yeast_tss |>\n  bed_slop(genome, both = region_size_half) |>\n  bed_makewindows(win_size = win_size) |>\n  mutate(\n    win_coord = case_when(\n      strand == \"-\" ~ rev(win_coords),\n      .default = win_coords\n    ),\n    .by = name\n  ) |>\n  select(-.win_id, -score, -strand)\n```\n:::\n\n\nThis next step uses valr `bed_map()`, to calculate the total signal for each\nwindow by intersecting signals from the bigWig files.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nacc_tbl <-\n  acc_tbl |>\n  mutate(\n    tss_win_sum = purrr::map(\n      big_wig,\n      ~ bed_map(\n        tss_win_tbl,\n        .x,\n        win_signal = sum(score)\n      )\n    )\n  )\n```\n:::\n\n\nOnce we have the values from `bed_map()`, we can group by `win_coord` and\ncalculate a summary statistic for each window.\n\nRemember that `win_coord` is the same relative position for each TSS, so these\nnumbers represent a composite signal a the same position across all TSS.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntss_meta_tbl <-\n  select(acc_tbl, sample_type, tss_win_sum) |>\n  unnest(cols = c(tss_win_sum)) |>\n  summarize(\n    win_mean = mean(win_signal, na.rm = TRUE),\n    win_sd = sd(win_signal, na.rm = TRUE),\n    .by = c(win_coord, sample_type)\n  ) |>\n  replace_na(list(win_mean = 0, win_sd = 0))\n```\n:::\n\n\n## Meta-plot of signals around TSSs\n\nFinally, let's plot the data relative to TSS for each of the windows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_tss <- length(unique(yeast_tss$name))\n\nggplot(\n  tss_meta_tbl,\n  aes(\n    x = win_coord,\n    y = win_mean\n  )\n) +\n  geom_line(linewidth = 1, color = \"red\") +\n  facet_wrap(\n    ~sample_type,\n    nrow = 2,\n    scales = \"free_y\"\n  ) +\n  theme_minimal_grid() +\n  theme(\n    axis.text.x = element_text(\n      angle = 45,\n      vjust = 1,\n      hjust = 1\n    )\n  ) +\n  labs(\n    x = \"Position relative to TSS (bp)\",\n    y = \"Signal (mean of window sums)\",\n    title = \"Chromatin accessibility around transcription start sites\",\n    subtitle = glue(\"{n_tss} features on S. cerevisiae chrII\")\n  )\n```\n:::\n\n\n## Interpreting the meta-plots \n\n- What is the direction of transcription in these meta-plots?\n\n- What are the features of chromatin near TSS measured by these different experimental conditions?\n\n- How do you interpret the increased signal of the +1 nucleosome in the \"MNase_Long\" condition, relative to e.g. -1, +2, +3, etc.?\n\n- What are the differences in ATAC and MNase treatments that lead to these distinctive patterns?\n\n# Heatmaps\n\n## Heatmap of signals around TSSs\n\nTo generate a heatmap, we need to reformat our data slightly.\n\nTake a look at `acc_tbl` and think about how you might reorganize the following way:\n\n- rows contain the data for individual loci (i.e., each TSS)\n- columns are ordered positions relative to the TSS (i.e., most upstream to most downstream)\n\nWe're going to plot a heatmap of the \"MNase_Long\" data. There are two ways\nto get these data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmnase_tbl <- acc_tbl[acc_tbl$sample_type == \"MNase_Short\", ]$tss_win_sum[[1]]\n```\n:::\n\n\nOr, using dplyr / tidyr:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmnase_tbl <-\n  filter(\n    acc_tbl,\n    sample_type == \"MNase_Long\"\n  ) |>\n    select(-big_wig) |>\n    unnest(cols = c(tss_win_sum))\n```\n:::\n\n\nEither way, now we need to reformat the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmnase_tbl_wide <-\n  mnase_tbl |>\n  select(\n    name,\n    win_coord,\n    win_signal\n  ) |>\n  # this step is important to sort + and - windows correctly\n  arrange(name, win_coord) |>\n  replace_na(\n    list(win_signal = 0)\n  ) |>\n  pivot_???(\n    id_cols = name,\n    names_from = \"win_coord\",\n    values_from = \"win_signal\"\n  )\n```\n:::\n\n\nOnce we have the data reformatted, we just convert to a matrix and feed it to\n`ComplexHeatmap::Heatmap()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# drop the name & convert to matrix\nmnase_mtx <-\n  select(mnase_tbl_wide, -name) |>\n  as.___()\n\nComplexHeatmap::___(\n  mnase_mtx,\n  cluster_columns = FALSE,\n  show_row_dend = FALSE,\n  show_column_names = FALSE,\n  show_heatmap_legend = FALSE\n)\n```\n:::\n\n\n## Interpreting meta-plots and heatmaps\n\nIt's worth considering what meta-plots and heatmaps *can* and *can't* tell you.\n\n1.  What are the similarities and differences between heatmaps and meta-plots?\n\n2.  What types of conclusions can you draw from each type of plot?\n\n3.  What are some features of MNase-seq and ATAC-seq that become more clear when\nlooking across many loci at the same time?\n\n4.  What are some hypotheses you can generate based on these plots?\n\n## Assessing the signfiicance of observed patterns.\n\nOne approach for assessing the signifcance of a pattern (e.g., the density\nof nucleosomes around transcription start sites), is to randomize the reference\npoints, and use those to remeasure density. This can be done by generating\na random set of TSS like:\n\n``` r\nn_tss <- length(unique(yeast_tss$name))\nbed_random(genome, length = 1, n = n_tss)\n```\n\nGiven enough random sites, you will be able to vizulaized what a null distribution of overlaps looks like. If you reframe your question around a response variable (i.e., \"how many intact nucleosomes are within 500 bp of a TSS?\") then you can calcuate an empirical p-value using the approaches covered in the bootcamp.\n\n## Utility of metaplots and heatmaps in practice\n\nThere are a few specific cases when you should consider using the metaplot /\nheatmap strategy.\n\nIf you have data from two different conditions (i.e., plus/minus a transcription\nfactor that might change the state of multiple promoters), you could make a\nmetaplot for each of the plus and minus conditions, to see of common patterns\nmanifest. You might also create a heatmap of the conditions, which might reveal\nsmaller groups of affected genes that would be masked by the metaplot\naggregation.\n\nThe approaches are also useful to compare patterns across types of reference points.\nFor example, you might examine chromatin marks (measured by CUT&TAG) relative to\nthe following: transcription starts, exon starts, exon ends, transcription ends.\n\n  \n  \n\n",
-    "supporting": [],
+    "supporting": [
+      "ex-18_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],

diff --git a/_freeze/exercises/ex-22/execute-results/html.json b/_freeze/exercises/ex-22/execute-results/html.json
@@ -0,0 +1,14 @@
+{
+  "hash": "a265f2b802dbddd5f42560553b4b5757",
+  "result": {
+    "markdown": "---\ntitle: \"ex 22\"\nsubtitle: \"RNA-sequencing intro\"\nauthor: \"Neelanjan Mukherjee\"\n---\n\n\n\n\n\n## Examine count data {.smaller}\n\n\n::: {.cell output-location='column-fragment'}\n\n```{.r .cell-code}\nd <- read_csv(here(\"data\",\"unfilt_counts.csv.gz\")) |> as.matrix()\n\n\ndf <- tibble(variance=??,\n                mean=??)\n\nggplot(??) +\n        geom_point(aes(x=??, y=??)) + \n        scale_y_log10(limits = c(1,1e9)) +\n        scale_x_log10(limits = c(1,1e9)) +\n        geom_abline(intercept = 0, slope = 1, color=\"red\") +\n  theme_cowplot()\n```\n:::\n\n\n## estimateSizeFactors {.smaller}\n\n\n\n::: {.cell output-location='fragment'}\n\n```{.r .cell-code}\nd <- read_csv(here(\"data\",\"unfilt_counts.csv.gz\")) |> as.matrix()\n\n# estimate size factors\n```\n:::\n",
+    "supporting": [],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}