Skip to content

Commit

Permalink
ps 18 key
Browse files Browse the repository at this point in the history
  • Loading branch information
jayhesselberth committed Sep 29, 2023
1 parent a1f4a1a commit 1781cb2
Show file tree
Hide file tree
Showing 4 changed files with 258 additions and 0 deletions.
16 changes: 16 additions & 0 deletions _freeze/problem-set-keys/ps-key-18/execute-results/html.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"hash": "b10915bcd5064a495aaa94e0e8b3d589",
"result": {
"markdown": "---\ntitle: \"DNA Block - Problem Set 18\"\n---\n\n\n## Problem Set\n\nTotal points: 20. First problem is worth 7 points, second problem is worth 13 points.\n\n## Load libraries\n\nStart by loading libraries you need analysis in the code chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --\nv dplyr 1.1.3 v readr 2.1.4\nv forcats 1.0.0 v stringr 1.5.0\nv ggplot2 3.4.3 v tibble 3.2.1\nv lubridate 1.9.2 v tidyr 1.3.0\nv purrr 1.0.2 \n-- Conflicts ------------------------------------------ tidyverse_conflicts() --\nx dplyr::filter() masks stats::filter()\nx dplyr::lag() masks stats::lag()\ni Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nhere() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(cowplot)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n\nAttaching package: 'cowplot'\n\nThe following object is masked from 'package:lubridate':\n\n stamp\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(valr)\n```\n:::\n\n\nLoad the data from the MNase-seq experiment.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# XXX: blanks here\nmnase_tbl <- read_bed(\n here(\"data/block-dna/yeast_mnase_chrII.bed.gz\")\n)\n```\n:::\n\n\nIn class we learned that MNase digestion yields nucleosomal \"footprints\" of \\~150 bp in size. I've added blue vertical lines to emphasize positions of the major peak (intact nucleosomes) as well as smaller \"sub-nucleosomal\" peak.\n\n![](img/ps-17-mnase-histogram.png)\n\nWe can calculate the counts for the histogram above and more precisely determine the maximum signal using `which.max()` to identify the *index* of the maximum value in a vector (*not* the value itself!):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfrag_hist <-\n mnase_tbl |>\n mutate(frag_len = end - start) |>\n count(frag_len)\n\n# `which.max` takes a vector and gives us the index of the maximum value\nmax_idx <- which.max(frag_hist$n)\n\n# now we can use index to find the abundant fragment size\nncp_max <- frag_hist$frag_len[max_idx]\n```\n:::\n\n\nSo this tells us that that the most abundant fragment size in the library is 149 bp.\n\n## Question 1 -- 7 points\n\nLet's take a closer look at the some of the smaller fragments in the MNase experiment. In particular, let's zoom in on the populations of fragments that are smaller than 1 nucleosome in size, the peak between 85 and 95 bp (the left-most blue vertical line).\n\n1. Use the above strategy to precisely determine the peak of this smaller size range. How big are those fragments? These are called \"sub-nucleosomal\" fragments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfrag_hist_sub <- filter(frag_hist, frag_len > 85 & frag_len < 95)\nn_max_sub <- which.max(frag_hist_sub$n)\nsubnuc_max <- frag_hist_sub$frag_len[n_max_sub]\n```\n:::\n\n\n> The population of fragments is at 92 bp.\n\n2. Do this one more time, and identify the position of maximum signal in the disome peak (i.e., the fragments protected by two nucleosomes).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfrag_hist_di <- filter(frag_hist, frag_len > 280 & frag_len < 320)\nn_max_di <- which.max(frag_hist_di$n)\ndinuc_max <- frag_hist_di$frag_len[n_max_di]\n```\n:::\n\n\n3. Recreate the histogram using ggplot2 (using relevant code from class 17) and add the blue vertical lines at the peak positions you calculated, including the position of the disomes above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n mnase_tbl,\n aes(x = end - start)\n) +\n geom_histogram(\n # single base-pair resolution\n binwidth = 1\n ) +\n labs(\n x = \"fragment length (bp)\",\n title = \"Histogram of fragment lengths from MNase-seq\"\n ) +\n theme_cowplot() +\n geom_vline(xintercept = c(subnuc_max, ncp_max, dinuc_max), color = \"blue\")\n```\n\n::: {.cell-output-display}\n![](ps-key-18_files/figure-html/plot-frag-hist-1.png){width=672}\n:::\n:::\n\n\n## Question 2 -- 13 points\n\nNext we're going to look at *where* these sub-sucleosomes are with respect to intact nucleosomes.\n\nOur strategy will be to compare the density of sub-nucleosomes relative to the mid-points of previously mapped nucleosomes. Specifically, our reference point will be the midpoints of the +1 nucleosomes.\n\nSo you'll make a metaplot, but instead of using transcription start sites as the reference point, we'll use the midpoints of the +1 nucleosome, and instead of MNase-seq signal density, you'll count up the number of individual reads that intersect with windows around those midpoints.\n\n1. First, load the relevant data. We'll re-use the `yeast_mnase_chrII.bed.gz` data you loaded above, plus you'll need to load two other files:\n\n - a \"genome\" file, `sacCer3.chrom.sizes`\n - a BED file, `yeast_p1_chrII.bed.gz` which contains the mid-points of the +1 nucleosomes on chromosome 22. Recall that the +1 nucleosome is the nucleosome downstream of the transcription start site.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# XXX: blanks here\ngenome <- read_genome(here(\"data/block-dna/sacCer3.chrom.sizes\"))\n# XXX: blanks here\np1_tbl <- read_bed(here(\"data/block-dna/yeast_p1_chrII.bed.gz\"))\n```\n:::\n\n\n2. Next, we need the mid-points of nucleosomes for comparison. The following function needs fixing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncalc_mids <- function(tbl, min_len, max_len) {\n tbl |>\n mutate(\n # XXX: blanks here\n frag_len = end - start\n ) |>\n filter(\n # XXX: blanks here\n frag_len >= min_len & frag_len <= max_len\n ) |>\n mutate(\n # XXX: blanks here\n midpoint = start + round((end - start) / 2)\n ) |>\n select(chrom, midpoint) |>\n rename(start = midpoint) |>\n mutate(end = start + 1)\n}\n```\n:::\n\n\n3. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nncp_mids_tbl <-\n calc_mids(mnase_tbl, ncp_max - 3, ncp_max + 3) |>\n bed_slop(genome, both = 20)\n\nsubnuc_mids_tbl <-\n calc_mids(mnase_tbl, subnuc_max - 3, subnuc_max + 3) |>\n bed_slop(genome, both = 20)\n```\n:::\n\n\n4. Next, we need to make the intervals for the metaplot. We'll look 100 bp up- and downstream of the +1 nucleosome positions, and make windows that are 1 bp in size.\n\n\n::: {.cell}\n\n```{.r .cell-code}\np1_win_tbl <-\n bed_slop(\n p1_tbl,\n genome,\n both = 100\n ) |>\n # XXX: blank here\n bed_makewindows(win_size = 1)\n```\n:::\n\n\n5. Almost there! Now you just need to identify the number of short and long nuclesome fragments (based on their midpoints) that intersect with the +1 nucleosomes you defined above.\n\n Use `bed_intersect()` to identify fragments that overlap, and then just count the number of fragments per `.win_id` (don't forget the suffix). Note you will do this separately for the short and long fragments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nncp_mids_summary_tbl <-\n bed_intersect(\n p1_win_tbl,\n ncp_mids_tbl\n ) |>\n count(.win_id.x) |>\n mutate(type = \"Intact nucleosomes (~149 bp)\")\n\nsubnuc_mids_summary_tbl <-\n bed_intersect(\n p1_win_tbl,\n subnuc_mids_tbl\n ) |>\n count(.win_id.x) |>\n mutate(type = \"Sub-nucleosomes (~90 bp)\")\n```\n:::\n\n\n6. The following joins the tables you made together, and makes the x-axis more informative, by converting to position rather than window ID.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwin_ids <- seq(-100, 100, 1)\n\nall_tbl <- bind_rows(\n ncp_mids_summary_tbl,\n subnuc_mids_summary_tbl\n) |>\n mutate(win_ids = win_ids, .by = \"type\")\n```\n:::\n\n\n7. Finally, we plot the data with position on the x-axis, and count on the y-axis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n all_tbl,\n aes(win_ids, n)\n) +\n geom_line(\n linewidth = 1,\n color = \"red\"\n ) +\n facet_wrap(\n ~ type,\n scales = \"free_y\"\n ) +\n theme_minimal_grid() +\n labs(\n x = \"Position relative to +1 nucleosome midpoints\",\n y = \"Number of intersecting fragments\",\n title = \"Fragment density around +1 nucleosome midpoints\"\n )\n```\n\n::: {.cell-output-display}\n![](ps-key-18_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\n## Interpretation\n\nHow do you interpret these plots?\n\nRationalize the pattern for intact nucleosomes. What pattern did you expect to see?\n\n> Expectation is that large nucleosomal fragments will be centered on +1 nucleosome midpoints, so the plot agrees. The width of the density has to do with the size of the intervals we defined (expanding single-base pair midpoints by 20 and 100 bp (for query and reference intervals, respectively).\n\nRationalize the pattern for sub-nucleosome. How would you describe the positions of sub-nucleosomal fragments, relative to the +1 nucleosome midpoints? What might this mean with respect to gene transcription?\n\n> The density is bimodal, indicating two distributions of sub-nucleosomal fragments relative to +1 nucleosome midpoints. These fragments are formed at the edges of the nucleosome core particl during the process of nucleosome disassembly during transcription (losing a histone dimer from one side or the other, see Ramachandran et al. in the block resources document).\n\nWhat do the differences between signal magnitudes (reflected by the y-axis) mean?\n\n> There are more positioned +1 nucleosomes then there are sub-nucleosomal intermediates, which can be rationalized by the fact that only a subset of genes are on / being transcribed in the cells at the time of the experiment.\n\n## Submit\n\nBe sure to click the \"Render\" button to render the HTML output. Then paste the URL of this Posit Cloud project into the problem set on Canvas.\n",
"supporting": [
"ps-key-18_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
242 changes: 242 additions & 0 deletions problem-set-keys/ps-key-18.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
---
title: "DNA Block - Problem Set 18"
---

## Problem Set

Total points: 20. First problem is worth 7 points, second problem is worth 13 points.

## Load libraries

Start by loading libraries you need analysis in the code chunk below.

```{r}
#| label: load-libs
library(tidyverse)
library(here)
library(cowplot)
library(valr)
```

Load the data from the MNase-seq experiment.

```{r}
#| label: load-mnase-bed
# XXX: blanks here
mnase_tbl <- read_bed(
here("data/block-dna/yeast_mnase_chrII.bed.gz")
)
```

In class we learned that MNase digestion yields nucleosomal "footprints" of \~150 bp in size. I've added blue vertical lines to emphasize positions of the major peak (intact nucleosomes) as well as smaller "sub-nucleosomal" peak.

![](img/ps-17-mnase-histogram.png)

We can calculate the counts for the histogram above and more precisely determine the maximum signal using `which.max()` to identify the *index* of the maximum value in a vector (*not* the value itself!):

```{r}
#| label: frag-hist
frag_hist <-
mnase_tbl |>
mutate(frag_len = end - start) |>
count(frag_len)
# `which.max` takes a vector and gives us the index of the maximum value
max_idx <- which.max(frag_hist$n)
# now we can use index to find the abundant fragment size
ncp_max <- frag_hist$frag_len[max_idx]
```

So this tells us that that the most abundant fragment size in the library is `r ncp_max` bp.

## Question 1 -- 7 points

Let's take a closer look at the some of the smaller fragments in the MNase experiment. In particular, let's zoom in on the populations of fragments that are smaller than 1 nucleosome in size, the peak between 85 and 95 bp (the left-most blue vertical line).

1. Use the above strategy to precisely determine the peak of this smaller size range. How big are those fragments? These are called "sub-nucleosomal" fragments.

```{r}
#| label: frag-hist-sub
frag_hist_sub <- filter(frag_hist, frag_len > 85 & frag_len < 95)
n_max_sub <- which.max(frag_hist_sub$n)
subnuc_max <- frag_hist_sub$frag_len[n_max_sub]
```

> The population of fragments is at `r subnuc_max` bp.
2. Do this one more time, and identify the position of maximum signal in the disome peak (i.e., the fragments protected by two nucleosomes).

```{r}
#| label: frag-hist-disome
frag_hist_di <- filter(frag_hist, frag_len > 280 & frag_len < 320)
n_max_di <- which.max(frag_hist_di$n)
dinuc_max <- frag_hist_di$frag_len[n_max_di]
```

3. Recreate the histogram using ggplot2 (using relevant code from class 17) and add the blue vertical lines at the peak positions you calculated, including the position of the disomes above.

```{r}
#| label: plot-frag-hist
ggplot(
mnase_tbl,
aes(x = end - start)
) +
geom_histogram(
# single base-pair resolution
binwidth = 1
) +
labs(
x = "fragment length (bp)",
title = "Histogram of fragment lengths from MNase-seq"
) +
theme_cowplot() +
geom_vline(xintercept = c(subnuc_max, ncp_max, dinuc_max), color = "blue")
```

## Question 2 -- 13 points

Next we're going to look at *where* these sub-sucleosomes are with respect to intact nucleosomes.

Our strategy will be to compare the density of sub-nucleosomes relative to the mid-points of previously mapped nucleosomes. Specifically, our reference point will be the midpoints of the +1 nucleosomes.

So you'll make a metaplot, but instead of using transcription start sites as the reference point, we'll use the midpoints of the +1 nucleosome, and instead of MNase-seq signal density, you'll count up the number of individual reads that intersect with windows around those midpoints.

1. First, load the relevant data. We'll re-use the `yeast_mnase_chrII.bed.gz` data you loaded above, plus you'll need to load two other files:

- a "genome" file, `sacCer3.chrom.sizes`
- a BED file, `yeast_p1_chrII.bed.gz` which contains the mid-points of the +1 nucleosomes on chromosome 22. Recall that the +1 nucleosome is the nucleosome downstream of the transcription start site.

```{r}
#| label: load-data-2
# XXX: blanks here
genome <- read_genome(here("data/block-dna/sacCer3.chrom.sizes"))
# XXX: blanks here
p1_tbl <- read_bed(here("data/block-dna/yeast_p1_chrII.bed.gz"))
```

2. Next, we need the mid-points of nucleosomes for comparison. The following function needs fixing.

```{r}
#| label: calc-midpoints
calc_mids <- function(tbl, min_len, max_len) {
tbl |>
mutate(
# XXX: blanks here
frag_len = end - start
) |>
filter(
# XXX: blanks here
frag_len >= min_len & frag_len <= max_len
) |>
mutate(
# XXX: blanks here
midpoint = start + round((end - start) / 2)
) |>
select(chrom, midpoint) |>
rename(start = midpoint) |>
mutate(end = start + 1)
}
```

3.

```{r}
ncp_mids_tbl <-
calc_mids(mnase_tbl, ncp_max - 3, ncp_max + 3) |>
bed_slop(genome, both = 20)
subnuc_mids_tbl <-
calc_mids(mnase_tbl, subnuc_max - 3, subnuc_max + 3) |>
bed_slop(genome, both = 20)
```

4. Next, we need to make the intervals for the metaplot. We'll look 100 bp up- and downstream of the +1 nucleosome positions, and make windows that are 1 bp in size.

```{r}
p1_win_tbl <-
bed_slop(
p1_tbl,
genome,
both = 100
) |>
# XXX: blank here
bed_makewindows(win_size = 1)
```

5. Almost there! Now you just need to identify the number of short and long nuclesome fragments (based on their midpoints) that intersect with the +1 nucleosomes you defined above.

Use `bed_intersect()` to identify fragments that overlap, and then just count the number of fragments per `.win_id` (don't forget the suffix). Note you will do this separately for the short and long fragments.

```{r}
ncp_mids_summary_tbl <-
bed_intersect(
p1_win_tbl,
ncp_mids_tbl
) |>
count(.win_id.x) |>
mutate(type = "Intact nucleosomes (~149 bp)")
subnuc_mids_summary_tbl <-
bed_intersect(
p1_win_tbl,
subnuc_mids_tbl
) |>
count(.win_id.x) |>
mutate(type = "Sub-nucleosomes (~90 bp)")
```

6. The following joins the tables you made together, and makes the x-axis more informative, by converting to position rather than window ID.

```{r}
win_ids <- seq(-100, 100, 1)
all_tbl <- bind_rows(
ncp_mids_summary_tbl,
subnuc_mids_summary_tbl
) |>
mutate(win_ids = win_ids, .by = "type")
```

7. Finally, we plot the data with position on the x-axis, and count on the y-axis.

```{r}
ggplot(
all_tbl,
aes(win_ids, n)
) +
geom_line(
linewidth = 1,
color = "red"
) +
facet_wrap(
~ type,
scales = "free_y"
) +
theme_minimal_grid() +
labs(
x = "Position relative to +1 nucleosome midpoints",
y = "Number of intersecting fragments",
title = "Fragment density around +1 nucleosome midpoints"
)
```

## Interpretation

How do you interpret these plots?

Rationalize the pattern for intact nucleosomes. What pattern did you expect to see?

> Expectation is that large nucleosomal fragments will be centered on +1 nucleosome midpoints, so the plot agrees. The width of the density has to do with the size of the intervals we defined (expanding single-base pair midpoints by 20 and 100 bp (for query and reference intervals, respectively).
Rationalize the pattern for sub-nucleosome. How would you describe the positions of sub-nucleosomal fragments, relative to the +1 nucleosome midpoints? What might this mean with respect to gene transcription?

> The density is bimodal, indicating two distributions of sub-nucleosomal fragments relative to +1 nucleosome midpoints. These fragments are formed at the edges of the nucleosome core particl during the process of nucleosome disassembly during transcription (losing a histone dimer from one side or the other, see Ramachandran et al. in the block resources document).
What do the differences between signal magnitudes (reflected by the y-axis) mean?

> There are more positioned +1 nucleosomes then there are sub-nucleosomal intermediates, which can be rationalized by the fact that only a subset of genes are on / being transcribed in the cells at the time of the experiment.
## Submit

Be sure to click the "Render" button to render the HTML output. Then paste the URL of this Posit Cloud project into the problem set on Canvas.

0 comments on commit 1781cb2

Please sign in to comment.