-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
f213cd1
commit d2a572f
Showing
8 changed files
with
359 additions
and
0 deletions.
There are no files selected for viewing
16 changes: 16 additions & 0 deletions
16
_freeze/problem-set-keys/ps-key-21/execute-results/html.json
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
{ | ||
"hash": "2294420d44d660d06133d6b75d5880a7", | ||
"result": { | ||
"markdown": "---\ntitle: \"DNA Block - Problem Set 21\"\n---\n\n\n## Problem Set\n\nIn this problem set you'll examine the binding sites of the Reb1 transcription factor by CUT&RUN.\n\nThe first several chunks just run, putting libraries and files in your environment\n\nEach question below is worth 4 points.\n\n### Setup \n\nStart by loading libraries you need analysis in the code chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nlibrary(valr)\nlibrary(cowplot)\n\n# genome viz\nlibrary(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)\nlibrary(Gviz)\nlibrary(rtracklayer)\n\n# motif discovery and viz\nlibrary(BSgenome.Scerevisiae.UCSC.sacCer3)\nlibrary(rGADEM)\nlibrary(seqLogo)\n```\n:::\n\n\nNext, load the coverage tracks for CUT&RUN data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrack_start <- 1e5 \ntrack_end <- 2e5 \n\n# genes track\nsgd_genes_trk <-\n GeneRegionTrack(\n TxDb.Scerevisiae.UCSC.sacCer3.sgdGene,\n chromosome = \"chrII\",\n start = track_start,\n end = track_end,\n fill = \"#009E73\",\n background.title = \"white\",\n col.title = \"black\",\n fontsize = 14\n )\n\n# signal tracks\ntrack_info <-\n tibble(\n file_name = c(\n \"CutRun_Reb1_lt120.bw\",\n \"CutRun_Abf1_lt120.bw\",\n \"CutRun_Reb1_gt150.bw\",\n \"CutRun_Abf1_gt150.bw\"\n ),\n sample_type = c(\n \"Reb1_Short\", \"Abf1_Short\",\n \"Reb1_Long\", \"Abf1_Long\"\n )\n ) |>\n mutate(\n file_path = here(\"data/block-dna\", file_name),\n big_wig = purrr::map(\n file_path, ~ import.bw(.x, as = \"GRanges\")\n ),\n data_track = purrr::map2(\n big_wig, sample_type,\n ~ DataTrack(\n .x,\n name = .y,\n background.title = \"white\",\n col.title = \"black\",\n col.axis = \"black\",\n fontsize = 12\n )\n )\n ) |>\n dplyr::select(sample_type, big_wig, data_track)\n\n# x-axis track\nx_axis_trk <- GenomeAxisTrack(\n col = \"black\",\n col.axis = \"black\",\n fontsize = 16\n)\n```\n:::\n\n\nNow that we have tracks loaded, we can make a plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotTracks(\n c(\n sgd_genes_trk,\n track_info$data_track,\n x_axis_trk\n ),\n from = track_start,\n to = track_end,\n chromosome = \"chrII\",\n transcriptAnnotation = \"gene\",\n shape = \"arrow\",\n type = \"histogram\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-21_files/figure-html/plot-tracks-1.png){width=672}\n:::\n:::\n\n\n### Question 1\n\nHow do you interpret the differences in signals between the short and long fragments above? I.e.,\nwhere are the short fragments enriched? And where do you see more of the long fragments?\n\n> Short fragments generated from MNase digestion of transcription factor complexes are enriched\nin paromoter regions. Gene bodies contain more longer fragments, where nucleosomes are present\nand yield longer digestion products.\n\n### Question 2\n\nRemake the plot above, but zoom in to a promoter region that has strong enrichment\nfor both Reb1 and Abf1 (short fragments).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotTracks(\n c(\n sgd_genes_trk,\n track_info$data_track,\n x_axis_trk\n ),\n from = 160000,\n to = 170000,\n chromosome = \"chrII\",\n transcriptAnnotation = \"gene\",\n shape = \"arrow\",\n type = \"histogram\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-21_files/figure-html/plot-tracks-zoom-1.png){width=672}\n:::\n:::\n\n\nDo the signals for Reb1 and Abf1 line up with one another? Why is this the case? \n\n> In this region, the major peaks of Reb1 and Abf1 are near but do not overlap, consistent\nwith these factors requiring specific DNA sequences for binding.\n\n### Question 3\n\nNext we'll take a look at Reb1 CUT&RUN data. In the following chunk, use the approach\nwe took in class to identify enriched sites of Reb1 binding. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nreb1_tbl <- read_bigwig(here(\"data/block-dna/CutRun_Reb1_lt120.bw\"))\n\n# number of reads in the original Reb1 BED file\ntotal_reads <- 16e6\n\ngenome <- read_genome(here(\"data/block-dna/sacCer3.chrom.sizes\"))\n\n# how can we calculate genome size?\ngenome_size <- sum(genome$size)\n\n# defind the genome-wide lambda value here\ngenome_lambda <- total_reads / genome_size\n\npeak_calls <-\n reb1_tbl |>\n # define single-base sites\n mutate(\n midpoint = start + round((end - start) / 2),\n start = midpoint,\n end = start + 1,\n # use the poisson to calculate a p-value with the genome-wide lambda\n pval = dpois(score, genome_lambda),\n # convert p-values to FDR\n fdr = p.adjust(pval, method = \"fdr\")\n )\n```\n:::\n\n\nLet's take a look at a plot of the p-value across a chromosome.\n\nUse `geom_hline()` to draw a red horizontal line at a cutoff that selects ~10 regions \nenriched for Reb1 binding. You'll use this cutoff in the code chunks below. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(\n filter(peak_calls, chrom == \"chrII\"),\n # convert p-values to positive values for plotting\n aes(start, -log10(pval))\n) + \n geom_line() +\n xlim(track_start, track_end) +\n theme_cowplot() +\n geom_hline(yintercept = 20, color = 'red')\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 55930 rows containing missing values (`geom_line()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](ps-key-21_files/figure-html/plot-peak-sig-1.png){width=672}\n:::\n:::\n\n### Question 4\n\nHow many peaks are called in this region? Use the cutoff you defined above to identify \n\"peaks\" of Reb1 binding.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npeak_calls |> \n filter(-log10(pval) >= 20) |>\n filter(start >= track_start & end <= track_end) |>\n bed_merge(max_dist = 20) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 261\n```\n\n\n:::\n:::\n\n\nHow many *total* peaks are identified in the genome using this cutoff?\n\n\n::: {.cell}\n\n```{.r .cell-code}\npeak_calls |> \n filter(-log10(pval) >= 20) |>\n bed_merge(max_dist = 20) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2568\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# most stringent cut-off\npeak_calls_sig <- \n filter(\n peak_calls,\n -log10(pval) >= 20, \n ) |>\n # collapse neighboring, significant sites\n bed_merge(max_dist = 20)\n\nfilter(\n peak_calls_sig,\n chrom == \"chrII\" &\n start >= track_start &\n end <= track_end\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 28 x 3\n chrom start end\n <chr> <int> <int>\n 1 chrII 111281 111371\n 2 chrII 116933 117055\n 3 chrII 122242 122335\n 4 chrII 124885 125083\n 5 chrII 132274 132369\n 6 chrII 135587 135638\n 7 chrII 135696 136060\n 8 chrII 137335 137424\n 9 chrII 139151 139182\n10 chrII 142981 143030\n# i 18 more rows\n```\n\n\n:::\n:::\n\n\nLet's visualize these peaks in the context of genomic CUT&RUN signal. We need to define an `AnnotationTrack` with the peak intervals, which we can plot against the CUT&RUN coverage we defined above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# need a GRanges object to convert to an AnnotationTrack\npeak_calls_gr <-\n GRanges(\n seqnames = peak_calls_sig$chrom,\n ranges = IRanges(peak_calls_sig$start, peak_calls_sig$end)\n )\n\npeak_calls_trk <-\n AnnotationTrack(\n peak_calls_gr,\n name = \"Peak calls\",\n fill = \"red\",\n background.title = \"white\",\n col.title = \"red\",\n fontsize = 16,\n rotation.title = 0\n )\n\nreb1_short_trk <-\n filter(\n track_info,\n sample_type == \"Reb1_Short\"\n ) |>\n pull(data_track)\n\nplotTracks(\n c(\n sgd_genes_trk,\n reb1_short_trk,\n peak_calls_trk,\n x_axis_trk\n ),\n from = track_start,\n to = track_end,\n chromosome = \"chrII\",\n transcriptAnnotation = \"gene\",\n shape = \"arrow\",\n type = \"histogram\"\n)\n```\n\n::: {.cell-output-display}\n![](ps-key-21_files/figure-html/plot-peaks-1.png){width=672}\n:::\n:::\n\n\n### Question 5\n\nUse the peak calls you defined to identify a putative sequence motif bound by Reb1. You can\nassume that the most abundant motif identified is the most likely candidate.\n\nUse Google and Pubmed to identify a study that defines a Reb1 motif using a\ngenomewide analysis. How well does your motif match the previously defined one?\n\n> The identified Reb1 motif is a good match to several studies including PMID 19305407 and 28079019.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npeak_seqs <- BSgenome::getSeq(\n # provided by BSgenome.Scerevisiae.UCSC.sacCer3\n Scerevisiae,\n peak_calls_gr\n)\n\n# takes ~2 minutes to run\ngadem <- rGADEM::GADEM(\n peak_seqs,\n genome = Scerevisiae,\n verbose = 1\n)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# look at the consensus motifs\nconsensus(gadem)\n```\n:::\n\n\n`[1] \"nmGGGTAAy\" \"nTTTTTTTTTTTn\" \"nCGGGrnCCssvmAsGrGn\" \"mmmmmmmmmmCCACACCCACACmC\" \"nTwwwTwwTwAhTwwTTATw\" \"nAmwAAwAhTAmATmAAAAn\" \"yTywCymTACAACwnTTyn\"`\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# how many consensus motifs are there?\nnOccurrences(gadem)\n```\n:::\n\n\n`[1] 1438 541 290 315 234 201 240`\n\nNow let's look at the sequence logo for the top hit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npwm <- gadem@motifList[[1]]@pwm\n\nseqLogo::seqLogo(pwm)\n```\n:::\n\n\n![](../img/block-dna/reb1-seq-logo.png)\n", | ||
"supporting": [ | ||
"ps-key-21_files" | ||
], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,343 @@ | ||
--- | ||
title: "DNA Block - Problem Set 21" | ||
--- | ||
|
||
## Problem Set | ||
|
||
In this problem set you'll examine the binding sites of the Reb1 transcription factor by CUT&RUN. | ||
|
||
The first several chunks just run, putting libraries and files in your environment | ||
|
||
Each question below is worth 4 points. | ||
|
||
### Setup | ||
|
||
Start by loading libraries you need analysis in the code chunk below. | ||
|
||
```{r} | ||
#| label: load-libs | ||
#| message: false | ||
library(tidyverse) | ||
library(here) | ||
library(valr) | ||
library(cowplot) | ||
# genome viz | ||
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene) | ||
library(Gviz) | ||
library(rtracklayer) | ||
# motif discovery and viz | ||
library(BSgenome.Scerevisiae.UCSC.sacCer3) | ||
library(rGADEM) | ||
library(seqLogo) | ||
``` | ||
|
||
Next, load the coverage tracks for CUT&RUN data. | ||
|
||
```{r} | ||
#| label: make-tracks | ||
track_start <- 1e5 | ||
track_end <- 2e5 | ||
# genes track | ||
sgd_genes_trk <- | ||
GeneRegionTrack( | ||
TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, | ||
chromosome = "chrII", | ||
start = track_start, | ||
end = track_end, | ||
fill = "#009E73", | ||
background.title = "white", | ||
col.title = "black", | ||
fontsize = 14 | ||
) | ||
# signal tracks | ||
track_info <- | ||
tibble( | ||
file_name = c( | ||
"CutRun_Reb1_lt120.bw", | ||
"CutRun_Abf1_lt120.bw", | ||
"CutRun_Reb1_gt150.bw", | ||
"CutRun_Abf1_gt150.bw" | ||
), | ||
sample_type = c( | ||
"Reb1_Short", "Abf1_Short", | ||
"Reb1_Long", "Abf1_Long" | ||
) | ||
) |> | ||
mutate( | ||
file_path = here("data/block-dna", file_name), | ||
big_wig = purrr::map( | ||
file_path, ~ import.bw(.x, as = "GRanges") | ||
), | ||
data_track = purrr::map2( | ||
big_wig, sample_type, | ||
~ DataTrack( | ||
.x, | ||
name = .y, | ||
background.title = "white", | ||
col.title = "black", | ||
col.axis = "black", | ||
fontsize = 12 | ||
) | ||
) | ||
) |> | ||
dplyr::select(sample_type, big_wig, data_track) | ||
# x-axis track | ||
x_axis_trk <- GenomeAxisTrack( | ||
col = "black", | ||
col.axis = "black", | ||
fontsize = 16 | ||
) | ||
``` | ||
|
||
Now that we have tracks loaded, we can make a plot. | ||
|
||
```{r} | ||
#| label: plot-tracks | ||
plotTracks( | ||
c( | ||
sgd_genes_trk, | ||
track_info$data_track, | ||
x_axis_trk | ||
), | ||
from = track_start, | ||
to = track_end, | ||
chromosome = "chrII", | ||
transcriptAnnotation = "gene", | ||
shape = "arrow", | ||
type = "histogram" | ||
) | ||
``` | ||
|
||
### Question 1 | ||
|
||
How do you interpret the differences in signals between the short and long fragments above? I.e., | ||
where are the short fragments enriched? And where do you see more of the long fragments? | ||
|
||
> Short fragments generated from MNase digestion of transcription factor complexes are enriched | ||
in paromoter regions. Gene bodies contain more longer fragments, where nucleosomes are present | ||
and yield longer digestion products. | ||
|
||
### Question 2 | ||
|
||
Remake the plot above, but zoom in to a promoter region that has strong enrichment | ||
for both Reb1 and Abf1 (short fragments). | ||
|
||
```{r} | ||
#| label: plot-tracks-zoom | ||
plotTracks( | ||
c( | ||
sgd_genes_trk, | ||
track_info$data_track, | ||
x_axis_trk | ||
), | ||
from = 160000, | ||
to = 170000, | ||
chromosome = "chrII", | ||
transcriptAnnotation = "gene", | ||
shape = "arrow", | ||
type = "histogram" | ||
) | ||
``` | ||
|
||
Do the signals for Reb1 and Abf1 line up with one another? Why is this the case? | ||
|
||
> In this region, the major peaks of Reb1 and Abf1 are near but do not overlap, consistent | ||
with these factors requiring specific DNA sequences for binding. | ||
|
||
### Question 3 | ||
|
||
Next we'll take a look at Reb1 CUT&RUN data. In the following chunk, use the approach | ||
we took in class to identify enriched sites of Reb1 binding. | ||
|
||
```{r} | ||
#| label: peak-calling | ||
reb1_tbl <- read_bigwig(here("data/block-dna/CutRun_Reb1_lt120.bw")) | ||
# number of reads in the original Reb1 BED file | ||
total_reads <- 16e6 | ||
genome <- read_genome(here("data/block-dna/sacCer3.chrom.sizes")) | ||
# how can we calculate genome size? | ||
genome_size <- sum(genome$size) | ||
# defind the genome-wide lambda value here | ||
genome_lambda <- total_reads / genome_size | ||
peak_calls <- | ||
reb1_tbl |> | ||
# define single-base sites | ||
mutate( | ||
midpoint = start + round((end - start) / 2), | ||
start = midpoint, | ||
end = start + 1, | ||
# use the poisson to calculate a p-value with the genome-wide lambda | ||
pval = dpois(score, genome_lambda), | ||
# convert p-values to FDR | ||
fdr = p.adjust(pval, method = "fdr") | ||
) | ||
``` | ||
|
||
Let's take a look at a plot of the p-value across a chromosome. | ||
|
||
Use `geom_hline()` to draw a red horizontal line at a cutoff that selects ~10 regions | ||
enriched for Reb1 binding. You'll use this cutoff in the code chunks below. | ||
|
||
```{r} | ||
#| label: plot-peak-sig | ||
ggplot( | ||
filter(peak_calls, chrom == "chrII"), | ||
# convert p-values to positive values for plotting | ||
aes(start, -log10(pval)) | ||
) + | ||
geom_line() + | ||
xlim(track_start, track_end) + | ||
theme_cowplot() + | ||
geom_hline(yintercept = 20, color = 'red') | ||
``` | ||
### Question 4 | ||
|
||
How many peaks are called in this region? Use the cutoff you defined above to identify | ||
"peaks" of Reb1 binding. | ||
|
||
```{r} | ||
peak_calls |> | ||
filter(-log10(pval) >= 20) |> | ||
filter(start >= track_start & end <= track_end) |> | ||
bed_merge(max_dist = 20) |> | ||
nrow() | ||
``` | ||
|
||
How many *total* peaks are identified in the genome using this cutoff? | ||
|
||
```{r} | ||
peak_calls |> | ||
filter(-log10(pval) >= 20) |> | ||
bed_merge(max_dist = 20) |> | ||
nrow() | ||
``` | ||
|
||
```{r} | ||
#| label: n-peaks | ||
# most stringent cut-off | ||
peak_calls_sig <- | ||
filter( | ||
peak_calls, | ||
-log10(pval) >= 20, | ||
) |> | ||
# collapse neighboring, significant sites | ||
bed_merge(max_dist = 20) | ||
filter( | ||
peak_calls_sig, | ||
chrom == "chrII" & | ||
start >= track_start & | ||
end <= track_end | ||
) | ||
``` | ||
|
||
Let's visualize these peaks in the context of genomic CUT&RUN signal. We need to define an `AnnotationTrack` with the peak intervals, which we can plot against the CUT&RUN coverage we defined above. | ||
|
||
```{r} | ||
#| label: plot-peaks | ||
# need a GRanges object to convert to an AnnotationTrack | ||
peak_calls_gr <- | ||
GRanges( | ||
seqnames = peak_calls_sig$chrom, | ||
ranges = IRanges(peak_calls_sig$start, peak_calls_sig$end) | ||
) | ||
peak_calls_trk <- | ||
AnnotationTrack( | ||
peak_calls_gr, | ||
name = "Peak calls", | ||
fill = "red", | ||
background.title = "white", | ||
col.title = "red", | ||
fontsize = 16, | ||
rotation.title = 0 | ||
) | ||
reb1_short_trk <- | ||
filter( | ||
track_info, | ||
sample_type == "Reb1_Short" | ||
) |> | ||
pull(data_track) | ||
plotTracks( | ||
c( | ||
sgd_genes_trk, | ||
reb1_short_trk, | ||
peak_calls_trk, | ||
x_axis_trk | ||
), | ||
from = track_start, | ||
to = track_end, | ||
chromosome = "chrII", | ||
transcriptAnnotation = "gene", | ||
shape = "arrow", | ||
type = "histogram" | ||
) | ||
``` | ||
|
||
### Question 5 | ||
|
||
Use the peak calls you defined to identify a putative sequence motif bound by Reb1. You can | ||
assume that the most abundant motif identified is the most likely candidate. | ||
|
||
Use Google and Pubmed to identify a study that defines a Reb1 motif using a | ||
genomewide analysis. How well does your motif match the previously defined one? | ||
|
||
> The identified Reb1 motif is a good match to several studies including PMID 19305407 and 28079019. | ||
```{r} | ||
#| label: motif-discovery | ||
#| eval: false | ||
peak_seqs <- BSgenome::getSeq( | ||
# provided by BSgenome.Scerevisiae.UCSC.sacCer3 | ||
Scerevisiae, | ||
peak_calls_gr | ||
) | ||
# takes ~2 minutes to run | ||
gadem <- rGADEM::GADEM( | ||
peak_seqs, | ||
genome = Scerevisiae, | ||
verbose = 1 | ||
) | ||
``` | ||
|
||
|
||
```{r} | ||
#| eval: false | ||
# look at the consensus motifs | ||
consensus(gadem) | ||
``` | ||
|
||
`[1] "nmGGGTAAy" "nTTTTTTTTTTTn" "nCGGGrnCCssvmAsGrGn" "mmmmmmmmmmCCACACCCACACmC" "nTwwwTwwTwAhTwwTTATw" "nAmwAAwAhTAmATmAAAAn" "yTywCymTACAACwnTTyn"` | ||
|
||
```{r} | ||
#| eval: false | ||
# how many consensus motifs are there? | ||
nOccurrences(gadem) | ||
``` | ||
|
||
`[1] 1438 541 290 315 234 201 240` | ||
|
||
Now let's look at the sequence logo for the top hit. | ||
|
||
```{r} | ||
#| label: plot-logo | ||
#| eval: false | ||
pwm <- gadem@motifList[[1]]@pwm | ||
seqLogo::seqLogo(pwm) | ||
``` | ||
|
||
![](../img/block-dna/reb1-seq-logo.png) |