Skip to content

Commit d507a08

Browse files
committed
Updated readme and signif function
1 parent 1e122f4 commit d507a08

File tree

6 files changed

+760
-9
lines changed

6 files changed

+760
-9
lines changed

R/signif_eigen_windows.R

+19-3
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,8 @@
44
#' @param eigen_res A list of results from eigen_analyse_vectors(). This can be easily created by combining eigen_analyse_vectors with lapply. Each element of the input list should correspond to the eigen decomposition results for a given genome window.
55
#' @param cutoffs Numeric vector of alpha thresholds for null distribution cutoffs. These must be equal in length to the number of eigenvalues, such that the first element is the threshold for eigenvalue 1, the second for eigevalue 2 e.t.c.
66
#'
7+
#' @importFrom stats p.adjust
8+
#'
79
#' @return A list of character vectors, where each element of the list represents the significant windows associated with each eigenvalue.
810
#' @export
911
signif_eigen_windows <- function(eigen_res = NULL,
@@ -12,6 +14,17 @@ signif_eigen_windows <- function(eigen_res = NULL,
1214
if(is.numeric(cutoffs) == FALSE){
1315
stop("Error: cutoffs should be a numeric vector with a cutoff per eigenvector")
1416
}
17+
# if(!(is.null(pval_alpha)) & is.null(null_vectors)){
18+
# stop("Error: If using p-values, please provide null_vectors for calculation")
19+
# }
20+
21+
# # Calculate pvalues if we are using them
22+
# if(!(is.null(pval_alpha))){
23+
# tmp_pvals <- eigen_pvals(eigen_res,null_vectors)
24+
# for(i in 1:ncol(tmp_pvals)){
25+
# tmp_pvals[,i] <- p.adjust(tmp_pvals[,i],method=p.adjust.method)
26+
# }
27+
# }
1528

1629
# Fetch sums
1730
eigen_sums <- sum_eigenvals(eigen_res)
@@ -20,8 +33,11 @@ signif_eigen_windows <- function(eigen_res = NULL,
2033
eigenvec_out <- lapply(1:ncol(eigen_sums),function(vec){
2134

2235
# Fetch the significant
23-
signif_windows <- rownames(eigen_sums[eigen_sums[,vec] > cutoffs[vec],,drop=FALSE])
24-
36+
# if(is.null(pval_alpha)){
37+
signif_windows <- rownames(eigen_sums[eigen_sums[,vec] > cutoffs[vec],,drop=FALSE])
38+
# } else {
39+
# signif_windows <- rownames(tmp_pvals[tmp_pvals[,vec] <= pval_alpha,])
40+
# }
2541
# Return
2642
if(length(signif_windows) > 0){
2743
return(signif_windows)
@@ -36,6 +52,6 @@ signif_eigen_windows <- function(eigen_res = NULL,
3652
}
3753

3854
# Set names
39-
names(eigenvec_out) <- paste0("Eigenvector ",1:length(cutoffs))
55+
names(eigenvec_out) <- paste0("Eigenvector ",1:length(eigenvec_out))
4056
return(eigenvec_out)
4157
}

README.Rmd

+205-6
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,9 @@
22
title: "README"
33
author: "James Whiting"
44
date: "14/04/2021"
5-
output: html_document
5+
output:
6+
md_document:
7+
variant: markdown_github
68
---
79

810
```{r setup, include=FALSE}
@@ -14,6 +16,11 @@ knitr::opts_chunk$set(echo = TRUE)
1416

1517
The aim of this software is to quantify and compare parallel genotype change based on allele frequency changes between population pairs. The software produces a series of allele frequency matrices with n (number of snps per window) columns and m (number of population pairs) rows. Within each matrix, we perform eigen analysis to quantify skew in the eigenvalue distribution. An eigenvalue distribution with elevated eigenvalues on the first eigenvectors is indicative of a common trajectory through multivariate space for more than one population pair (ie. parallel or antiparallel genotype change in multiple populations). Multiple eigenvectors with elevated eigenvalues is indicative of multiple parallel/antiparallel axes within a genomic region, highlighting nonparallelism.
1618

19+
### Load from development dir
20+
```{r}
21+
devtools::load_all("~/Exeter/afvaper/")
22+
```
23+
1724
### Parallelism Definitions:
1825
Parallel - Change along the same axis in the same direction.
1926
Antiparallel - Change along the same axis in the opposite direction.
@@ -23,7 +30,7 @@ Nonparallel - Change along an alternative axis.
2330
### Make the inputs
2431
The software runs in R and requires the following inputs:
2532

26-
* A vcfR object of a chromosome, eg:
33+
* A vcfR object of a chromosome, eg:
2734
```{r}
2835
library(vcfR)
2936
vcf_path <- "~/Exeter/VCFs/five_aside_STAR_3033083_final.vcf.gz"
@@ -39,7 +46,7 @@ system(paste0("bcftools view -r ",chr," ",vcf_path," > ",vcf_tempfile), wait=TRU
3946
vcf_input_chr1 <- read.vcfR(vcf_tempfile,verbose = F)
4047
vcf_input_chr1
4148
```
42-
* A popmap, with individuals in column 1 and populations in column 2, eg.
49+
* A popmap, with individuals in column 1 and populations in column 2, eg.
4350
```{r}
4451
# We can fetch individuals from the vcfR object
4552
popmap <- data.frame(inds = colnames(vcf_input_chr1@gt)[2:ncol(vcf_input_chr1@gt)])
@@ -52,8 +59,200 @@ popmap$pop <- gsub('_M', '', popmap$pop)
5259
# Show popmap
5360
head(popmap)
5461
```
55-
* A list of vectors describing parallel replicates, eg.
62+
* A list of vectors describing parallel replicates. These are based on population identifiers in the popmap, and in all cases the first and second population should be in the same direction, i.e. for Ecotype1 vs Ecotype2, all first populations should be Ecotype1 and all second populations should be Ecotype2. Provide names for each replicate for downstream summaries. To build the vector list:
63+
```{r}
64+
# View populations in the popmap
65+
unique(popmap$pop)
66+
67+
# Build vector list
68+
vector_list <- list(c("LT","UT"),
69+
c("GH","GL"),
70+
c("APHP","APLP"),
71+
c("LO","UQ"),
72+
c("LMD","UMD"))
73+
74+
# Name vectors
75+
names(vector_list) <- c("Tacarigua","Guanapo","Aripo","Oropuche","Madamas")
76+
77+
# Show vector list
78+
vector_list
79+
```
80+
### Calculate Allele Frequency Change Vectors
81+
In order to analyse our VCF, we need to transform the inputs into a list allele frequency changes matrices, where each row is the normalised allele frequency change vector for one of our replicates, for N SNPs (where N is our window size). This is handled with the `calc_AF_vectors()` function:
82+
```{r}
83+
# Set our window size
84+
window_snps = 200
85+
86+
# Calculate Allele Frequency Change Vector Matrices
87+
AF_input <- calc_AF_vectors(vcf = vcf_input_chr1,
88+
window_size = window_snps,
89+
popmap = popmap,
90+
vectors = vector_list,
91+
n_cores = 4)
92+
93+
# Show features of input...
94+
print(paste0("Number of windows = ",length(AF_input)))
95+
print(paste0("Number of SNPs per window = ",ncol(AF_input[[1]])))
96+
print(paste0("Number of vectors per window = ",nrow(AF_input[[1]])))
97+
```
98+
99+
We also want to build a set of null vectors, in which the population assignment is shuffled among individuals and new vectors are calculated using the same structure as the originals. We do this using the same function, but pass a `null_perms` value to describe how many random vectors to calculate:
100+
101+
```{r}
102+
# How many permutations to run
103+
null_perm_N = 1000
104+
105+
# Calculate Allele Frequency Change Vector Matrices
106+
null_input <- calc_AF_vectors(vcf = vcf_input_chr1,
107+
window_size = window_snps,
108+
popmap = popmap,
109+
vectors = vector_list,
110+
n_cores = 4,
111+
null_perms = null_perm_N)
112+
113+
# Show features of input...
114+
print(paste0("Number of null windows = ",length(null_input)))
115+
print(paste0("Number of SNPs per window = ",ncol(null_input[[1]])))
116+
print(paste0("Number of vectors per window = ",nrow(null_input[[1]])))
117+
```
118+
119+
Note here that 1000 null permutations is on the low size, and a value closer to 10,000+ would be better. In practise, given it is easiest to run these functions over separate chromosomes, we can aim for a total of 10,000 null vectors from across all chromosomes and combine null vectors calculated on individual chromosomes. The easiest way to do this is to divide the desired total of 10,000 (or more) up between chromosomes based on the relative sizes (this info is available in a genome fasta index for e.g.). If we store each per-chromosome set of `null_input` vectors to a list, we can easily merge these into a single list with: `merged_null_input <- unlist(all_chr_null_input_list,recursive=FALSE)`. This `merged_null_input` can then be used as detailed below.
120+
121+
### Perform Eigen Analysis Over Allele Frequency Matrices
122+
Taking the `AF_input` data, we can perform eigen analyses over all matrices with a single command:
123+
```{r}
124+
# Perform eigen analysis
125+
eigen_res <- lapply(AF_input,eigen_analyse_vectors)
126+
```
127+
128+
The `eigen_res` output is a list containing, for each matrix, the eigenvalue distribution, the eigenvector loadings, and the projected A matrix that shows per-SNP scores for each eigenvector. The chromosomal regions are stored in `names(eigen_res)`:
129+
```{r}
130+
# View chromosomal regions:
131+
head(names(eigen_res))
132+
```
133+
```{r}
134+
# View eigenvalue distribution of first matrix
135+
eigen_res[[1]]$eigenvals
136+
```
137+
```{r}
138+
# View eigenvector loadings of first matrix
139+
eigen_res[[1]]$eigenvecs
140+
```
141+
```{r}
142+
# View head of SNP scores
143+
head(eigen_res[[1]]$A_matrix)
144+
```
145+
146+
### Merge results from different chromosomes
147+
Having calculated `eigen_res` across different chromosomes, we can then combine these by listing them out and reducing down the list:
148+
149+
```
150+
eigen_res_list <- list(eigen_res_chr1,eigen_res_chr2,eigen_res_chr3)
151+
allchr_eigen_res <- merge_eigen_res(eigen_res_list)
152+
```
153+
154+
If running over multiple chromosomes, it is recommended to do merge results across chromosomes before proceeding onto analysis...
155+
156+
### Find Null Cutoffs
157+
Using our `null_input`, we can output a matrix containing the cutoff expectations from the null distribution for each eigenvector for various significance thresholds:
158+
```{r}
159+
# Get cutoffs for 95%, 99% and 99.9%
160+
null_cutoffs <- find_null_cutoff(null_input,cutoffs = c(0.95,0.99,0.999))
161+
null_cutoffs
162+
```
163+
164+
Here, the values are summed through eigenvalues, such that the value for Eigenvector 2 is the sum of Eigenvalues 1 + 2, and the maximum value is equivalent to the number of replicates (`length(vector_list)`).
165+
166+
### Calculate p-values
167+
We can also calculate one-tailed p-values by comparing our observed eigenvalues to the null distribution. These are conceptually similar to using cutoffs, i.e. anything above the 95% cutoff outlined above will have a p-value < 0.05, however they can be useful for removing noise and highlighting peaks. These are calculated using the `eigen_res` results and the `null_input` vectors, and as output we get p-values for each eigenvector and window.
168+
```{r}
169+
# Calculate p-vals
170+
pvals <- eigen_pvals(eigen_res,null_input)
171+
172+
# Show lowest pvals
173+
head(pvals)
174+
```
175+
176+
Note: Because these p-values are calculated by comparing to the null distribution, they are bounded by the number of permutations. For e.g. if you run 1000 permuations, the lowest possible p-value for an observed value that exceeds all 1000 of the null values is simply (0+1/1000+1) which is ~ 0.001. If we ran 10,000 permutations and again had an observed eigenvalue greater than all 10,000 null values, this would be ~ 0.0001 e.t.c.
177+
178+
Consider this when interpreting p-values.
179+
180+
### Plot Eigenvalues Along Chromosomes
181+
We can plot the eigenvalues for each chromosome, either plotting the raw eigenvalues or the associated p-values. The output of `eigenval_plot()` gives a figure for each eigenvector:
182+
```{r}
183+
# Plot the raw eigenvalues, and visualise the cutoff of 99%
184+
all_plots <- eigenval_plot(eigen_res,cutoffs = null_cutoffs[,"99%"])
185+
186+
# Show the plots for eigenvalue 1 and 2
187+
cowplot::plot_grid(all_plots[[1]],all_plots[[2]],ncol=1)
188+
```
189+
190+
Alternatively, we can plot the p-values:
191+
```{r}
192+
# Plot the raw eigenvalues, and visualise the cutoff of 99%
193+
all_plots_p <- eigenval_plot(eigen_res,null_vectors = null_input,plot.pvalues = T)
194+
195+
# Show the plots for eigenvalue 1 and 2
196+
cowplot::plot_grid(all_plots_p[[1]],all_plots_p[[2]],ncol=1)
197+
```
198+
199+
Note here, each of the plots is a ggplot object that can be extracted and edited however you like, for e.g. we can remove the title and change the colour with ease using standard ggplot syntax:
200+
```{r}
201+
# Pull the figure
202+
eig1_pval_fig <- all_plots_p[[1]]
203+
204+
# Edit
205+
eig1_pval_fig + theme(title = element_blank()) + geom_step(colour="blue")
206+
```
207+
208+
### Pull Significant Windows
209+
Using the `eigen_res` results and `null_cutoffs`, we can identify windows that exceed the null cutoffs and return a list of genome windows. Here, we will use a made-up set of null cutoffs just to demonstrate the format of the output here:
210+
```{r}
211+
# Make dummy cutoffs
212+
dummy_cutoffs <- c(2.5,3.5,4.8,4.9,5.0)
213+
214+
# Find dummy significant windows
215+
dummy_significant_windows <- signif_eigen_windows(eigen_res,dummy_cutoffs)
216+
217+
# Display 'outliers'
218+
dummy_significant_windows[1:4]
219+
```
220+
221+
### Summarise Outliers
222+
To make sense of outliers, we want to summarise various information within candidate windows. This includes: which of our replicate vectors are associated with the relevant eigenvector (i.e. which vectors load above a certain threshold onto eigenvector 1); and whether vectors are associated in the same (parallel) or different (antiparallel) direction.
223+
224+
We also want to know for genome regions that exhibited elevated eigenvalues on eigenvectors 2+, what is happening on all preceeding eigenvectors. This is because we are always summing the eigenvalues, and as such a large eigenvalue 2 sum reflects that most allele frequency variance is captured by the first 2 eigenvectors.
225+
226+
The `summarise_window_parallelism()` function takes this into account, and returns all of this information into an output table for interpretation. The user defines the loading cutoff (which is an absolute value):
56227
```{r}
57-
vector_list <- list(c("GH","GL"),
58-
)
228+
# Summarise parallel evolution in windows that are significant on eigenvector 1
229+
eig1_parallel <- summarise_window_parallelism(window_id = "chr1:12164434-12213430",
230+
eigen_res = eigen_res,
231+
loading_cutoff = 0.3,
232+
eigenvector = 1)
233+
# Show results
234+
eig1_parallel
59235
```
236+
237+
This output therefore tells us how many replicates are associated with the first eigenvector at this genomic region (using the `names(vector_list)` provided earlier), and lists which of these are parallel or anti-parallel.
238+
239+
If we want to look at multiple windows, we can simply loop over them, for e.g. here for eigenvector 2 outliers:
240+
```{r}
241+
eig2_parallel=NULL
242+
for(window in dummy_significant_windows[[2]]){
243+
244+
# Summarise
245+
window_tmp <- summarise_window_parallelism(window_id = window,
246+
eigen_res = eigen_res,
247+
loading_cutoff = 0.3,
248+
eigenvector = 2)
249+
250+
# Bind
251+
eig2_parallel <- rbind(eig2_parallel,window_tmp)
252+
}
253+
254+
# Show results
255+
eig2_parallel
256+
```
257+
258+
Here we have the same information, but with 2 rows per window for each eigenvector up to the focal eigenvector (in this case set as `eigenvector = 2`).

0 commit comments

Comments
 (0)