Skip to content

Commit

Permalink
make functions faster
Browse files Browse the repository at this point in the history
  • Loading branch information
mijangos81 committed Jun 30, 2021
1 parent 17ee531 commit 34a37c0
Show file tree
Hide file tree
Showing 2 changed files with 166 additions and 194 deletions.
275 changes: 114 additions & 161 deletions R/gl.filter.parent.offspring.r
Original file line number Diff line number Diff line change
@@ -1,217 +1,170 @@
#' Filter putative parent offspring within a population
#' @name gl.filter.parent.offspring
#'
#' This script removes individuals suspected of being related as parent-offspring.
#' It examines the frequency of pedigree inconsistent loci, that is,
#' @title Filter putative parent offspring within a population
#'
#' @description
#' This script removes individuals suspected of being related as parent-offspring,
#' using the output of the function \code{\link{gl.report.parent.offspring}}, which
#' examines the frequency of pedigree inconsistent loci, that is,
#' those loci that are homozygotes in the parent for the reference allele, and
#' homozygous in the offspring for the alternate allele. This condition is not
#' consistent with any pedigree, regardless of the (unkonwn) genotype of the other
#' consistent with any pedigree, regardless of the (unknown) genotype of the other
#' parent. The pedigree inconsistent loci are counted as an indication of whether
#' or not it is reasonable to propose the two individuals are in a parent-offspring
#' relationship.
#'
#' Obviously, if the two individuals are in a parent offspring relationship, the true
#'
#' @param x Name of the genlight object containing the SNP genotypes [required]
#' @param min.rdepth Minimum read depth to include in analysis [default = 12]
#' @param min.reproducibility Minimum reproducibility to include in analysis [default = 1]
#' @param range Specifies the range to extend beyond the interquartile range for delimiting outliers [default = 1.5 interquartile ranges]
#' @param rm.monomorphs If TRUE, remove monomorphic loci after filtering individuals [default FALSE].
#' @param plot_theme Theme for the plot. See Details for options [default theme_dartR()].
#' @param plot_colours List of two color names for the borders and fill of the
#' plots [default two_colors].
#' @param verbose Verbosity: 0, silent or fatal errors; 1, begin and end; 2, progress log ; 3, progress and results summary; 5, full report [default NULL, unless specified using gl.set.verbosity]
#'
#' @details
#' If two individuals are in a parent offspring relationship, the true
#' number of pedigree inconsistent loci should be zero, but SNP calling is not
#' infallible. Some loci will be mis-called. The problem thus becomes one of determining
#' infallible. Some loci will be miss-called. The problem thus becomes one of determining
#' if the two focal individuals have a count of pedigree inconsistent loci less than
#' would be expected of typical unrelated individuals. There are some quite sophisticated
#' software packages available to formally apply likelihoods to the decision, but we
#' use a simple outlier comparison.
#'
#' To reduce the frequency of mis-calls, and so emphasise the difference between true
#' To reduce the frequency of miss-calls, and so emphasize the difference between true
#' parent-offspring pairs and unrelated pairs, the data can be filtered on read depth.
#' Typically minimum read depth is set to 5x, but you can examine the distribution
#' of read depths with gl.report.rdepth() and push this up with an acceptable loss
#' of loci. 12x might be a good minimum for this particular analysis. It is sensible
#' also to push the minimum reproducibility up to 1, if that does not result in an
#' unacceptable loss of loci.
#' of read depths with the function \code{\link{gl.report.rdepth}} and push this up
#' with an acceptable loss of loci. 12x might be a good minimum for this particular
#' analysis. It is sensible also to push the minimum reproducibility up to 1, if
#' that does not result in an unacceptable loss of loci. Reproducibility is stored
#' in the slot \code{@other$loc.metrics$RepAvg} and is defined as the proportion
#' of technical replicate assay pairs for which the marker score is consistent.
#' You can examine the distribution of reproducibility with the function
#' \code{\link{gl.report.reproducibility}}.
#'
#' Note that the null expectation is not well defined, and the power reduced, if the
#' population from which the putative parent-offspring pairs are drawn contains
#' many sibs. Note also that if an individual has been genotyped twice in the dataset,
#' the replicate pair will be assessed by this script as being in a parent-offspring
#' relationship.
#'
#' You should run gl.report.parent.offspring() before filtering. Use this report to
#' You should run \code{\link{gl.report.parent.offspring}} before filtering. Use this report to
#' decide min.rdepth and min.reproducibility and assess impact on your dataset.
#'
#' Note that if your dataset does not contain RepAvg or rdepth among the locus metrics,
#' the filters for reproducibility and read depth are no used.
#'
#'\strong{ Function's output }
#'
#' @param x Name of the genlight object containing the SNP genotypes [required]
#' @param min.rdepth Minimum read depth to include in analysis [default = 12]
#' @param min.reproducibility Minimum reproducibility to include in analysis [default = 1]
#' @param range -- specifies the range to extend beyond the interquartile range for delimiting outliers [default = 1.5 interquartile ranges]
#' @param rm.monomorphs -- if TRUE, remove monomorphic loci after filtering individuals [default FALSE]
#' @param verbose -- verbosity: 0, silent or fatal errors; 1, begin and end; 2, progress log ; 3, progress and results summary; 5, full report [default 2 or as specified using gl.set.verbosity]
#' @return A set of individuals in parent-offspring relationship
#' Plots and table are saved to the temporal directory (tempdir) and can be accessed with the function \code{\link{gl.print.reports}} and listed with the function \code{\link{gl.list.reports}}. Note that they can be accessed only in the current R session because tempdir is cleared each time that the R session is closed.
#'
#' Examples of other themes that can be used can be consulted in \itemize{
#' \item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html} and \item
#' \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
#' }
#'
#' @return the filtered genlight object without A set of individuals in parent-offspring relationship. NULL if no parent-offspring relationships were found.
#'
#' @author Arthur Georges (Post to \url{https://groups.google.com/d/forum/dartr})
#'
#' @examples
#' out <- gl.filter.parent.offspring(testset.gl[1:10])
#'
#' @seealso \code{\link{gl.list.reports}}, \code{\link{gl.report.rdepth}} ,
#' \code{\link{gl.print.reports}},\code{\link{gl.report.reproducibility}},
#' \code{\link{gl.report.parent.offspring}}
#'
#' @family filter functions
#'
#' @importFrom stats median IQR
#'
#' @import patchwork
#'
#' @export
# @rawNamespace import(ggplot2, except = empty)
#'

gl.filter.parent.offspring <- function(x,
min.rdepth=12,
min.reproducibility=1,
range=1.5,
rm.monomorphs=FALSE,
plot_theme = theme_dartR(),
plot_colours = two_colors,
verbose=NULL) {

# TRAP COMMAND, SET VERSION
# TRAP COMMAND

funname <- match.call()[[1]]
build <- "Jacob"

# SET VERBOSITY
# SET VERBOSITY

if (is.null(verbose)){
if(!is.null(x@other$verbose)){
verbose <- x@other$verbose
} else {
verbose <- 2
}
}
verbose <- gl.check.verbosity(verbose)

if (verbose < 0 | verbose > 5){
cat(paste(" Warning: Parameter 'verbose' must be an integer between 0 [silent] and 5 [full report], set to 2\n"))
verbose <- 2
}
# CHECKS DATATYPE

datatype <- utils.check.datatype(x)

# FUNCTION SPECIFIC ERROR CHECKING

# FLAG SCRIPT START

if (verbose >= 1){
if(verbose==5){
cat("Starting",funname,"[ Build =",build,"]\n")
if (verbose >= 1) {
if (verbose == 5) {
cat(report("\n\nStarting", funname, "[ Build =",
build, "]\n\n"))
} else {
cat("Starting",funname,"\n")
cat(report("\n\nStarting", funname, "\n\n"))
}
}

# STANDARD ERROR CHECKING

if(class(x)!="genlight") {
stop("Fatal Error: genlight object required!\n")
}

if (all(x@ploidy == 1)){
stop("Fatal Error: Detected Presence/Absence (SilicoDArT) data, require SNP data\n")
} else if (all(x@ploidy == 2)){
if (verbose >= 2){cat(" Processing a SNP dataset\n")}
} else {
stop("Fatal Error: Ploidy must be universally 1 (fragment P/A data) or 2 (SNP data)!")
}

# SCRIPT SPECIFIC ERROR CHECKING

# DO THE JOB

# Hold the genlight object
hold <- x

# Generate null expectation for pedigree inconsistency, and outliers
if (verbose >= 2){
cat(" Generating null expectation for distribution of counts of pedigree incompatability\n")
}
# Assign individuals as populations
x <- gl.reassign.pop(x,as.pop="id",verbose=0)
# Filter stringently on reproducibility to minimize miscalls
x <- gl.filter.reproducibility(x,threshold=min.reproducibility,verbose=0)
# Filter stringently on read depth, to further minimize miscalls
x <- gl.filter.rdepth(x,lower=min.rdepth,verbose=0)
# Filter on call rate to reduce computation time
x <- gl.filter.callrate(x,threshold = 0.95,plot=FALSE,verbose=0)
# Preliminaries before for loops
nL <- nLoc(x)
nP <- nPop(x)
pN <- popNames(x)
count <- array(NA,dim=c(nP,nP))
row.names(count) <- popNames(x)
colnames(count) <- popNames(x)
# For pairs of individuals
for (i in 1:(nP-1)){
for (j in (i+1):nP){
pair <- gl.keep.pop(x,pop.list=c(pN[i],pN[j]),verbose=0)
mat <- as.matrix(pair)
vect <- mat[1,]*10+mat[2,]
homalts <- sum(vect==2 | vect==20, na.rm=T)
count[i,j] <- homalts
}
}
# Remove NAs
counts <- count[!is.na(count)]
# DO THE JOB

# Prepare for plotting

if (verbose >= 2){
cat( " Identifying outliers with lower than expected counts of pedigree inconsistencies\n")
}
title <- paste0("SNP data (DArTSeq)\nCounts of pedigree incompatable loci per pair")
# Save the prior settings for mfrow, oma, mai and pty, and reassign
op <- par(mfrow = c(2, 1), oma=c(1,1,1,1), mai=c(0.5,0.5,0.5,0.5),pty="m")
# Set margins for first plot
par(mai=c(1,0.5,0.5,0.5))

# Plot Box-Whisker plot
if (verbose >= 2) {cat(" Standard boxplot, no adjustment for skewness\n")}
whisker <- boxplot(counts, horizontal=TRUE, col='steelblue', range=range, main = title)
lower.extremes <- whisker$out[whisker$out < median(counts)]
if (length(lower.extremes)==0){
outliers <- NULL
} else {
outliers <- data.frame(Outlier=lower.extremes)
}
outliers <- gl.report.parent.offspring(x,
min.rdepth=min.rdepth,
min.reproducibility=min.reproducibility,
range=range,
plot_theme = plot_theme,
plot_colours = plot_colours,
verbose=verbose)
# Remove the outliers

# Ascertain the identity of the pairs
if (verbose >= 2){
cat( " Identifying outlying pairs\n")
ind_to_remove <- unique(outliers$ind1)
if(length(ind_to_remove)>0){
x <- gl.drop.ind(x,ind.list = ind_to_remove,verbose=verbose)
if (rm.monomorphs==TRUE){
x <- gl.filter.monomorphs(x,verbose=verbose)
}
tmp <- count
tmp[lower.tri(tmp)] = t(tmp)[lower.tri(tmp)]
for (i in 1:length(outliers$Outlier)){
# Identify
tmp2 <- tmp[tmp==outliers$Outlier[i]]
outliers$ind1[i] <- popNames(x)[!is.na(tmp2)][1]
outliers$ind2[i] <- popNames(x)[!is.na(tmp2)][2]
# Z-scores
zscore <- (mean(count,na.rm=TRUE)-outliers$Outlier[i])/sd(count,na.rm=TRUE)
outliers$zscore[i] <- round(zscore,2)
outliers$p[i] <- round(pnorm(mean=mean(count,na.rm=TRUE),sd=sd(count,na.rm=TRUE),q=outliers$zscore[i]),4)
# REPORT THE RESULTS
if(verbose>=2){
cat(" \nInitial number of individuals:",nInd(x),"\n")
cat(" Pairs of individuals in a parent offspring relationship:\n\n")
print(outliers)
cat(" \nIndividuals removed: ")
cat(ind_to_remove, sep='\n')
cat("\n")
}
}

# Extract the quantile threshold
iqr <- IQR(counts,na.rm = TRUE)
qth <- quantile(counts,0.25,na.rm=TRUE)
cutoff <- qth - iqr*range

# Set margins for second plot
par(mai=c(0.5,0.5,0,0.5))
#case no
if (length(outliers)==0) {
if(verbose>0){
cat(important("No individuals were found to be in parent offspring relationship, therefore the genlight object is returned unchanged.\n"))
}
}

# Plot Histogram
hist(counts, xlab="No. Pedigree Incompatable", col='steelblue',breaks=100, main=NULL)
abline(v=cutoff, col="red")

# Output the outlier loci
if (length(whisker$out)==0){
if (verbose >= 3){
cat(" No outliers detected\n")
}
} else {
outliers <- outliers[order(outliers$Outlier),]
if (verbose >= 3){
#cat(" Outliers detected -- \n")
print(outliers)
}
}

# Reset the par options
par(op)
# ADD ACTION TO HISTORY

# Remove the outliers
nh <- length(x@other$history)
x@other$history[[nh + 1]] <- match.call()

hold <- gl.drop.ind(hold,ind.list = unique(outliers$ind1),verbose=5)
if (rm.monomorphs==TRUE){
hold <- gl.filter.monomorphs(hold)
}
# FLAG SCRIPT END

# FLAG SCRIPT END

if (verbose > 0) {
cat("Completed:",funname,"\n")
if (verbose >= 1) {
cat(report("\n\nCompleted:", funname, "\n\n"))
}

# RETURN

return(x)

Expand Down
Loading

0 comments on commit 34a37c0

Please sign in to comment.