-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathgl.report.parent.offspring.r
295 lines (268 loc) · 11.9 KB
/
gl.report.parent.offspring.r
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
#' @name gl.report.parent.offspring
#' @title Identifies putative parent offspring within a population
#' @description
#' This script examines the frequency of pedigree inconsistent loci, that is,
#' those loci that are homozygotes in the parent for the reference allele, and
#' homozygous in the offspring for the alternate allele. This condition is not
#' consistent with any pedigree, regardless of the (unknown) genotype of the
#' other parent. The pedigree inconsistent loci are counted as an indication of
#' whether or not it is reasonable to propose the two individuals are in a
#' parent-offspring relationship.
#' @param x Name of the genlight object containing the SNP genotypes [required].
#' @param min.rdepth Minimum read depth to include in analysis [default 12].
#' @param min.reproducibility Minimum reproducibility to include in analysis
#' [default 1].
#' @param range Specifies the range to extend beyond the interquartile range for
#' delimiting outliers [default 1.5 interquartile ranges].
#' @param plot.out Creates a plot that shows the sex linked markers
#' [default TRUE].
#' @param plot_theme Theme for the plot. See Details for options
#' [default theme_dartR()].
#' @param plot_colors List of two color names for the borders and fill of the
#' plots [default two_colors].
#' @param save2tmp If TRUE, saves any ggplots and listings to the session
#' temporary directory (tempdir) [default FALSE].
#' @param verbose Verbosity: 0, silent or fatal errors; 1, begin and end; 2,
#' progress log; 3, progress and results summary; 5, full report
#' [default 2, unless specified using gl.set.verbosity].
#' @details
#' If two individuals are in a parent offspring relationship, the true number of
#' pedigree inconsistent loci should be zero, but SNP calling is not infallible.
#' Some loci will be miss-called. The problem thus becomes one of determining
#' if the two focal individuals have a count of pedigree inconsistent loci less
#' than would be expected of typical unrelated individuals. There are some quite
#' sophisticated software packages available to formally apply likelihoods to
#' the decision, but we use a simple outlier comparison.
#'
#' To reduce the frequency of miss-calls, and so emphasize the difference
#' between true parent-offspring pairs and unrelated pairs, the data can be
#' filtered on read depth.
#'
#' Typically minimum read depth is set to 5x, but you can examine the
#' distribution of read depths with the function \code{\link{gl.report.rdepth}}
#' and push this up with an acceptable loss of loci. 12x might be a good minimum
#' for this particular analysis. It is sensible also to push the minimum
#' reproducibility up to 1, if that does not result in an unacceptable loss of
#' loci. Reproducibility is stored in the slot \code{@other$loc.metrics$RepAvg}
#' and is defined as the proportion of technical replicate assay pairs for which
#' the marker score is consistent. You can examine the distribution of
#' reproducibility with the function \code{\link{gl.report.reproducibility}}.
#'
#' Note that the null expectation is not well defined, and the power reduced, if
#' the population from which the putative parent-offspring pairs are drawn
#' contains many sibs. Note also that if an individual has been genotyped twice
#' in the dataset, the replicate pair will be assessed by this script as being
#' in a parent-offspring relationship.
#'
#' The function \code{\link{gl.filter.parent.offspring}} will filter out those
#' individuals in a parent offspring relationship.
#'
#' Note that if your dataset does not contain RepAvg or rdepth among the locus
#' metrics, the filters for reproducibility and read depth are no used.
#'
#'\strong{ Function's output }
#'
#' Plots and table are saved to the temporal directory (tempdir) and can be
#' accessed with the function \code{\link{gl.print.reports}} and listed with
#' the function \code{\link{gl.list.reports}}. Note that they can be accessed
#' only in the current R session because tempdir is cleared each time that the
#' R session is closed.
#'
#' Examples of other themes that can be used can be consulted in \itemize{
#' \item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html} and \item
#' \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
#' }
#' @return A set of individuals in parent-offspring relationship. NULL if no
#' parent-offspring relationships were found.
#' @author Custodian: Arthur Georges (Post to
#' \url{https://groups.google.com/d/forum/dartr})
#' @examples
#' out <- gl.report.parent.offspring(testset.gl[1:10])
#' @seealso \code{\link{gl.list.reports}}, \code{\link{gl.report.rdepth}} ,
#' \code{\link{gl.print.reports}},\code{\link{gl.report.reproducibility}},
#' \code{\link{gl.filter.parent.offspring}}
#' @family reporting functions
#' @importFrom stats median IQR
#' @import patchwork
#' @export
gl.report.parent.offspring <- function(x,
min.rdepth = 12,
min.reproducibility = 1,
range = 1.5,
plot.out = TRUE,
plot_theme = theme_dartR(),
plot_colors = two_colors,
save2tmp = FALSE,
verbose = NULL) {
# SET VERBOSITY
verbose <- gl.check.verbosity(verbose)
# FLAG SCRIPT START
funname <- match.call()[[1]]
utils.flag.start(func = funname,
build = "Jody",
verbosity = verbose)
# CHECK DATATYPE
datatype <- utils.check.datatype(x, verbose = verbose)
# DO THE JOB
# Generate null expectation for pedigree inconsistency, and outliers
if (verbose >= 2) {
cat(
report(
" Generating null expectation for distribution of counts of pedigree incompatibility\n"
)
)
}
# Assign individuals as populations
pop(x) <- x$ind.names
# Filter stringently on reproducibility to minimize miscalls
if (is.null(x@other$loc.metrics$RepAvg)) {
cat(
warn(
" Dataset does not include RepAvg among the locus metrics, therefore the reproducibility filter was not used\n"
)
)
} else {
x <-
gl.filter.reproducibility(x, threshold = min.reproducibility, verbose = 0)
}
# Filter stringently on read depth, to further minimize miscalls
if (is.null(x@other$loc.metrics$rdepth)) {
cat(
warn(
" Dataset does not include rdepth among the locus metrics, therefore the read depth filter was not used\n"
)
)
} else {
x <- gl.filter.rdepth(x, lower = min.rdepth, verbose = 0)
}
# Preliminaries before for loops
x2 <- as.matrix(x)
split_vectors <- lapply(seq_len(nrow(x2)), function(i)
x2[i,])
names(split_vectors) <- popNames(x)
fun <- function(x, y) {
vect <- (x * 10) + y
homalts <- sum(vect == 2 | vect == 20, na.rm = T)
}
count <- sapply(split_vectors, function(vect1) {
sapply(split_vectors, function(vect2) {
fun(vect1, vect2)
})
})
counts <- count[lower.tri(count, diag = FALSE)]
# Prepare for plotting
if (verbose >= 2) {
cat(
report(
" Identifying outliers with lower than expected counts of pedigree inconsistencies\n"
)
)
}
title <-
paste0("SNP data (DArTSeq)\nCounts of pedigree incompatible loci per pair")
counts_plot <- as.data.frame(counts)
# Boxplot
p1 <-
ggplot(counts_plot, aes(y = counts)) + geom_boxplot(color = plot_colors[1], fill = plot_colors[2]) + coord_flip() + plot_theme + xlim(range = c(-1,
1)) + ylim(min(counts), max(counts)) + ylab(" ") + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + ggtitle(title)
outliers_temp <- ggplot_build(p1)$data[[1]]$outliers[[1]]
lower.extremes <-
outliers_temp[outliers_temp < stats::median(counts)]
if (length(lower.extremes) == 0) {
outliers <- NULL
} else {
outliers <- data.frame(Outlier = lower.extremes)
}
# Ascertain the identity of the pairs
if (verbose >= 2) {
cat(report(" Identifying outlying pairs\n"))
}
if (length(lower.extremes) > 0) {
tmp <- count
tmp[lower.tri(tmp)] = t(tmp)[lower.tri(tmp)]
for (i in 1:length(outliers$Outlier)) {
# Identify
tmp2 <- tmp[tmp == outliers$Outlier[i]]
outliers$ind1[i] <- popNames(x)[!is.na(tmp2)][1]
outliers$ind2[i] <- popNames(x)[!is.na(tmp2)][2]
# Z-scores
zscore <-
(mean(count, na.rm = TRUE) - outliers$Outlier[i]) / sd(count, na.rm = TRUE)
outliers$zscore[i] <- round(zscore, 2)
outliers$p[i] <-
round(pnorm(
mean = mean(count, na.rm = TRUE),
sd = sd(count, na.rm = TRUE),
q = outliers$zscore[i]
), 4)
}
# ordering by number of outliers
outliers <- outliers[order(outliers, decreasing = T),]
# removing duplicated values
outliers <- outliers[!duplicated(outliers),]
# removing NAs
outliers <- outliers[complete.cases(outliers),]
}
# Extract the quantile threshold
iqr <- stats::IQR(counts, na.rm = TRUE)
qth <- quantile(counts, 0.25, na.rm = TRUE)
cutoff <- qth - iqr * range
# Histogram
p2 <-
ggplot(counts_plot, aes(x = counts)) + geom_histogram(bins = 50,
color = plot_colors[1],
fill = plot_colors[2]) + geom_vline(xintercept = cutoff,
color = "red",
size = 1) + coord_cartesian(xlim = c(min(counts), max(counts))) + xlab("No. Pedigree incompatible") + ylab("Count") +
plot_theme
# Output the outlier loci
if (length(lower.extremes) == 0) {
cat(important(" No outliers detected\n"))
}
if (length(lower.extremes) > 0) {
outliers <- outliers[order(outliers$Outlier),]
if (verbose >= 3) {
print(outliers)
}
}
df <- outliers
# PRINTING OUTPUTS
if (plot.out) {
# using package patchwork
p3 <- (p1 / p2) + plot_layout(heights = c(1, 4))
print(p3)
}
# SAVE INTERMEDIATES TO TEMPDIR
# creating temp file names
if (save2tmp) {
if (plot.out) {
temp_plot <- tempfile(pattern = "Plot_")
match_call <-
paste0(names(match.call()),
"_",
as.character(match.call()),
collapse = "_")
# saving to tempdir
saveRDS(list(match_call, p3), file = temp_plot)
if (verbose >= 2) {
cat(report(" Saving the ggplot to session tempfile\n"))
}
}
temp_table <- tempfile(pattern = "Table_")
saveRDS(list(match_call, df), file = temp_table)
if (verbose >= 2) {
cat(report(" Saving tabulation to session tempfile\n"))
cat(
report(
" NOTE: Retrieve output files from tempdir using gl.list.reports() and gl.print.reports()\n"
)
)
}
}
# FLAG SCRIPT END
if (verbose >= 1) {
cat(report("Completed:", funname, "\n"))
}
# RETURN
invisible(x)
}