% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dada.R
\name{setDadaOpt}
\alias{setDadaOpt}
\title{Set DADA options}
\usage{
setDadaOpt(...)
}
\arguments{
\item{...}{(Required). The DADA options to set, along with their new values.}
}
\value{
NULL.
}
\description{
setDadaOpt sets the default options used by the dada(...) function for your current session, much
like \code{par} sets the session default plotting parameters. Alternatively, any DADA option can be set for a
single call by including a DADA_OPTION_NAME=VALUE argument in the dada(...) function call itself.
}
\details{
**Sensitivity**
OMEGA_A: This parameter sets the threshold for when DADA2 calls unique sequences significantly overabundant, and therefore creates a
new partition with that sequence as the center. Default is 1e-40, which is a conservative setting to avoid making false
positive inferences, but which comes at the cost of reducing the ability to identify some rare variants.
OMEGA_P: The threshold for unique sequences with prior evidence of existence (see `priors` argument). Default is 1e-4.
OMEGA_C: The threshold at which unique sequences inferred to contain errors are corrected in the final output.
The probability that each unique sequence
is generated at its observed abundance from the center of its final partition is evaluated, and compared to OMEGA_C. If that
probability is >= OMEGA_C, it is "corrected", i.e. replaced by the partition center sequence. The special value of 0 corresponds
to correcting all input sequences, and any value > 1 corresponds to performing no correction on sequences found to contain
errors. Default is 1e-40 (same as OMEGA_A).
DETECT_SINGLETONS: If set to TRUE, this removes the requirement for at least two reads with the same sequence to exist
in order for a new ASV to be detected. It also somewhat increases sensitivity to other low-abundance sequences,
e.g. those present in just 2/3/4/... reads. Note that this applies to all unique sequences, not just those supported by
prior evidence (see `priors` argument), and so it does make false-positive detections more likely.
**Alignment**
MATCH: The score of a match in the Needleman-Wunsch alignment. Default is 4.
MISMATCH: The score of a mismatch in the Needleman-Wunsch alignment. Default is -5.
GAP_PENALTY: The cost of gaps in the Needleman-Wunsch alignment. Default is -8.
HOMOPOLYMER_GAP_PENALTY: The cost of gaps in homopolymer regions (>=3 repeated bases). Default is NULL, which causes homopolymer
gaps to be treated as normal gaps.
BAND_SIZE: When set, banded Needleman-Wunsch alignments are performed. Banding restricts the net cumulative number of insertions
of one sequence relative to the other. The default value of BAND_SIZE is 16. If DADA is applied to sequencing technologies with
high rates of indels, such as 454 sequencing, the BAND_SIZE parameter should be increased. Setting BAND_SIZE to a negative number
turns off banding (i.e. full Needleman-Wunsch).
**Sequence Comparison Heuristics**
USE_KMERS: If TRUE, a 5-mer distance screen is performed prior to performing each pairwise alignment, and if the 5-mer distance
is greater than KDIST_CUTOFF, no alignment is performed. Default is TRUE.
KDIST_CUTOFF: The default value of 0.42 was chosen to screen pairs of sequences that differ by >10\%, and was
calibrated on Illumina sequenced 16S amplicon data. The assumption is that sequences that differ by such a large
amount cannot be linked by amplicon errors (i.e. if you sequence one, you won't get a read of the other) and so
careful (and costly) alignment is unnecessary.
GAPLESS: If TRUE, the ordered kmer identity between pairs of sequences is compared to their unordered
overlap. If equal, the optimal alignment is assumed to be gapless. Default is TRUE.
Only relevant if USE_KMERS is TRUE.
GREEDY: The DADA2 algorithm is not greedy, but a very restricted form of greediness can be turned
on via this option. If TRUE, unique sequences with fewer reads than would be expected from
resequencing just the central unique in their partition are "locked" to that partition.
This gives a modest (~30\%) speedup with almost no impact on output. Default is TRUE.
**New Partition Conditions**
MIN_FOLD: The minimum fold-overabundance for sequences to form new partitions. Default value is 1, which means this
criterion is ignored.
MIN_HAMMING: The minimum Hamming separation for sequences to form new partitions. Default value is 1, which means this
criterion is ignored.
MIN_ABUNDANCE: The minimum abundance for unique sequences to form new partitions. Default value is 1, which means this
criterion is ignored.
MAX_CLUST: The maximum number of partitions. Once this many partitions have been created, the algorithm terminates regardless
of whether the statistical model suggests more real sequence variants exist. If set to 0 this argument is ignored. Default
value is 0.
**Self Consistency**
MAX_CONSIST: The maximum number of steps when selfConsist=TRUE. If convergence is not reached in MAX_CONSIST steps,
the algorithm will terminate with a warning message. Default value is 10.
**Pseudo-pooling Behavior**
PSEUDO_PREVALENCE: When performing pseudo-pooling, all sequence variants found in at least this many
samples are used as priors for a subsequent round of sample inference.
Only relevant if `pool="pseudo"`. Default is 2.
PSEUDO_ABUNDANCE: When performing pseudo-pooling, all denoised sequence variants with total
abundance (over all samples) greater than this are used as priors for a subsequent round
of sample inference.
Only relevant if `pool="pseudo"`. Default is Inf (i.e. abundance ignored for this purpose).
**Error Model**
USE_QUALS: If TRUE, the dada(...) error model takes into account the consensus quality score of the dereplicated unique sequences.
If FALSE, quality scores are ignored. Default is TRUE.
**Technical**
SSE: Controls the level of explicit SSE vectorization for kmer calculations. Default is 2. Maintained for development reasons;
it should have no impact on output.
\itemize{
\item{0: No explicit vectorization (but modern compilers will auto-vectorize the code).}
\item{1: Explicit SSE2. }
\item{2: Explicit, packed SSE2 using 8-bit integers. Slightly faster than SSE=1. }
}
}
\examples{
setDadaOpt(OMEGA_A = 1e-20)
setDadaOpt(MATCH=1, MISMATCH=-4, GAP_PENALTY=-6)
setDadaOpt(GREEDY=TRUE, GAPLESS=TRUE)
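# The lines below are an illustrative sketch, not part of the original examples.
# Assumptions: getDadaOpt("NAME") returns the current value of a single option, and the
# sample fastq file and the tperr1 error matrix distributed with dada2 are available.
old_band <- getDadaOpt("BAND_SIZE")   # remember the current session default
setDadaOpt(BAND_SIZE = 32)            # widen the alignment band for this session
getDadaOpt("BAND_SIZE")               # inspect the value now in effect
setDadaOpt(BAND_SIZE = old_band)      # restore the previous session default
# An option can also be overridden for a single dada(...) call, leaving the session default unchanged:
fn1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
dd <- dada(derepFastq(fn1), err = tperr1, OMEGA_A = 1e-20)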
}
\seealso{
\code{\link{getDadaOpt}}
}