Clustering
Diamond clusters sequences analogously to Cd-Hit or UClust based on a user-defined clustering criterion:
it finds a set of centroid (representative) sequences and assigns each input sequence to the cluster
of one centroid such that the clustering criterion against that centroid is fulfilled. The clustering criterion
is defined by the sequence coverage of the local alignment as well as its sequence identity (see below). Note that
due to the heuristic nature of the cascaded clustering algorithm, these cutoff values guide the
computation, but their fulfillment is not guaranteed unless the recluster workflow is used (see below).
Basic command line example:
diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
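When the clustering is launched from a script, the same invocation can be assembled programmatically. The following is a minimal sketch only: the helper name `build_cluster_command` and its keyword arguments are hypothetical, and it uses nothing beyond the options documented on this page. It builds the argument list without running anything.

```python
def build_cluster_command(database, out, approx_id=None, memory_limit=None,
                          member_cover=None, header=False):
    """Assemble an argument list for `diamond cluster` (hypothetical helper)."""
    cmd = ["diamond", "cluster", "-d", str(database), "-o", str(out)]
    if approx_id is not None:
        cmd += ["--approx-id", str(approx_id)]
    if member_cover is not None:
        cmd += ["--member-cover", str(member_cover)]
    if memory_limit is not None:
        cmd += ["-M", str(memory_limit)]
    if header:
        cmd.append("--header")
    return cmd


if __name__ == "__main__":
    # Rebuilds the basic example shown above as a single command string.
    cmd = build_cluster_command("INPUT_FILE", "OUTPUT_FILE",
                                approx_id=30, memory_limit="64G")
    print(" ".join(cmd))
```

On a machine where diamond is installed, the returned list could be handed to `subprocess.run(cmd, check=True)`.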
Cluster an input database of protein sequences.
- --database/-d
  The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format.
- --out/-o
  Output file. This is a 2-column tabular file with the centroid accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.
- --header
  Enable a header line in the output file.
- --memory-limit/-M #
  Set a memory limit for the diamond process (for example: -M 64G). This is not a hard upper limit and may still be exceeded in certain cases. Decrease this number if the tool fails because it runs out of memory. Higher values improve performance considerably, so it is strongly recommended to always set this option. Note that this option affects the algorithm and therefore the results, since clustering is a heuristic procedure with no unique solution.
- --approx-id #
  The identity cutoff for the clustering (in %). Note that for performance reasons, the setting refers to the approximate sequence identity derived as a linear regression from the bitscore, not the actual number of identities in the alignment. The default value is 50% when running diamond cluster and 0% when running diamond deepclust.
- --member-cover #
  The minimum coverage of the cluster member sequence by the centroid (in %). This is a unidirectional coverage, i.e. a minimum coverage of the centroid is not required. The default is 80%.
- --no-block-size-limit
  Do not limit the block size to recommended maximums.
- --cluster-steps
  Set the sequence of clustering rounds for cascaded clustering as a space-separated list. Permitted keywords are the sensitivity switches of the alignment workflow (e.g. sensitive). When missing, this parameter is chosen automatically based on the --approx-id parameter.
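The 2-column output described under --out/-o is straightforward to post-process. The sketch below groups member accessions by centroid. It assumes the columns are tab-separated; the function name `read_clusters`, the `has_header` flag and the sample accessions are made up for illustration.

```python
from collections import defaultdict


def read_clusters(lines, has_header=False):
    """Group member accessions by centroid from 2-column tabular lines."""
    clusters = defaultdict(list)
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip the header line written when --header is set
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Column 1 is the centroid accession, column 2 the member accession.
        centroid, member = line.split("\t")[:2]
        clusters[centroid].append(member)
    return dict(clusters)


if __name__ == "__main__":
    # Invented sample rows; with a real file, pass open("OUTPUT_FILE") instead.
    example = ["seqA\tseqA\n", "seqA\tseqB\n", "seqC\tseqC\n"]
    for centroid, members in read_clusters(example).items():
        print(centroid, len(members), ",".join(members))
```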
The realign workflow takes a clustering computed by the cluster workflow as input and computes alignments of
all sequences in the original database against their assigned centroid sequences.
- --clusters
  The clustering in 2-column tabular format.
- --outfmt/-f
  Set the output format. Only tabular format is supported for this workflow. The default corresponds to the format -f 6 qseqid sseqid approx_pident qstart qend sstart send evalue bitscore of the alignment workflow, where the query and subject correspond to the centroid and the cluster member sequence respectively.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --member-cover.
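The default tabular output of this workflow has nine fields, which can be read into typed records for downstream checks. This is a minimal sketch assuming tab-separated fields in the default order given above; the names `RealignRow` and `parse_realign_line` are hypothetical, and the sample line in the demo is invented.

```python
from typing import NamedTuple


class RealignRow(NamedTuple):
    qseqid: str           # centroid accession (query)
    sseqid: str           # cluster member accession (subject)
    approx_pident: float  # approximate identity in %
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_realign_line(line):
    """Parse one line of the default tabular output into a RealignRow."""
    f = line.rstrip("\n").split("\t")
    return RealignRow(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                      int(f[5]), int(f[6]), float(f[7]), float(f[8]))


if __name__ == "__main__":
    # Invented sample line, not real program output.
    row = parse_realign_line("cenA\tmemB\t87.5\t1\t120\t3\t118\t1e-50\t210.3\n")
    print(row.qseqid, row.sseqid, row.approx_pident)
```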
The recluster workflow fixes errors in a given clustering where a cluster member sequence does not satisfy the clustering criterion against its centroid. Such errors may arise from the heuristic nature of the cascaded clustering algorithm, which merges clusters based on alignments of their centroid sequences.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
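To make the clustering criterion concrete, the sketch below flags members whose alignment against their centroid misses either cutoff. It is not how the workflow itself is implemented. It assumes that coverage is the span of the alignment on the member sequence with 1-based inclusive coordinates, and that member lengths are supplied separately; all function and argument names are hypothetical.

```python
def member_coverage(sstart, send, member_length):
    """Percentage of the member spanned by the alignment (1-based, inclusive)."""
    return 100.0 * (abs(send - sstart) + 1) / member_length


def find_violations(alignments, member_lengths, approx_id=50.0, member_cover=80.0):
    """Return member accessions whose alignment to the centroid misses a cutoff.

    alignments: iterable of (centroid, member, approx_pident, sstart, send)
    member_lengths: dict mapping member accession to sequence length
    """
    bad = []
    for centroid, member, approx_pident, sstart, send in alignments:
        cover = member_coverage(sstart, send, member_lengths[member])
        # Unidirectional criterion: only the member's coverage is checked.
        if approx_pident < approx_id or cover < member_cover:
            bad.append(member)
    return bad


if __name__ == "__main__":
    # Invented sample data for illustration.
    alns = [("c1", "m1", 90.0, 1, 100), ("c1", "m2", 40.0, 1, 100)]
    print(find_violations(alns, {"m1": 100, "m2": 100}))
```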
For a given clustering, the reassign workflow attempts to reassign all non-centroid sequences to the closest centroid sequence as measured by the e-value of the local alignment.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
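The idea of choosing the closest centroid by e-value can be illustrated as follows. This is only a sketch of the selection rule, not the tool's implementation: it assumes a list of candidate (member, centroid, e-value) hits is already available, and the name `closest_centroids` is hypothetical.

```python
def closest_centroids(hits):
    """Map each member to the centroid with the smallest e-value.

    hits: iterable of (member, centroid, evalue)
    """
    best = {}
    for member, centroid, evalue in hits:
        # Keep the hit with the lowest e-value seen so far for this member.
        if member not in best or evalue < best[member][1]:
            best[member] = (centroid, evalue)
    return {member: centroid for member, (centroid, _) in best.items()}


if __name__ == "__main__":
    # Invented sample hits for illustration.
    hits = [("m1", "cA", 1e-10), ("m1", "cB", 1e-30), ("m2", "cA", 1e-5)]
    print(closest_centroids(hits))
```

On ties, this sketch keeps the first hit encountered.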
The greedy vertex cover workflow computes a greedy vertex cover clustering based on alignment input.
- --edges
  Input file containing alignments/graph edges for clustering. By default, a TSV file with 5 columns is expected: query target query-cover target-cover edge-weight.
- --database/-d
  A TSV file whose first column needs to be a list of all accessions that occur in the edges file as either query or target. This must not be a sequence database file.
- --edge-format (triplet)
  Enable triplet edge format: query target edge-weight. The semantics are a unidirectional representation of the query by the target.
- --centroid-out
  Output file for centroid list.
These parameters of the cluster workflow apply accordingly: --out/-o, --header, --member-cover.
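As a rough illustration of the greedy idea, the sketch below repeatedly picks the vertex that can represent the most still-unassigned vertices and makes it a centroid. It is a generic textbook-style greedy cover, not DIAMOND's actual implementation. It assumes the 5-column edge format, that a sequence may be represented by the other end of an edge when its own coverage reaches the --member-cover cutoff, and it ignores the edge weight. The function name is hypothetical.

```python
def greedy_vertex_cover(vertices, edges, member_cover=80.0):
    """Assign every vertex to a centroid with a simple greedy cover.

    vertices: iterable of accessions
    edges: iterable of (query, target, query_cover, target_cover, edge_weight)
    Returns a dict mapping each accession to its centroid accession.
    """
    # represents[v] = vertices that v may represent as a centroid (assumption)
    represents = {v: set() for v in vertices}
    for query, target, qcov, tcov, _weight in edges:
        if qcov >= member_cover:
            represents[target].add(query)
        if tcov >= member_cover:
            represents[query].add(target)

    unassigned = set(represents)
    assignment = {}
    while unassigned:
        # Pick the vertex covering the most unassigned vertices;
        # sorting first makes ties resolve to the smallest accession.
        centroid = max(sorted(unassigned),
                       key=lambda v: len(represents[v] & unassigned))
        assignment[centroid] = centroid
        for member in represents[centroid] & unassigned:
            assignment[member] = centroid
        unassigned -= represents[centroid] | {centroid}
    return assignment


if __name__ == "__main__":
    # Invented sample graph for illustration.
    edges = [("b", "a", 90, 90, 1.0), ("c", "a", 85, 50, 1.0), ("d", "c", 30, 30, 1.0)]
    print(greedy_vertex_cover(["a", "b", "c", "d"], edges))
```

Vertices without any qualifying edge end up as singleton clusters with themselves as centroid.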
These parameters of the alignment workflow apply accordingly to the cluster, realign, recluster, reassign and greedy vertex cover workflows: --threads/-p, --verbose/-v, --log, --quiet, --tmpdir/-t.
These parameters of the alignment workflow apply accordingly to the cluster, recluster and reassign workflows: --evalue/-e, --masking, --soft-masking, --motif-masking, --ext.
These parameters of the alignment workflow apply accordingly to the cluster, realign, recluster and reassign workflows: --comp-based-stats.