Clustering

Diamond clusters sequences in a manner analogous to Cd-Hit or UClust, based on a user-defined clustering criterion: it finds a set of centroid (representative) sequences and assigns each input sequence to the cluster of one centroid such that the clustering criterion against that centroid is fulfilled. The clustering criterion is defined by the sequence coverage of the local alignment as well as its sequence identity (see below). Note that due to the heuristic nature of the cascaded clustering algorithm, these cutoff values guide the computation, but their fulfillment is not guaranteed unless the recluster workflow is used (see below).
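To make the clustering criterion concrete, the following Python sketch checks whether a single member-versus-centroid alignment would pass an identity cutoff and a member coverage cutoff. The function names are hypothetical, and the coverage formula (aligned span divided by sequence length) is an assumption for illustration, not necessarily how Diamond computes it internally.

```python
def member_coverage(aln_start: int, aln_end: int, seq_len: int) -> float:
    """Percentage of a sequence spanned by a local alignment.

    Assumes 1-based, inclusive alignment coordinates (an illustrative choice).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    return 100.0 * (aln_end - aln_start + 1) / seq_len


def satisfies_criterion(approx_id: float, aln_start: int, aln_end: int,
                        member_len: int, min_approx_id: float = 50.0,
                        min_member_cover: float = 80.0) -> bool:
    """Return True if the alignment meets both cutoffs (hypothetical helper)."""
    cover = member_coverage(aln_start, aln_end, member_len)
    return approx_id >= min_approx_id and cover >= min_member_cover


# Alignment spans residues 11..100 of a 100-residue member: 90% coverage.
passes = satisfies_criterion(62.5, 11, 100, 100)
# Alignment spans residues 31..100: only 70% coverage, below the 80% cutoff.
fails_cover = satisfies_criterion(62.5, 31, 100, 100)
```

Only the member sequence length enters the check, which mirrors the unidirectional nature of the coverage criterion.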

Basic command line example:

diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
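When the command is launched from a script, it can help to assemble the argument list programmatically. The sketch below is one possible way to do that in Python; `build_cluster_command` is a hypothetical helper, and it only uses options documented on this page.

```python
import shlex


def build_cluster_command(database, out, approx_id=None,
                          member_cover=None, memory_limit=None):
    """Assemble an argument list for `diamond cluster` (hypothetical helper)."""
    cmd = ["diamond", "cluster", "-d", database, "-o", out]
    if approx_id is not None:
        cmd += ["--approx-id", str(approx_id)]
    if member_cover is not None:
        cmd += ["--member-cover", str(member_cover)]
    if memory_limit is not None:
        cmd += ["-M", memory_limit]
    return cmd


# Reproduce the basic example above as an argument list.
cmd = build_cluster_command("INPUT_FILE", "OUTPUT_FILE",
                            approx_id=30, memory_limit="64G")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where diamond is installed.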

Cluster workflow

Cluster an input database of protein sequences.

--database/-d The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format.
--out/-o Output file. This is a 2-column tabular file with the centroid accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.
--header Enable a header line in the output file.
--memory-limit/-M # Set a memory limit for the diamond process (for example: -M 64G). This is not a hard upper limit and may still be exceeded in certain cases. Decrease this value if the tool fails due to running out of memory. Higher values improve performance considerably, so it is strongly recommended to always set this option. Note that this option affects the algorithm and therefore the results, since clustering is a heuristic procedure with no unique solution.
--approx-id # The identity cutoff for the clustering (in %). Note that for performance reasons, the setting refers to the approximate sequence identity derived as a linear regression from the bitscore, not the actual number of identities in the alignment. The default value is 50% when running diamond cluster and 0% when running diamond deepclust.
--member-cover # The minimum coverage of the cluster member sequence by the centroid (in %). This coverage is unidirectional, i.e. a minimum coverage of the centroid is not required. The default is 80%.
--no-block-size-limit Do not limit the block size to recommended maximums.
--cluster-steps Set the sequence of clustering rounds for cascaded clustering as a space-separated list. Permitted keywords are the sensitivity switches of the alignment workflow (e.g. sensitive). When omitted, this parameter is chosen automatically based on the --approx-id parameter.
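The 2-column output written by --out/-o is straightforward to post-process. The following Python sketch groups member accessions by centroid; `read_clusters` is a hypothetical helper, the file is assumed to be tab-separated, and the sample accessions are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, has_header=False):
    """Map each centroid accession to the list of its member accessions."""
    clusters = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the line written when --header is used
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        centroid, member = row[0], row[1]
        clusters[centroid].append(member)
    return dict(clusters)


# Made-up sample: column 1 is the centroid, column 2 is the member.
example = io.StringIO("seqA\tseqA\nseqA\tseqB\nseqC\tseqC\n")
clusters = read_clusters(example)
cluster_sizes = {centroid: len(members) for centroid, members in clusters.items()}
```

For a real file, pass an open file handle instead of the `io.StringIO` object.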

`realign` workflow

Given a clustering computed by the cluster workflow as input, this workflow computes alignments of all sequences in the original database against their assigned centroid sequences.

--clusters The clustering in 2-column tabular format, as produced by the cluster workflow.
--outfmt/-f Set the output format. Only tabular format is supported for this workflow. The default corresponds to the format -f 6 qseqid sseqid approx_pident qstart qend sstart send evalue bitscore of the alignment workflow, where the query and subject correspond to the centroid and the cluster member sequence respectively.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --member-cover.
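As an illustration of the default realign output, the Python sketch below reads the nine columns listed above into typed records. `RealignRecord` and `read_realign` are hypothetical names, the file is assumed to be tab-separated, and the sample line is made up.

```python
import csv
import io
from typing import NamedTuple


class RealignRecord(NamedTuple):
    centroid: str         # qseqid (the query is the centroid)
    member: str           # sseqid (the subject is the cluster member)
    approx_pident: float
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def read_realign(handle, has_header=False):
    """Parse default-format realign output into a list of records."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the line written when --header is used
    records = []
    for row in reader:
        if len(row) != 9:
            continue  # ignore blank or malformed lines
        records.append(RealignRecord(
            row[0], row[1], float(row[2]),
            int(row[3]), int(row[4]), int(row[5]), int(row[6]),
            float(row[7]), float(row[8]),
        ))
    return records


# Made-up sample line in the default column order.
sample = io.StringIO("seqA\tseqB\t71.3\t1\t120\t5\t124\t1e-50\t180.5\n")
records = read_realign(sample)
```

If a custom column list is passed via --outfmt/-f, the record fields would need to be adjusted to match.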

`recluster` workflow

Fixes errors in a given clustering where a cluster member sequence does not satisfy the clustering criterion against its centroid. Such errors may arise from the heuristic nature of the cascaded clustering algorithm, which merges clusters based on alignments of their centroid sequences.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.

`reassign` workflow

For a given clustering, attempts to reassign all non-centroid sequences to the closest centroid sequence as measured by the e-value of the local alignment.

These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
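The idea of reassigning each member to its closest centroid can be sketched in a few lines of Python. This is a conceptual illustration only, not Diamond's implementation: `closest_centroids` is a hypothetical helper, and the hit tuples are made-up data.

```python
def closest_centroids(hits):
    """Pick, for each member, the centroid with the lowest e-value.

    `hits` is an iterable of (member, centroid, evalue) tuples.
    On equal e-values the first hit encountered is kept.
    """
    best = {}
    for member, centroid, evalue in hits:
        if member not in best or evalue < best[member][1]:
            best[member] = (centroid, evalue)
    return {member: centroid for member, (centroid, _) in best.items()}


# Made-up hits: seqB aligns to two centroids, seqD to one.
hits = [
    ("seqB", "seqA", 1e-20),
    ("seqB", "seqC", 1e-35),
    ("seqD", "seqA", 1e-10),
]
assignment = closest_centroids(hits)
```

Here seqB moves to seqC because that alignment has the lower e-value.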

`greedy-vertex-cover` workflow

Compute greedy vertex cover clustering based on alignment input.

--edges Input file containing alignments/graph edges for clustering. By default, a TSV file with 5 columns is expected: query target query-cover target-cover edge-weight.
--database/-d A TSV file whose first column needs to be a list of all accessions that occur in the edges file as either query or target. This must not be a sequence database file.
--edge-format (triplet) Enable triplet edge format: query target edge-weight. Each edge is interpreted as a unidirectional representation of the query by the target.
--centroid-out Output file for centroid list.

These parameters of the cluster workflow apply accordingly: --out/-o, --header, --member-cover.
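For readers unfamiliar with the approach, here is a textbook-style Python sketch of greedy vertex cover clustering on 5-column edges. It is not Diamond's implementation. It assumes that an edge lets the target represent the query when the query coverage meets the cutoff, it ignores the edge weight, and it breaks ties by accession name. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def greedy_vertex_cover(accessions, edges, min_member_cover=80.0):
    """Assign every accession to a centroid using a greedy cover heuristic.

    `edges` holds (query, target, query_cover, target_cover, weight) tuples.
    Returns a dict mapping each accession to its centroid.
    """
    # covers[t] = set of sequences that t could represent as a centroid
    covers = defaultdict(set)
    for query, target, query_cover, _target_cover, _weight in edges:
        if query != target and query_cover >= min_member_cover:
            covers[target].add(query)

    unassigned = set(accessions)
    assignment = {}
    while unassigned:
        # Choose the vertex covering the most unassigned sequences;
        # sorting first makes tie-breaking deterministic.
        best = max(sorted(unassigned),
                   key=lambda v: len(covers[v] & unassigned))
        assignment[best] = best
        unassigned.discard(best)
        for member in covers[best] & unassigned:
            assignment[member] = best
        unassigned -= covers[best]
    return assignment


# Made-up graph: a can represent b and c; c can represent d.
edges = [
    ("b", "a", 90.0, 90.0, 100.0),
    ("c", "a", 85.0, 50.0, 80.0),
    ("a", "b", 90.0, 90.0, 100.0),
    ("d", "c", 95.0, 95.0, 50.0),
    ("c", "d", 60.0, 95.0, 50.0),  # below the coverage cutoff, ignored
]
assignment = greedy_vertex_cover(["a", "b", "c", "d"], edges)
```

In this example a is chosen first because it covers two sequences, after which d remains and becomes its own centroid.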

Alignment options

These parameters of the alignment workflow apply accordingly to the cluster, realign, recluster, reassign and greedy-vertex-cover workflows: --threads/-p, --verbose/-v, --log, --quiet, --tmpdir/-t.

These parameters of the alignment workflow apply accordingly to the cluster, recluster and reassign workflows: --evalue/-e, --masking, --soft-masking, --motif-masking, --ext.

These parameters of the alignment workflow apply accordingly to the cluster, realign, recluster and reassign workflows: --comp-based-stats.
