Ryan R. Wick1 and Kathryn E. Holt1,2
1. Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria 3004, Australia
2. London School of Hygiene & Tropical Medicine, London WC1E 7HT, UK
This repo contains a standalone Python script (dereplicator.py
) to solve a problem I occasionally run into: dereplicating a group of bacterial genome assemblies. Dereplication means removing assemblies for which there are sufficiently close relatives, resulting in a smaller set where the assemblies are more unique.
As an example, imagine you have 10000 genome assemblies for a particular taxon and want to do some analysis on them, maybe building a pan genome. You know there is redundancy in this set because some of the genomes come from outbreaks and are nearly identical to each other. So instead of doing the analysis on all 10000 assemblies, you can dereplicate them to a smaller set (i.e. remove near-identical redundant genomes) so your analysis will be faster.
To give you a visual idea of how this works, here are trees built from 1000 assemblies from the genus Klebsiella dereplicated to various distance levels. You can see that in the no-dereplication and lower-distance trees, one clade dominates (corresponding to K. pneumoniae, the most sequenced species in the genus). The higher-distance trees are more balanced, and when dereplicated to a distance of 0.025, there is only one assembly per species/subspecies.
You'll need Python 3.8 or later to run dereplicator.py
. It's a standalone script and has no Python package dependencies (i.e. it only uses the standard library).
There is one external dependency, Mash, which you'll need installed and callable on your command line. If you can run mash sketch -h
and mash dist -h
without getting an error, you should be good to go!
Since Assembly Dereplicator is a single script, no installation is required. You can simply clone it and run it:
git clone https://github.com/rrwick/Assembly-dereplicator
Assembly-dereplicator/dereplicator.py --help
If you plan on using it often, you can copy it to someplace in your PATH variable for easier access:
git clone https://github.com/rrwick/Assembly-dereplicator
cp Assembly-dereplicator/dereplicator.py ~/.local/bin
dereplicator.py --help
If you'd like to double-check that everything works as intended, you can run this repo's automated tests.
The basic dereplication process is to find the closest pair of assemblies in the set, discard the assembly in that pair with the lower N50 value and repeat until a stop condition is reached. By using N50 as a metric of assembly quality, complete assemblies are preferentially kept over draft assemblies.
There are three possible ways to specify how much to dereplicate:
- A minimum pairwise distance, e.g.
--distance 0.001
. This will make dereplication continue until no two assemblies in the set have a Mash distance of less than the given value. - An assembly count, e.g.
--count 100
. This will make dereplication continue until the target number of assemblies is reached. - An assembly fraction, e.g.
--fraction 0.5
. This will make dereplication continue until the target fraction of assemblies is reached.
If multiple criteria are used, then dereplication will continue until all are satisfied. E.g. --distance 0.001 --count 100
will ensure that no two assemblies have a Mash distance of greater than 0.001 and there are no more than 100 assemblies.
The process is deterministic, so running the script multiple times will give the same result. Also, running the script in multiple steps (e.g. first with --distance 0.01
then again with --distance 0.02
) will give the same result as running in a single step (e.g. using --distance 0.02
the first time).
Assembly Dereplicator is easy to run. Just give it an input directory containing your assemblies in FASTA format, an output directory and a value for --distance
, --count
or --fraction
:
dereplicator.py --distance 0.01 assembly_dir output_dir
usage: dereplicator.py [--distance DISTANCE] [--count COUNT] [--fraction FRACTION]
[--sketch_size SKETCH_SIZE] [--threads THREADS] [--verbose] [-h]
[--version]
in_dir out_dir
Assembly Dereplicator
Positional arguments:
in_dir Directory containing all assemblies
out_dir Directory where dereplicated assemblies will be copied
Dereplication target:
--distance DISTANCE Dereplicate until the closest pair has a Mash distance of this value
or greater
--count COUNT Dereplicate until there are no more than this many assemblies
--fraction FRACTION Dereplicate until there are no more than this fraction of the
original number of assemblies
Settings:
--sketch_size SKETCH_SIZE Mash assembly sketch size (default: 10000)
--threads THREADS Number of CPU threads for Mash (default: 10)
--verbose Display more output information
Other:
-h, --help Show this help message and exit
--version Show program's version number and exit
Positional arguments:
in_dir
: directory containing input assemblies, will be searched recursively. The input assemblies must be in FASTA format and the script will find them in the input directory based on extension. Any of the following are acceptable:.fasta
,.fasta.gz
,.fna
,.fna.gz
,.fa
and.fa.gz
.out_dir
: directory where dereplicated assemblies (i.e. the representative assembly for each cluster) will be copied. If this directory doesn't exist, it will be created. Copied assemblies will have the same filename they had in the input directory.
Dereplication target:
--distance
: the target minimum pairwise Mash distance between assemblies. Must be between 0 and 1. Mash distance roughly corresponds to one minus average nucleotide identity. Setting it to a small value (e.g. 0.001) will result in more dereplicated assemblies, i.e. only very close relatives will be excluded. Setting it to a large value (e.g. 0.02) will result in fewer dereplicated assemblies, perhaps just one per species.--count
: the target number of output assemblies. If this value is equal to or greater than the number of input assemblies, no dereplication will take place.--fraction
: the target fraction of output assemblies. Must be between 0 and 1.- At least one of
--distance
,--count
or--fraction
must be used. Multiple targets can be used (e.g. both--distance
and--count
), in which case dereplication will continue until all targets are met.
Settings:
--sketch_size
: controls the Mash sketch size. A smaller value (e.g. 1000) will make the process faster but slightly less accurate.--threads
: controls how many threads Mash uses for its sketching and distances. Mash scales well in parallel, so use lots of threads if you have them!--verbose
: using this flag results in more information being printed to stdout.
Since dereplicator.py
uses pairwise distances, it scales with the square of the number of input assemblies, i.e. O(n2). This means that small numbers of input assemblies (e.g. <100) will be very fast, and large numbers (e.g. >10000) will take much longer. Mash parallelises well, so use as many threads as you can for large datasets.
As a test, I ran dereplicator.py
on all Klebsiella genomes in GTDB. At the time of writing, this was about 13700 assemblies, so about 188 million pairwise comparisons. Using 48 CPU threads, the process took about 40 minutes to complete and used about 35 GB of RAM.