Less accurate estimator compared to Mash/sourmash/ANI #97

@jianshu93

Hello Daniel,

I am attaching a real-world genome from the global Tara Oceans metagenomic study. I searched it against all GTDB representative genomes (https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/genomic_files_reps/gtdb_genomes_reps_r207.tar.gz) to find the top 20 best matches by ANI, using OrthoANI (https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000760) as the reference. Both Mash and sourmash perform well: they normally recover 16 to 17 of the top 20 ANI hits. Dashing, however (with both the default MLE estimator and JMLE), does poorly when ANI is below 80%: only 9 of the 20 are found (the top 10 are fine). So for less similar genomes, Dashing is much worse than Mash or sourmash, both of which use MinHash rather than HyperLogLog. I was under the impression that a Jaccard index estimated with HLL should be as good as one estimated with MinHash.
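For anyone reproducing the comparison, the "N of the top 20 found" figure can be computed as a top-k recall against the ANI ranking. The snippet below is only a minimal sketch: the function name `top_k_recall` and the toy scores are hypothetical and are not taken from the attached results.

```python
def top_k_recall(reference_scores, estimated_scores, k=20):
    """Fraction of the reference top-k hits that also appear in the estimated top-k.

    Both arguments map genome name -> similarity (higher means more similar),
    e.g. ANI for the reference and an estimated Jaccard index for a sketching tool.
    """
    ref_top = sorted(reference_scores, key=reference_scores.get, reverse=True)[:k]
    est_top = sorted(estimated_scores, key=estimated_scores.get, reverse=True)[:k]
    if not ref_top:
        return 0.0
    return len(set(ref_top) & set(est_top)) / len(ref_top)


# Toy, made-up values for illustration only (not from the issue's data).
ani = {"g1": 99.0, "g2": 95.0, "g3": 90.0, "g4": 85.0, "g5": 79.0, "g6": 78.0}
est_good = {"g1": 0.90, "g2": 0.60, "g3": 0.40, "g4": 0.20, "g5": 0.05, "g6": 0.01}
est_bad = {"g1": 0.90, "g2": 0.60, "g3": 0.40, "g6": 0.20, "g4": 0.05, "g5": 0.01}

recall_good = top_k_recall(ani, est_good, k=4)
recall_bad = top_k_recall(ani, est_bad, k=4)
print(recall_good, recall_bad)
```

With real data, `reference_scores` would hold the OrthoANI values and `estimated_scores` the similarity column parsed from each tool's output, with `k=20`.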

These are the commands I used:

dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F name.txt &
dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F query_name.txt

Then I collect all the HLL sketch files from the genome folder and create a list of those files.

dashing dist -F ./query_name_dashing_hll_JMLE.txt -Q name_dashing_hll_JMLE.txt --full-tsv --nthreads 128 --presketched -O ./OceanDNA-b42278.dashing.hll.JMLE.gtdb.txt

I am using the same k and sketch size (2^14) in Mash and sourmash. The top 10 are fine; nearly all of them are found. I also compared with our most recent SetSketch 1 implementation (equivalent to HLL), and our results are consistent with sourmash and Mash. The attached PDF shows the 10th to 20th best hits to the query (OceanDNA-b42278.fa) found by the tools mentioned above, for you to double-check. (Ignore "top 10" in the table title; the table actually lists the 10 hits ranked 10th to 20th.) Should I use an even larger sketch size to better approximate ANI? I think not, because the top 10 are already very good, which suggests the sketch size is sufficient. Dashing is certainly faster than Mash, so I am wondering what the downside of that speed could be, e.g., lower accuracy for very small Jaccard indices (dissimilar genomes).
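To give a sense of how small the Jaccard index becomes in this range, here is a minimal sketch based on the Mash distance formula D = -(1/k) ln(2J / (1 + J)) and the rough approximation ANI ≈ 1 - D. Both are assumptions brought in from the Mash paper, not something Dashing necessarily uses internally, and the function names are illustrative.

```python
import math


def mash_distance(jaccard, k):
    """Mash distance from a Jaccard index: D = -(1/k) * ln(2J / (1 + J))."""
    if jaccard <= 0.0:
        return 1.0  # convention chosen for this sketch when nothing is shared
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


def jaccard_from_distance(distance, k):
    """Inverse of the formula above: J = 1 / (2 * exp(k * D) - 1)."""
    return 1.0 / (2.0 * math.exp(k * distance) - 1.0)


k = 16  # same k-mer size as in the commands above
for ani in (0.99, 0.95, 0.90, 0.80):
    d = 1.0 - ani  # assumption: ANI is approximately 1 - D
    print(f"ANI ~ {ani:.2f} -> expected Jaccard ~ {jaccard_from_distance(d, k):.4f}")
```

Under these assumptions, an ANI of about 80% with k = 16 corresponds to a Jaccard index of roughly 0.02, so the hits ranked 10th to 20th sit in a regime where small absolute errors in the estimated Jaccard index can reorder the ranking. This only shows the scale of the values involved; it does not by itself explain the difference between the tools.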

Thanks,

Jianshu

OceanDNA-b42278.fa.zip

Blastn-ANI-dashing-setsketch.pdf
