CoverM Dereplication: Why Do .tsv Outputs Differ from cluster_definition.tsv? #273

@sarehaghababaee


Hello,
I am working with 12 metagenomic samples, from which I obtained 445 high-quality MAGs. I am using CoverM version 0.7.0 and ran it on these genomes using the script provided below.
Issue
The representative genomes listed in the per-sample .tsv output files (written with the option
--output-file "$dir/19.CoverM/coverm_output/${b}_coverm.tsv") differ from the representative genomes listed in the file written by the --dereplication-output-cluster-definition option.
Some genomes that belong to the same cluster in cluster_definition.tsv appear as separate representative genomes in the .tsv output.
Example:
• In cluster_definition.tsv: 43.66.fa and 40.38.fa belong to the same cluster.
• In 144_coverm.tsv: they are presented as two separate representative genomes.
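A comparison like the one in this example can be scripted. The following is only a minimal sketch: non_representatives is a hypothetical helper name, and it assumes that the cluster definition file has two tab-separated columns (representative path, then member path) and that the first column of the coverage table holds genome names without directory or extension. Neither assumption is confirmed by the attached files, so adjust as needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print genomes that appear in a CoverM table but are
# NOT listed as a representative in the cluster definition file.
# Assumptions (unverified): cluster file = "representative<TAB>member" paths;
# table column 1 = genome name without directory or extension; the table has
# one header line and may contain an "unmapped" row.
non_representatives() {
  local clusters="$1" table="$2" ext="${3:-fa}"
  awk -F'\t' -v ext="$ext" '
    # Reduce a path to a bare genome name: drop directory and ".<ext>" suffix.
    function name(p) { sub(/.*\//, "", p); sub("\\." ext "$", "", p); return p }
    NR == FNR { reps[name($1)] = 1; next }    # first file: collect representatives
    FNR == 1 || $1 == "unmapped" { next }     # second file: skip header and unmapped row
    !(name($1) in reps) { print $1 }          # genome in table but not a representative
  ' "$clusters" "$table"
}
```

Called as non_representatives cluster_definitions.tsv 144_coverm.tsv, it prints one genome name per line; empty output would mean every genome in the table is also a representative in the cluster file.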
Checks I Performed
• I ran dRep independently.
• I compared the number of clusters and the genome-species composition of each cluster between CoverM and dRep. They matched.
• The representative genomes from CoverM’s cluster_definition.tsv (but not those in the .tsv outputs) closely match those identified by dRep.
Could you please advise me on why this discrepancy occurs between the .tsv outputs and the cluster_definition.tsv file?
Additional Information
• For dereplication in CoverM, I used the file quality_report.tsv as input for the --checkm2-quality-report option.
• I have attached both the cluster_definition file and the .tsv file for one of the samples.
Here is the script I used to run CoverM:

if [[ -e "$dir/04.trimmed_fasta/${b}_2.fa" ]]; then
   coverm genome \
     -1 "$dir/04.trimmed_fasta/${b}_1.fa" \
     -2 "$dir/04.trimmed_fasta/${b}_2.fa" \
     --genome-fasta-directory "$dir/19.CoverM/genomes" \
     -x fa \
     --output-file "$dir/19.CoverM/coverm_output/${b}_coverm.tsv" \
     --output-format dense \
     --methods relative_abundance mean \
     --dereplicate \
     --dereplication-cluster-method fastani \
     --checkm2-quality-report "$dir/17.checkm2/output/quality_report.tsv" \
     --dereplication-quality-formula completeness-5contamination \
     --dereplication-output-cluster-definition "$dir/19.CoverM/coverm_output/cluster_definitions.tsv" \
     --dereplication-output-representative-list "$dir/19.CoverM/coverm_output/representative_paths.txt" \
     --threads "$THR"
else
   coverm genome \
     --single "$dir/04.trimmed_fasta/${b}_SingleReads.fa" \
     --genome-fasta-directory "$dir/19.CoverM/genomes" \
     -x fa \
     --output-file "$dir/19.CoverM/coverm_output/${b}_coverm.tsv" \
     --output-format dense \
     --methods relative_abundance mean \
     --dereplicate \
     --dereplication-cluster-method fastani \
     --checkm2-quality-report "$dir/17.checkm2/output/quality_report.tsv" \
     --dereplication-quality-formula completeness-5contamination \
     --dereplication-output-cluster-definition "$dir/19.CoverM/coverm_output/cluster_definitions.tsv" \
     --dereplication-output-representative-list "$dir/19.CoverM/coverm_output/representative_paths.txt" \
     --threads "$THR"
fi

cluster_definitions.tsv

144_coverm.tsv
