binsplitting not happening with different separator? #460

@xvazquezc

Hi there,

I've been using Vamb for a while (esp. TaxVamb since 4.1.4.dev150+g8fa3280) and recently decided to update. I installed Vamb from master (5.0.5.dev20+g8a13cf5f8) and, after dealing with the stricter checks on the taxonomy files (every contig now needs to be listed), I ran into something odd: the vaevae_clusters_split.tsv file is essentially empty (column headers only), yet the log contains no error or warning about binsplitting not happening.

(base) z3382651@katana1:.../Functional/binning $ ll -rht taxvamb5_out/
total 881M
-rw-rw----+ 1 z3382651 ferrari 288M Nov  5 00:52 composition.npz
-rw-rw----+ 1 z3382651 ferrari  30M Nov  5 00:53 abundance.npz
-rw-rw----+ 1 z3382651 ferrari  32M Nov  5 02:46 predictor_model.pt
-rw-rw----+ 1 z3382651 ferrari 169M Nov  5 03:25 results_taxometer.tsv
-rw-rw----+ 1 z3382651 ferrari 202M Nov  5 10:26 vaevae_model.pt
-rw-rw----+ 1 z3382651 ferrari  87M Nov  5 10:28 vaevae_latent.npz
-rw-rw----+ 1 z3382651 ferrari   23 Nov  5 10:35 vaevae_clusters_split.tsv
-rw-rw----+ 1 z3382651 ferrari  53M Nov  5 10:35 vaevae_clusters_unsplit.tsv
-rw-rw----+ 1 z3382651 ferrari  22M Nov  5 10:35 vaevae_clusters_metadata.tsv
drwxrws---+ 2 z3382651 ferrari  44K Nov  5 10:36 bins
-rw-rw----+ 1 z3382651 ferrari 159K Nov  5 10:36 log.txt
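
A quick way to confirm that a cluster file holds nothing but its header is to count the rows after the first line. The following is a minimal sketch, not part of Vamb: the helper name count_data_rows and the demo file are hypothetical, and the header text "clustername<TAB>contigname" is an assumption (it happens to be 23 bytes including the newline, which matches the size in the listing above).

```python
import csv
import os
import tempfile


def count_data_rows(path):
    """Return the number of non-empty rows after the header line of a TSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line, if there is one
        return sum(1 for row in reader if row)


# Demo on a throwaway file; the header below is an ASSUMED layout, not taken from the issue.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "clusters_split_demo.tsv")
    with open(demo_path, "w", newline="\n") as handle:
        handle.write("clustername\tcontigname\n")
    header_only_rows = count_data_rows(demo_path)
    header_only_size = os.path.getsize(demo_path)
```

Pointing count_data_rows at the real vaevae_clusters_split.tsv and vaevae_clusters_unsplit.tsv would show whether only the split file is affected.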

I'm reusing contigs that are already "labelled" in the form {SAMPLE}-{CONTIG}, i.e., with - as the separator (passed as such to the vamb bin taxvamb command). The separator seems to be detected, as it shows up near the end of the log:

2025-11-05 10:28:06.293 | INFO    | Clustering
2025-11-05 10:28:06.294 | INFO    |     Windowsize: 300
2025-11-05 10:28:06.294 | INFO    |     Min successful thresholds detected: 15
2025-11-05 10:28:06.294 | INFO    |     Max clusters: None
2025-11-05 10:28:06.294 | INFO    |     Use CUDA for clustering: True
2025-11-05 10:28:06.294 | INFO    |     Binsplitter: "-"
2025-11-05 10:28:06.456 | INFO    |       10 % of contigs clustered
2025-11-05 10:28:33.912 | INFO    |       20 % of contigs clustered
2025-11-05 10:28:40.737 | INFO    |       30 % of contigs clustered
2025-11-05 10:28:51.559 | INFO    |       40 % of contigs clustered
2025-11-05 10:29:09.755 | INFO    |       50 % of contigs clustered
2025-11-05 10:29:43.168 | INFO    |       60 % of contigs clustered
2025-11-05 10:30:36.292 | INFO    |       70 % of contigs clustered
2025-11-05 10:31:44.854 | INFO    |       80 % of contigs clustered
2025-11-05 10:32:54.945 | INFO    |       90 % of contigs clustered
2025-11-05 10:34:03.634 | INFO    |      100 % of contigs clustered
2025-11-05 10:35:11.088 | INFO    |     Clustered 1120675 contigs in 497327 split bins (480100 clusters)
2025-11-05 10:35:11.090 | INFO    |     Wrote cluster file(s) in 424.8 seconds.
2025-11-05 10:36:08.708 | INFO    |     Wrote clusters above 200000 bp to FASTA files in 57.62 seconds.
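
For reference, the behaviour expected from binsplitting with "-" can be sketched as follows. This is an illustration only, not Vamb's actual implementation: the function split_clusters, the naming of the split bins, and the example contig names are all hypothetical. It assumes every contig name has the form {SAMPLE}-{CONTIG} and that the sample name itself does not contain the separator.

```python
from collections import defaultdict


def split_clusters(clusters, separator):
    """Split each cluster into per-sample bins using the contig name prefix.

    `clusters` maps a cluster name to an iterable of contig names of the
    form {SAMPLE}{separator}{CONTIG}. The sample is taken as everything
    before the FIRST occurrence of the separator.
    """
    split = defaultdict(set)
    for cluster_name, contigs in clusters.items():
        for contig in contigs:
            sample, found, _rest = contig.partition(separator)
            if not found:
                raise ValueError(f"separator {separator!r} not found in {contig!r}")
            # Hypothetical naming scheme for the split bin: sample + separator + cluster.
            split[f"{sample}{separator}{cluster_name}"].add(contig)
    return dict(split)


# Made-up example data: two clusters, two samples.
example = {
    "1": ["S1-contig_1", "S1-contig_2", "S2-contig_9"],
    "2": ["S2-contig_3"],
}
result = split_clusters(example, "-")
```

With this input, cluster "1" is divided into one bin per sample and cluster "2" stays a single bin, so a non-empty set of clusters should always yield a non-empty split result.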

I only found out because running vamb recluster on the Vamb 5 output doesn't generate any bins, whether or not the binsplitter is specified.

As a side note, the same data was successfully run with the old Vamb 4.1.4.dev150+g8fa3280, with both the binsplitting and recluster working without issues despite the alternate binsplitter character:

(base) z3382651@katana1:.../Functional/binning $ ll -rht taxvamb_out/
total 865M
-rw-rw----+ 1 z3382651 ferrari 286M Nov  1 09:47 composition.npz
-rw-rw----+ 1 z3382651 ferrari  30M Nov  1 09:47 abundance.npz
-rw-rw----+ 1 z3382651 ferrari  36M Nov  2 10:14 predictor_model.pt
-rw-rw----+ 1 z3382651 ferrari 127M Nov  2 11:09 results_taxometer.tsv
-rw-rw----+ 1 z3382651 ferrari 216M Nov  5 10:43 vaevae_model.pt
-rw-rw----+ 1 z3382651 ferrari  87M Nov  5 10:47 vaevae_latent.npz
-rw-rw----+ 1 z3382651 ferrari  30M Nov  5 12:46 vaevae_clusters_metadata.tsv
-rw-rw----+ 1 z3382651 ferrari  25M Nov  5 12:46 vaevae_clusters_unsplit.tsv
-rw-rw----+ 1 z3382651 ferrari  30M Nov  5 12:46 vaevae_clusters_split.tsv
drwxrws---+ 2 z3382651 ferrari  40K Nov  5 12:47 bins
-rw-rw----+ 1 z3382651 ferrari 151K Nov  5 12:47 log.txt

The commands for both runs were the same, except that --cuda was removed from the Vamb 4 run, because the university HPC has a 12 h limit on the GPU queue and the run was exceeding it:

vamb bin taxvamb -p 8 --cuda -o - \
--outdir taxvamb5_out --fasta merged_contigs.fasta \
--abundance_tsv abundance.tsv \
--taxonomy mmseqs-easy/taxonomy_lca.taxconv.tsv \
--minfasta 200000

vamb recluster -p 8 --cuda -o - \
--outdir recluster5_out \
--fasta merged_contigs.fasta \
--abundance taxvamb5_out/abundance.npz \
--latent_path taxvamb5_out/vaevae_latent.npz \
--taxonomy taxvamb5_out/results_taxometer.tsv \
--clusters_path taxvamb5_out/vaevae_clusters_split.tsv \
--hmm_path /srv/scratch/ferrari/utils/vamb/vamb/marker.hmm \
--minfasta 200000
