clusthash doesn't seem to use all given threads, and result2flat breaks with "segmentation fault"  #261

@UriNeri

Description

Expected Behavior

clusthash uses all given threads, and result2flat produces a complete FASTA file (not one ending in "%").
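As an aside, the trailing "%" is most likely not a character in the file at all: zsh prints a reverse-video "%" when a command's output does not end with a newline. A quick sketch to check which case applies (using the output filename from the reproduction steps below):

```shell
# The "%" zsh shows usually means the last line has no trailing newline.
# "scafs_reps.fasta" is the output file from the steps below.
f=scafs_reps.fasta
if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
  echo "missing trailing newline"
else
  echo "ends with newline (or file is empty)"
fi
```

Command substitution strips trailing newlines, so the `tail -c 1` result is empty exactly when the file's last byte is a newline.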

Current Behavior

clusthash appears to use only one of the given threads, and result2flat eventually crashes with a "segmentation fault".

Steps to Reproduce (for bugs)

Ran from the terminal in the same directory as a contigs FASTA file (DNA) named "cated_sk100.fna":

```
THREADS=10
mkdir resultsDB scafDB
mmseqs createdb cated_sk100.fna scafDB/cated_sk100
mmseqs clusthash scafDB/cated_sk100 resultsDB/resultDB --min-seq-id 0.99 --threads $THREADS
mmseqs clust scafDB/cated_sk100 resultsDB/resultDB clusterDB --threads $THREADS
mmseqs result2repseq scafDB/cated_sk100 clusterDB DB_clu_rep
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
```

When "Compute 1 unique hashes." is printed, there are 10 resultDB data files and 10 resultDB.index files in resultsDB/; however, only one of them (resultDB.index.7) keeps growing over time (and is > 0 in size). Meanwhile, only one thread appears to be utilized (around 8% of the total across the 10 threads given, i.e. roughly one core). When clusthash finishes, there is one resultDB.index file and 10 resultDB data files, 8 of them zero-sized, with resultDB.index.7 and resultDB.index both the same size. After this, the process breaks in the last command:
```
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
```

With the message:

```
result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header

MMseqs Version: 48a037a
Use fasta header true
Verbosity 3

[1] 18252 segmentation fault (core dumped) mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep
```

MMseqs Output (for bugs)

Which output should I upload?

Context

I'm trying to remove redundancy by collapsing sequences that are either highly similar (99% identity) or contained within longer sequences from other entries in the file. This FASTA file is < 1 GB, but I first tried to run the same process on a > 80 GB file on a remote compute node and became concerned when I saw the job was using only a small fraction of the resources it was given.
Not part of this issue but related: I also tried to do the same thing with a large protein file, but I get invalid FASTA entry errors (maybe because of the "*" characters marking STOPs left by the ORF predictor, but that wouldn't happen in the nucleic-acid example above).
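For the protein side note: one workaround sketch, assuming (unconfirmed) that the "*" stop-codon marks are what triggers the invalid-entry errors, is to strip them from sequence lines before createdb. The filenames here are placeholders:

```shell
# Sketch, assuming "*" stop marks cause the invalid-entry errors:
# remove "*" from sequence lines only (lines not starting with ">"),
# leaving FASTA headers untouched. Filenames are hypothetical.
sed '/^>/! s/\*//g' proteins.faa > proteins_nostop.faa
```

The `/^>/!` address negation restricts the substitution to non-header lines, so any "*" appearing in a description line would be preserved.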

Your Environment

  • Git commit used:
    I tried on my personal machine and a compute node (PBS), similar behaviour in both.
    Personal machine MMseqs2 Version: 48a037a.
    Server MMseqs2 Version: 2a8c5f0.
  • Which MMseqs build was used: statically compiled
  • Server specifications:
    Server: (2a8c5f0)
    CPU: Intel(R) Xeon(R) Platinum 8168
    Memory: 366 GB
    Personal machine: (48a037a)
    CPU: Intel Core i7-8700 (6-core, 64-bit), L2 cache: 12.0 MiB
    Memory: 15.33 GB
  • Operating system and version:
    Personal machine: Linux Mint 19.2 Tina, kernel 4.15.0-72-generic x86_64;
    Server: Linux 3.10.0-693.el7.x86_64

Thanks for developing and maintaining this totally amazing tool!
