Description
Expected Behavior
clusthash uses all of the given threads, and result2flat produces a complete FASTA file (i.e. its output does not end in `%`)
Current Behavior
clusthash seems to use only 1 of the given threads, and result2flat eventually crashes with a segmentation fault
Steps to Reproduce (for bugs)
Run from the terminal in the same directory as a contigs FASTA file (DNA) named `cated_sk100.fna`:
```
THREADS=10
mkdir resultsDB scafDB
mmseqs createdb cated_sk100.fna scafDB/cated_sk100
mmseqs clusthash scafDB/cated_sk100 resultsDB/resultDB --min-seq-id 0.99 --threads $THREADS
mmseqs clust scafDB/cated_sk100 resultsDB/resultDB clusterDB --threads $THREADS
mmseqs result2repseq scafDB/cated_sk100 clusterDB DB_clu_rep
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
```
When "Compute 1 unique hashes." is printed, there are 10 resultDB data files and 10 resultDB.index files in resultsDB/; however, only one of them (resultDB.index.7) grows over time (and is the only one larger than 0 bytes). Meanwhile only one thread appears to be utilized (around 8% of the total capacity of the 10 given threads, i.e. roughly a single thread).
When clusthash finishes there is a single resultDB.index file and 10 resultDB data files, 8 of them empty, with resultDB.index.7 and resultDB.index both the same size. After this, the pipeline crashes in the last command:
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
With the message:
```
result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
MMseqs Version: 48a037a
Use fasta header true
Verbosity 3
[1] 18252 segmentation fault (core dumped) mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep
```
MMseqs Output (for bugs)
Which output should I upload?
Context
I'm trying to remove redundancy by collapsing sequences that are either highly similar (99% identity) or contained within longer sequences from other FASTA entries in the file. This FASTA file is < 1 GB, but I first tried to run the process on a > 80 GB file on a remote compute node and was concerned when I saw the job using only a small fraction of the allocated resources.
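As a sanity check while the MMseqs2 run is being debugged, a minimal Python sketch of the simplest part of this task, collapsing byte-identical sequences. This is only a hypothetical workaround of my own and does not replicate what clusthash does: it catches neither 99%-similar sequences nor sequences contained within longer ones.

```python
# Collapse exact duplicate sequences in a FASTA stream, keeping the first
# occurrence of each. Hypothetical helper, not part of MMseqs2.

def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def dedupe_fasta(records):
    """Drop records whose sequence was already seen (exact matches only)."""
    seen = set()
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq
```

For example, feeding it two entries with identical sequences keeps only the first; near-identical or contained sequences still require the MMseqs2 workflow above.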
Not part of this issue, but related: I also tried to do the same thing with a large protein file, but I get "invalid fasta entry" errors (possibly because of the "*" characters marking STOPs left by the ORF predictor; that would not occur in the nucleic acid example above).
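If the "*" stop marks are indeed the cause (an assumption on my side, not confirmed), stripping them from sequence lines before createdb could be sketched like this:

```python
# Hypothetical pre-processing step: remove '*' stop-codon marks from protein
# sequence lines, leaving FASTA headers untouched. Assumes '*' is what
# triggers the "invalid fasta entry" errors, which is unconfirmed.

def strip_stops(lines):
    for line in lines:
        if line.startswith(">"):
            yield line            # header line: keep as-is
        else:
            yield line.replace("*", "")
```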
Your Environment
- Git commit used:
  - Personal machine: 48a037a
  - Server (PBS compute node): 2a8c5f0
  (similar behaviour on both)
- Which MMseqs version was used: statically compiled
- Server specifications:
  - Server (2a8c5f0): CPU: Intel(R) Xeon(R) Platinum 8168; Memory: 366 GB
  - Personal machine (48a037a): CPU: Intel Core i7-8700, 6 cores, 64-bit, 12.0 MiB L2 cache; Memory: 15.33 GB
- Operating system and version:
  - Personal machine: Linux Mint 19.2 "Tina", kernel 4.15.0-72-generic x86_64
  - Server: Linux 3.10.0-693.el7.x86_64
Thanks for developing and maintaining this totally amazing tool!