Confusing/misformed clustered-fasta output #258

Closed
tomblaze opened this issue Dec 15, 2019 · 3 comments

@tomblaze

I hope I am not making a simple mistake, but the output is not what I expect. Thanks for any help.

Expected Behavior

FASTA-formatted cluster results follow the format described in the documentation.

Current Behavior

FASTA-formatted cluster results contain headers with no sequence after them, and the grouping doesn't seem to correspond to a correct clustering.

Examples:

>Pip8wV
>Pip8wV
MQKNLIQRYIPLSDKNWFCTGGPARFYTEPETTQEFVKSLEFANLHNLELFILGKGANILISDNGFDGLVIKPNLQQISHQDYDNDHALVHAQAGACIKDLITYCLKNNMTGPEEFSGIPGTIGGSVYINIHYFNFFLSDFLYNAQVIHKKTGDIFTVTKNWFNFGYDKSKLFEKEYFLVQATFKLKKITPIKAAYATGRSIEIIRHRNARYPNSGTCGSFFRNFLDSEVTLNIAGTEKKMIYVGYYLDKLGVKGNLNVGGASVSHKHANMIVNTNNATSTDIINLARQMQKLVQENFGITPQPECQLIGFKKYPLL*
>PJB17I
>E7HKRc
MFKRSVCVFCGSRKGRNPAHEAAATDLGTALAQNDMRLVYGAGDVGLMGAVARAAQAAGGETFGVIPDHLVKWEVGKTDLTRYIVTETMHERKKVMFMNCDAVVVLPGGAGSLDEYFEVLTWRQIGLHEKPVFLLNTEGYWTKLIHLIDHVIDEGFADASLRDYTTVVDTVQDLMDGLGATGR*

and

>S2LnDK
>S2LnDK
MNNKSENQSSQGDARRGMLIGMALGMGIGAVIGLVLGDISSGFIIGMGIGAAVGYWRFRDTPIGMRYPPHLVRQLIISGLLYLVTLIGVIYLLEYNLNRSIEIILVLLPALPGIWFLVSLGRAIASLDELQRFIQLEGIAIGFGITAMAALTYGLLGMAGVPQVSWMYVPVVMVFGWFLGKMWTLWRYR*
>E6GnE9
>PYRss2
MKKAMATSKSQFKTMDEYIATFPENVRDVLEKLRRTIMESAPKAEETISYGMPAFKLNGKSLVYFAAWKNHIGFYPAGPSAIKAFKKELSPFRQAKGTIQFPLDKPIPIDLVKKIVKFRVEENESKKK*

In the original input file, however, we see:

>PJB17I
MQKRSICVFCGASDGIDSAYGAAADTLGRLIASHKMRLVYGAGDVGLMGRVARAAQKDGAATFGVIPKHLVNWEVGKTDLTTYIITENMHERKKVMFMNSDAIALLPGGAGSLDEFFEVLTWAQLGLHDKPIVLININSYWGPLLALLDHVIAQGFAKENIKDFFQIAVTPEEAMSKLA*

... and elsewhere in file ...

>E6GnE9
MNNKTGIQPIDEYIAAFPEEVQEKLISIYRLIAGDVPEATVKISYGMPTFVLNGNLVHFAAFKNHIGFYPAPSGIQAFQEELAGYKTSKGAIQFPLDKPVPYELISKITAFRVAENVKNN*

It seems that PJB17I should cluster with E7HKRc (they're ~50% identical) but then I would expect the output to be:

>PJB17I
>PJB17I
MQKRSICVFCGASDGIDSAYGAA...
>E7HKRc
MFKRSVCVFCGSRKGRNPAHE...

This is based on the latest documentation.
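To make the expectation above checkable, here is a minimal Python sketch. It assumes the layout shown in the expected output: a cluster starts with the representative's header written twice in a row, and every other header is followed by its sequence. The function names `read_records` and `group_clusters` are hypothetical helpers, not part of MMseqs2, and the sequences in the demo are shortened placeholders.

```python
def read_records(lines):
    """Turn FASTA-like lines into (header, sequence_or_None) pairs, in file order."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], None])
        elif records:
            # Sequence text may span several lines; join them.
            records[-1][1] = (records[-1][1] or "") + line
    return [(header, seq) for header, seq in records]


def group_clusters(records):
    """Group records into clusters and collect headers that break the assumed layout.

    A header without a sequence is treated as a cluster marker only if the next
    header is identical; otherwise it is reported as a problem.
    """
    clusters, problems = [], []
    for i, (header, seq) in enumerate(records):
        if seq is None:
            next_header = records[i + 1][0] if i + 1 < len(records) else None
            if next_header == header:
                clusters.append((header, []))
            else:
                problems.append(header)
        elif clusters:
            clusters[-1][1].append((header, seq))
        else:
            problems.append(header)
    return clusters, problems


# Shortened stand-ins for the records quoted in this report.
observed = [">Pip8wV", ">Pip8wV", "MQKNLIQ", ">PJB17I", ">E7HKRc", "MFKRSVC"]
expected = [">PJB17I", ">PJB17I", "MQKRSIC", ">E7HKRc", "MFKRSVC"]

observed_clusters, observed_problems = group_clusters(read_records(observed))
expected_clusters, expected_problems = group_clusters(read_records(expected))
```

Under this assumption the observed layout flags `PJB17I` as a header with no sequence, while the expected layout parses into a single cluster with no problems.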

Steps to Reproduce (for bugs)

Just the commands (see full log below):

user@user:~$ ./mmseqs/bin/mmseqs createdb experiments/small_sample.fa experiments/small.mm.db
user@user:~$ ./mmseqs/bin/mmseqs linclust experiments/small.mm.db experiments/small50.mm.db experiments/tmp/ --min-seq-id 0.5
user@user:~$ ./mmseqs/bin/mmseqs createseqfiledb experiments/small.mm.db experiments/small50.mm.db experiments/small50_seq
user@user:~$ ./mmseqs/bin/mmseqs result2flat experiments/small.mm.db experiments/small.mm.db experiments/small50_seq experiments/small50_seq.fa
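For repeated runs, the four commands above can be wrapped in a small driver. This is only a sketch: the subcommands, arguments and paths are copied from the commands in this report, while `build_commands` and `run_all` are hypothetical helper names. By default it only prints the commands; pass `dry_run=False` to actually execute them.

```python
import subprocess


def build_commands(mmseqs="./mmseqs/bin/mmseqs", workdir="experiments"):
    """Return the four mmseqs invocations from this report as argv lists."""
    db = f"{workdir}/small.mm.db"
    clu = f"{workdir}/small50.mm.db"
    seqdb = f"{workdir}/small50_seq"
    return [
        [mmseqs, "createdb", f"{workdir}/small_sample.fa", db],
        [mmseqs, "linclust", db, clu, f"{workdir}/tmp/", "--min-seq-id", "0.5"],
        [mmseqs, "createseqfiledb", db, clu, seqdb],
        # The sequence DB is passed twice here, exactly as in the report.
        [mmseqs, "result2flat", db, db, seqdb, f"{seqdb}.fa"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_commands())
```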

MMseqs Output (for bugs)

Full session output:

user@user:~$ ./mmseqs/bin/mmseqs createdb experiments/small_sample.fa experiments/small.mm.db
Converting sequences
[24948] 0s 72ms
Time for merging to small.mm.db_h: 0h 0m 0s 37ms
Time for merging to small.mm.db: 0h 0m 0s 16ms
Database type: Aminoacid
Time for merging to small.mm.db.lookup: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 204ms
user@user:~$ ./mmseqs/bin/mmseqs linclust experiments/small.mm.db experiments/small50.mm.db experiments/tmp/ --min-seq-id 0.5
linclust experiments/small.mm.db experiments/small50.mm.db experiments/tmp/ --min-seq-id 0.5 

MMseqs Version:                     	02be0c4ea6183fce9cf45521a8c145d10f3928c1
Cluster mode                        	0
Max depth connected component       	1000
Similarity type                     	2
Threads                             	12
Compressed                          	0
Verbosity                           	3
Substitution matrix                 	nucl:nucleotide.out,aa:blosum62.out
Add backtrace                       	false
Alignment mode                      	2
Allow wrapped scoring               	false
E-value threshold                   	0.001
Seq. id. threshold                  	0.5
Min. alignment length               	0
Seq. id. mode                       	0
Alternative alignments              	0
Coverage threshold                  	0.8
Coverage mode                       	0
Max sequence length                 	65535
Compositional bias                  	1
Realign hits                        	false
Max reject                          	2147483647
Max accept                          	2147483647
Include identical seq. id.          	false
Preload mode                        	0
Pseudo count a                      	1
Pseudo count b                      	1.5
Score bias                          	0
Gap open cost                       	11
Gap extension cost                  	1
Alphabet size                       	21
K-mers per sequence                 	21
scale k-mers per sequence           	0
Adjust k-mer length                 	false
Mask residues                       	0
Mask lower case residues            	0
K-mer size                          	0
Shift hash                          	5
Split memory limit                  	0
Include only extendable             	false
Skip repeating k-mers               	false
Rescore mode                        	0
Remove hits by seq. id. and coverage	false
Sort results                        	0
Remove temporary files              	false
Force restart with latest tmp       	false
MPI runner                          	

Set cluster mode SET COVER.
kmermatcher experiments/small.mm.db experiments/tmp//6404161245249296443/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 13 --min-seq-id 0.5 --kmer-per-seq 21 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 12 --compressed 0 -v 3 

kmermatcher experiments/small.mm.db experiments/tmp//6404161245249296443/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 13 --min-seq-id 0.5 --kmer-per-seq 21 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 12 --compressed 0 -v 3 

Database size: 25000 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X) 

Estimated memory consumption 8 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 25.00K 0s 120ms    
Sort kmer 0h 0m 0s 37ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 219ms
rescorediagonal experiments/small.mm.db experiments/small.mm.db experiments/tmp//6404161245249296443/pref experiments/tmp//6404161245249296443/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 12 --compressed 0 -v 3 

[=================================================================] 100.00% 25.00K 0s 11ms     
Time for merging to pref_rescore1: 0h 0m 0s 8ms
Time for processing: 0h 0m 0s 48ms
clust experiments/small.mm.db experiments/tmp//6404161245249296443/pref_rescore1 experiments/tmp//6404161245249296443/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 12 --compressed 0 -v 3 

Clustering mode: Set Cover
[=================================================================] 100.00% 25.00K 0s 9ms       
Sort entries
Find missing connections
Found 597 new connections.
Reconstruct initial order
[=================================================================] 100.00% 25.00K 0s 17ms     
Add missing connections
[=================================================================] 100.00% 25.00K 0s 1ms      

Time for read in: 0h 0m 0s 42ms
Total time: 0h 0m 0s 47ms

Size of the sequence database: 25000
Size of the alignment database: 25000
Number of clusters: 24598

Writing results 0h 0m 0s 4ms
Time for merging to pre_clust: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 87ms
createsubdb experiments/tmp//6404161245249296443/order_redundancy experiments/small.mm.db experiments/tmp//6404161245249296443/input_step_redundancy -v 3 --subdb-mode 1 

Time for merging to input_step_redundancy: 0h 0m 0s 2ms
Time for processing: 0h 0m 0s 6ms
createsubdb experiments/tmp//6404161245249296443/order_redundancy experiments/tmp//6404161245249296443/pref experiments/tmp//6404161245249296443/pref_filter1 -v 3 --subdb-mode 1 

Time for merging to pref_filter1: 0h 0m 0s 2ms
Time for processing: 0h 0m 0s 6ms
filterdb experiments/tmp//6404161245249296443/pref_filter1 experiments/tmp//6404161245249296443/pref_filter2 --filter-file experiments/tmp//6404161245249296443/order_redundancy 

Filtering using file(s)
[=================================================================] 100.00% 24.60K 0s 11ms     
Time for merging to pref_filter2: 0h 0m 0s 21ms
Time for processing: 0h 0m 0s 56ms
rescorediagonal experiments/tmp//6404161245249296443/input_step_redundancy experiments/tmp//6404161245249296443/input_step_redundancy experiments/tmp//6404161245249296443/pref_filter2 experiments/tmp//6404161245249296443/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 12 --compressed 0 -v 3 

[=================================================================] 100.00% 24.60K 0s 11ms     
Time for merging to pref_rescore2: 0h 0m 0s 19ms
Time for processing: 0h 0m 0s 59ms
align experiments/tmp//6404161245249296443/input_step_redundancy experiments/tmp//6404161245249296443/input_step_redundancy experiments/tmp//6404161245249296443/pref_rescore2 experiments/tmp//6404161245249296443/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 12 --compressed 0 -v 3 

Compute score and coverage
Query database size: 24598 type: Aminoacid
Target database size: 24598 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 24.60K 0s 159ms    
Time for merging to aln: 0h 0m 0s 22ms

25322 alignments calculated.
24830 sequence pairs passed the thresholds (0.980570 of overall calculated).
1.009432 hits per query sequence.
Time for processing: 0h 0m 0s 211ms
clust experiments/tmp//6404161245249296443/input_step_redundancy experiments/tmp//6404161245249296443/aln experiments/tmp//6404161245249296443/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 12 --compressed 0 -v 3 

Clustering mode: Set Cover
[=================================================================] 100.00% 24.60K 0s 10ms     
Sort entries
Find missing connections
Found 232 new connections.
Reconstruct initial order
[=================================================================] 100.00% 24.60K 0s 18ms     
Add missing connections
[=================================================================] 100.00% 24.60K 0s 1ms      

Time for read in: 0h 0m 0s 47ms
Total time: 0h 0m 0s 56ms

Size of the sequence database: 24598
Size of the alignment database: 24598
Number of clusters: 24385

Writing results 0h 0m 0s 4ms
Time for merging to clust: 0h 0m 0s 19ms
Time for processing: 0h 0m 0s 101ms
mergeclusters experiments/small.mm.db experiments/small50.mm.db experiments/tmp//6404161245249296443/pre_clust experiments/tmp//6404161245249296443/clust --threads 12 --compressed 0 -v 3 

Clustering step 1
[=================================================================] 100.00% 24.60K 0s 9ms      
Clustering step 2
[=================================================================] 100.00% 24.39K 0s 43ms     
Write merged clustering
[=================================================================] 100.00% 25.00K 0s 65ms     
Time for merging to small50.mm.db: 0h 0m 0s 22ms
Time for processing: 0h 0m 0s 112ms
user@user:~$ ./mmseqs/bin/mmseqs createseqfiledb experiments/small.mm.db experiments/small50.mm.db experiments/small50_seq
createseqfiledb experiments/small.mm.db experiments/small50.mm.db experiments/small50_seq 

MMseqs Version:	02be0c4ea6183fce9cf45521a8c145d10f3928c1
Min sequences	1
Max sequences	2147483647
HH format    	false
Threads      	12
Compressed   	0
Verbosity    	3

Time for merging to small50_seq: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 75ms
user@user:~$ ./mmseqs/bin/mmseqs result2flat experiments/small.mm.db experiments/small.mm.db experiments/small50_seq experiments/small50_seq.fa
result2flat experiments/small.mm.db experiments/small.mm.db experiments/small50_seq experiments/small50_seq.fa 

MMseqs Version: 	02be0c4ea6183fce9cf45521a8c145d10f3928c1
Use fasta header	false
Verbosity       	3

Time for processing: 0h 0m 0s 107ms

Context

Input file (small_sample.fa): https://gist.github.com/tomblaze/818f17864d3afeb5475b054a66169571

Output file (small50_seq.fa): https://gist.github.com/tomblaze/e7840473fcfb4992ec709327c358d679
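With both files at hand, the two can be cross-checked. The sketch below assumes that every input sequence should appear in the cluster output with its sequence unchanged; it lists identifiers that never receive a sequence in the output and identifiers whose sequence differs from the input. `read_fasta` and `compare` are hypothetical helper names, and the demo data are shortened placeholders rather than the real file contents.

```python
def read_fasta(lines):
    """Yield (identifier, sequence_or_None) for each '>' header, in file order."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks) or None
            header, chunks = line[1:].split()[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks) or None


def compare(input_lines, output_lines):
    """Return (lost, altered) identifier lists for the cluster output."""
    original = {h: s for h, s in read_fasta(input_lines) if s}
    seen = {}
    altered = []
    for header, seq in read_fasta(output_lines):
        if seq is None:
            continue  # header without sequence, e.g. a cluster marker
        seen[header] = seq
        if original.get(header) != seq:
            altered.append(header)
    lost = sorted(set(original) - set(seen))
    return lost, altered


# Shortened stand-ins; real use would read small_sample.fa and small50_seq.fa.
demo_input = [">PJB17I", "MQKRSIC", ">E7HKRc", "MFKRSVC", ">Pip8wV", "MQKNLIQ"]
demo_output = [">Pip8wV", ">Pip8wV", "MQKNLIQ", ">PJB17I", ">E7HKRc", "MFKRSVC"]
lost, altered = compare(demo_input, demo_output)
```

For the real files, something like `compare(open("small_sample.fa"), open("small50_seq.fa"))` should work once both gists are downloaded locally under those names.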

Your Environment

Ubuntu 16.04 (this output from my computer, but bug also occurred on an AWS run)

MMseqs Version: 02be0c4

Downloaded from https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz

martin-steinegger added a commit that referenced this issue Dec 15, 2019
@martin-steinegger
Member

Thank you for this detailed bug report. It really helped to find the bug. I introduced it recently.
The issue should be fixed by be1a982.

If you would like a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address by email.

RuoshiZhang added a commit to soedinglab/spacepharer that referenced this issue May 12, 2020
46c843895 Update combine pval agg-mode 3
67d610136 Disable fancy progress bars on travis to reduce output
203a21736 Updated two more tests to use tighter ROC thresholds
a9052f449 Update regression with tighter bounds for ROC tests
c62736a6d Correctly parse keys from data files in filterdb --filter-file This was causing a linsearch instability
fe007cb4e Use MultiParam for gapOpen, gapExtend costs
3513001d3 Add easy-rbh workflow
d0d3032e9 Fix RBH search if using -a to show alignments
ce1a43bf1 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
ea24e4934 Fix issues with abs. path if using aria2c
5228745f5 Improve --alignment-mode parameter description and make it a non expert parameter
fffa9b10e Fix various inconsistencies and usability issues with alignall: * alignall alignment-mode did not correspond to align alignment-mode * add-backtrace did not do anything, has to be specified now if backtrace is needed * Did return a alignment db type even though it is incompatible with that type, uses generic for now * various parameters were passed but unused   - zdrop and scorebias are used now (however see below)   - realign, alt ali, max accept/reject, wrapped are now gone
290668474 Fix wrong warning
813d81f29 Update regression
264d78117 Switch greedy clustering algorithm back to old idea
c09f6574e Improve nucleotide clustering workflow
38a737708 Set k-mers in linclust to 0 for the nucleotide clustering
7df6e3f75 Replace characters that can not be reversed by N in extract frames
e9678f625 Update regression
f886e868f Add nucleotide support to cluster (workflow nucleotide_clustering), clust module will infer identity automatically if missing, Improve low. mem. greedy incremental algorithm, Update regression
5f8735872 Add kmers-per-sequence-scale to linsearch
0310eb607 Change --kmer-per-seq-scale to a multi parameter, add error if cluster is called with a nucleotide sequence
e258bc8d8 Fix #299 PDB70 database creation was not working
7095f37e4 Add support reverse complemente in rescorediagonal --rescore-mode 0 and 1
61ca48883 Fix result2dnamsa
70d014e41 Add search-type 4 to Search
462f24cbb Add module result2dnamsa
5670d990e Fix regression error
e4451d591 Add result direction parameter to kmersearch
12c499dcd Fix reverse sequences issues in linclust and linsearch
44499c3ce Update filterdb regression test
807b4a56a Fix issue soedinglab/MMseqs2#290. Filterdb checked for mode == true but mode was 2.
24479bc27 Fix Docker
a578f52a7 Fix char signedness on PPC
a0d64a989 Update regression
a07a266f9 Working on PPC64LE support
09734177c Remove remaining _mm_shuffle_epi32
cdef78a69 Merge pull request #285 from hgsommer/misc_small
283c8d03f Replace goto end in ssw
6bfc50281 Fix c/p mistake in convertalignments
e61da3447 Fix spelling of 'length'
9a63760fa Replace nested ternary operator
4349b5c6e Avoid repeatedly checking for profile db types
c170a11f5 Call MsaFilter::shuffleSequences() from MsaFilter::filter()
ef49ba220 Return value from MsaFilter::filter()
d155dc36c Replace int by bool literals for bool variable
ec6722adc Align headings with column in PSSMCalculator::printProfile()
548a9bd68 Avoid forward declaration of ScoreMatrix
d0fbe471f Do some cleanup in StripedSmithWaterman.cpp
91d1aeddc Replace check for zero-sized containers by empty()
e47b8eed9 Remove superfluous parameter from ssw_init()
250b1221d Simplify return statements
4fe1116ae Remove counting zero scores in Sequence::mapProfile()
4303728b5 Replace multiplication by zero
1bd602420 Remove increment by zero
e4d4389f2 Move check for exit condition in front of allocations
556d26d1a Clean up function signatures in MultipleAlignment
3863af9ac Move include back to header to restore build
e1208493a Remove unused TmpResult score field
1fd4db8f2 Die if DBReader cannot reopen files (e.g. no more file handles left)
1e21b87ba Purge sequenceLookup early since its recreate in split databases
40854ddcd Prefiltering and CacheFriendlyOperations refactoring
2433e086b WASM work in progress
14014cd0e Fix prefilter overflow instability
e0f971848 Add conda forge to conda install instructions
aa175d636 Fix off by one in kmermatcher soedinglab/MMseqs2#274 (comment)
d1607bc8a Remove LINE_MAX
eca2155d7 Clear string buffer instead of reassigning in swapresults
0f4645edd Fix wrong reverse marking in linsearch reported by UBSAN
5b612a327 Missing mpi binaries for travis regression
83d22417a Next try for ARM compiler flags
7ad122f0a Missed a few variables
ac7914bea Do not require a cmake variable to build ARM
0dcfaadbb Update regression to fix broken samtools call on ARM
29927b4c4 More NEON fixes, we assume signed chars, ARM uses unsigned by default
7760220ff Next try to get the ARM regression to work
cc6d0d52b Add hack to not break travis log size limit
5408c3d10 Try to get NEON to compile
83192cabd Fix search workflow parameters printed twice
f6f001c8c Fix new clang-10 warnings and further travis fixes
259e64341 llvm-10 alias is not whitelisted in travis yet
b1249fd54 Fix errors in Travis YAML from previous commit
18486d4c5 Update travis - use native aarch64 for neon - use xenial - shorten script
98c37f3c3 shortend MultiParam usage, improved line breaks in usage
c9be07f1a Add gcc-9 to travis
2e5fb309a Fix travis clang build
d5865c894 Remove MultiParam g++-9 warning
73679835b Rework target split merging
ca5869397 Fix RESSIZE issue in slice search if sequences are used
491900b99 Improve usage text of cluster/linclust
0166850a2 Remove old greedy incremental clustering code and just run the memory efficient version instead.
15163e64c Fix Verbosity in workflows
aa78af463 Fix issue soedinglab/MMseqs2#274
7846dfce3 fixed clang template error
e1206371c extended MultiParam class, replaced ScoreMatrixFile type by MultiParam<char*>
b88b54756 rewrite alphabetSize as multi parameter
ecb4e35d4 started template class MultiParam to store sequence type specific values
e1a1c1226 changed dbtype comparision in AlignmentSymmetry
2a829aef7 Replace symlinkat call with getcwd/chdir/symlink/chdir to fix Conda build using macOS 10.9 SDK
28e83e8d5 Add OpenMP include to DBReader
fb00aa0c3 Fix realloc issue while IndexTable creation of profiles
504e5021f Take max. seq. len of query and target db in prefilter and alignment
16e235214 Fix bug if seq. len > max seq. length in Alignment
80d0187de Fix asan issue
751f5c19f Make ZDROP an expert parameter, change description text
1b6edd0d4 Rework x detection (SIMD)
9677254ab Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1ac1e6866 Fix max seq issues in prefilter
cb737033c Reset download strategy to not use aria2c for the NCBI download
c95f3ee0e fixed ksw2 test
72b95c0ce Error if we cannot download from NCBI
1d0aad50b Fix databases not piecing togehter all kalamari accessions
516723d53 Merge branch 'master' of https://github.com/soedinglab/MMseqs2
d81b6cca5 added zdrop parameter to control banded nucleotide alignment
e2e39a971 Add Kalamari Contaminants database
c0c538ea3 Various fixes in databases script
08cc95b3a Fix createtaxdb redownloading when taxdump already exists
018eb3498 Remove a bit whitespace in front of each parameter in usage message
8aa7513de add aggregatetax example, fix typos
8bcd7c740 Fix typo
8e581b762 Rework usage texts
7dc25764a Hide most parameters from createindex
2baa609e8 Add examples to many modules
00a7d7696 fixed bugs for long or wrapped nucleotide sequences
a4bdcb478 eggNOG profiles should not depend on the deleted MSAs
4c7830954 Fix eggNOG database construction
f7a5599c8 Cleanup not needed files immediately in databases workflow
3ed3690d4 Fix downloads always restarting in databases workflow
4cfac9a8a Fix aria warning with more than 16 connections
e0a00e10d Revert "Use SW instead of BandedNucAln if we don't have diagonals"
7ac966b2e Fix result2msa could fail if it was writing compressed output
95729ac7c Fix wrong output DB type written in alignall
f899e7c7a Use SW instead of BandedNucAln if we don't have diagonals
c08d9fa8e Allow parameter descriptions to span multiple lines
57868498e MMseqs2 is not limited to proteins, update README to reflect that
11818b0a2 Cleanup hiding parameters in workflows
c481cea60 Remove some useless includes
2f64aeeb8 Fix databases timestamp appending instead of overwriting
ae9e9e329 Add eggNOG setup procedure to databases
31c8e5d50 Shorten two short parameter descriptions
2f49d3e3e Read header from lookup in msa2profile if available
1356869b0 add option to reverese profile dbs
ac3482e80 More issues with zlib and tar2db
aaafafe43 Fix tar2db keys
c751d9e2f More tar2db fixes
a9c93014c Fix variadic input to tar2db
51a761305 Add tar2db module to convert content of any tar to a DB
96f9a91e5 Use nedmalloc on Windows/Cygwin
73f5c2a2d Add databases workflow to README
5a7ac9e54 make align output consistent
c5ebe5297 fixed setcover cluster mode (by fixing bug in similarity reading for short aln results e.g. hamming distance aln)
481696b5f Fix databases output
c6b4a57a8 Beginning cleaning up parameter descriptions
a9552a177 Show default value of bool parameters
af89c4677 Add a proposed example text structure
9c17f4eba Rework module description texts, better categories, shorten all descriptions, prepare to replace long descriptions with examples
00ff199e8 Add Resfinder DB
f1011ecb4 Fix krona again marked as vendored
02001ab03 missing mode resulted in different top1
4375463bc Header db should not have to be a unsplit db
edccbf33f Actually fix extractorfs lookup creation
041e8e558 Improve README
a8f2c7bad Remove correct workflow script in createtaxdb.sh
26c8202a9 print createdb cmd line again
df02bae34 Refactor createseqfiledb, remove stringstream
2523ebe1a do not write null byte
af847a724 Fix clang warning from DBConcat
ef1ec596f extend dbconcat to handle auxillary files
528bd2134 not needed
dec1b9215 Silence warning in GCC 4.8 casting function to void*
2d44c886d Fix extractorfs not being able to create lookup
ffe66afac Replace isnumber with isdigit. Add more tests to TestTaxExpr
fbe09867e Rework Taxon Expr parsing
f58329ef5 Add constructor to define custom functions to ExpressionParser
b6ef07281 Initialize expressionparser per thread, was not thread safe
f966bfa62 Fix reallocation issue in BandedAlignment
bbd3c2bb7 Add +1 to realloc in BandedNucleotideAligner but not to length
6b6e82ae6 Add +1 to realloc in mapSequence
75e2c8ec4 Fix off by one issues in realloc in rescorediagonal and BandedNucleotideAligner
afd14c8c2 First step to get rid of maxSeqLen
13ca612db Fix allocation issue in kermatcher if sequences are longer than > 2^16
62de5ba93 Fix off by one in computation for splits in kmermatcher
35e95d180 Change int_sequence to char (big change)
ecf82f2f4 Revert "Temporarily disable soft split mode for createdb in easy workflows"
d19219dd4 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1a0d898ec Fix softlink issue in createdb soedinglab/MMseqs2#265
13e0fe466 Temporarily disable soft split mode for createdb in easy workflows
4487b6e14 Fix view module to work with softlinked createdb dbs
c1e9eb0e3 Fix MPI issue if only one server is used
e781c3fe5 fix MPI compile error
9bcff2844 Fix Filter2 bug of HH-suite in MMseqs2 soedinglab/hh-suite#182
01db79d33 Fix some bugs in splitting handling
d9a887453 Fix memory splitting issues in kmermatcher, kmerindexdb
37880f083 Fix MPI in kmermatcher and indexdb
bee93123f Update regression
03a89ff1c Merge branch 'master' of https://github.com/soedinglab/mmseqs2
6ca967362 Update the way how k-mers are extracted in kmermatcher. Extraction should be now ~3 times faster.
f1388309d Introducing databases workflow to automatically setup and download common databases
d78fdbb06 Add progress to convertmsa
18acba224 Do not recreate _mapping file if it already exists in createtaxdb
63a373f5a Skip validations steps correctly if a input db is neither INPUT nor OUTPUT
d95caa1a7 Allow modules with zero parameters
9f8aff948 Allow modules to handle -h or --help themselves
cf5691f92 Typo
8ebc9d16b fixed access mode
31895414d Clarify parameter help in createdb
f644744a8 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
c287719d9 Remove check for profiles for splice serach. It should also work with sequence databases.
c75fe9acf regression submodule w filtertaxseqdb
7587a872f Add one more missing check in kmermatcher
8d4e9f4fc Remove +1 from size in initKmerPositionMemory
aca141e95 Fix shellcheck error in splicesearch
8bdff50e1 Move +1 from initKmerPositionMemory outside
f12821e35 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
d74b76ca5 Avoid overflow in kmermatcher if split is needed
fd90ff2c3 Move compiled data resources into subfolders
2fd9f25d2 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
b439ce831 Make the slice search applicable to other databases types, not just profiles
589a2e276 Fix apply crashing on empty entries
82542a6ac Merge branch 'master' of https://github.com/soedinglab/mmseqs2
c0acdd8f3 Fix memory leak in createsubdb.
5129a956d Validate taxonomic ranks and make input/output formats consistent
53bb55b38 Fix issues in hash function soedinglab/MMseqs2#252
764c4a3e7 Fix lca message
c013a6929 Fix LCA output message
a1206690d Change db validator from result2stats
714f5b4fb Replace mmaped input file with std c io in createsubdb
6e43e9413 Add remove .source file to rmdb
3e58bb85b Fix result2flat soedinglab/MMseqs2#261
3e27833db Revert easycluster.sh back to result2flat. Reason is that createsubdb can not handle soft linked sequence databases (input.0 -> input.fas)
33354680f Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1e92fb504 Replace result2repseq and result2flat with createsubdb and convert2fasta
55bcdd303 single step clustering could potential cluster unrelated sequences due to hash collisions
fdd0646b1 Fix clusthash issues with parallelization and nucl input
e62a1c717 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1336b7ad2 Add MSA to allDb and allDbAndFlat
48a037a2e Update Prefiltering.cpp
a1adbf52d Fix warning: Remove useless copy constructor from Matcher::result_t
d3ca42657 Remove truncatedCounter variable in QueryMatcher
4647525ec Show full help text if "Error in argument " occurs
4149ae457 Remove annoying message in prefilter (truncated result). Move it to the statistics section.
d5aab5b86 Update regression
1f1e049e6 Fix output of unclassified hits in convertalis
83ff5c601 Fix permission issues for tmp directory
cce6e6714 add support to output taxon in easy-search when using an indexed database
f200bdd62 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
6f28a29ae Fix seg. fault if all sequences could be classified
473d60580 Update batches
b52668f6e Add chat icon
af54c8e8e Update README.md
7eb6a0b70 Makde addtaxonomy more resilient against invalid taxonomy mappings
3482b0e91 Merge pull request #260 from RuoshiZhang/master
36f49f5b5 Fix issue in memory computation for split
bcb97d63f Update README.md
abcd97de7 write same number of fields even if no hit
38e102181 Update regression to hopefully fix windows failure
f41511465 Fix spelling error
1fd24924e Add a search-type 4 for trans-trans search returning a nucl backtrace in offsetalignment
31f6d7ac3 add aggragatetax to assign set tax by majority vote
b6e8ee239 allow more dbtypes in swapdb
c9d02ef21 add option to view rank index
49db7258e typo fix
9c32930f3 Merge branch 'master' of github.com:soedinglab/MMseqs2
17b5494fe Fix auto detection of dbtype in createdb
8831df81d Merge branch 'master' of github.com:soedinglab/MMseqs2
be1a9822c Fix createseqfiledb soedinglab/MMseqs2#258
02be0c4ea Fix summarizeresult to support reverse position in alignment
7ef586276 added filtertaxseqdb
00f2fd2b8 added mode for all but index
127db8c6d minor tidying for filtertaxdb
8144e7653 Merge branch 'master' of github.com:soedinglab/MMseqs2
48f77fa7d Fix ASan issue in filterdb
d722d5724 Fix warning in filterdb
4a4e6ea15 Update regression test for filterdb
31a7dc124 filterdb --join-db ignores lines it cannot join instead of crash
6c6faa96d filterdb's --extract-lines works together with --trim-to-one-column
12bee8142 filterdb can filter by rows with value within percentage #249
5c919ab95 Allow double parameters separately from floats in parsing
f9be8a88d Remove broken filterdb paths
1dc04f5e1 Refactoring of filterdb
90e3a9aaf Fix bug for enforced dbtypes in createdb
a4cee78db New regression to check stdin support
17ec97c78 Add stdin support to easy workflows
76c9e7c36 Fix compiler warnings in KSeqWrapper
0cc45536b Overwrite dbtype correctly in createdb
c0045182b Add stdin to createdb
02a88e438 use https instead of ftp for downloading taxdb data
a33bd27f4 offsetalignments now correctly returns a nucleotide backtrace if needed
456e1b5ab include VTML40 in binary for easier access
775de3850 Add missed target .source file for reading in convertalis
c08c071b2 Overload patterncompiler isMatch for pos of match
ba6aa8d12 avoid appending extra tabs besthitperset

git-subtree-dir: lib/mmseqs
git-subtree-split: 46c8438958edccd8fd09640eb174e2449529e4df
@vinisalazar

Hi @martin-steinegger, I've been getting this exact same error, not sure what could be happening. I am using the latest version of mmseqs2 (14-7e284). What other information could I provide to help debug?

Thank you,
Vini

@milot-mirdita
Member

Can you please make a new issue, with the output of MMseqs2 and excerpts of your result files?
