Releases: soedinglab/MMseqs2
MMseqs2 Release 15-6f452
MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (`--prefilter-mode 1`). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
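A minimal sketch of a single-query search with the ungapped prefilter. The names `query.fasta`, `targetDB`, `result.m8` and `tmp` are placeholders, not part of the release notes. The snippet prints the command and only executes it when `mmseqs` and the query file are actually present.

```shell
# Placeholder names: query.fasta, targetDB, result.m8 and tmp are assumptions.
set -- mmseqs easy-search query.fasta targetDB result.m8 tmp --prefilter-mode 1
echo "$*"
# Execute only if mmseqs is installed and the query file exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e query.fasta ]; then
    "$@"
fi
```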
Breaking
- Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink
New Features and Enhancements
- Implement additional `prefilter` modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
- Added `createclusearchdb` and `mkrepseqdb` modules to build cluster-search databases, this was implemented for Foldseek, cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
- Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
- Rework `ungappedprefilter` to improve performance and expose additional parameters such as taxon filtering and db-load-mode to `ungappedprefilter` (8a89305, 800eb09, eb01b5b, 20d3afc)
- Added `gappedprefilter` module for Smith-Waterman prefiltering, similar to `ungappedprefilter` (df77d9e)
- Reworked `pairaln` for the ColabFold greedy taxonomy pairing mode (1514015)
- Implemented experimental module for A3M filtering (167bbd1, 499bb73)
- Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
- Precomputed indices without k-mers can be created with `--index-subset` (314c1f0, 8fe3bf9)
- Add `result2neff` module to extract Neff scores (4148e09) Thanks @neftlon
- Add `ppos` format-output to `convertalis` for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
- Speed-up FASTA parsing in `kseq.h` with memchr (98406dd) Thanks @valentynbez @kloetzl
Bugfixes
- Add min and max modes for `result2stats` (19dce03, 61e7734) Thanks @ClovisG
- Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
- Fix crash when some input file sizes are an exact multiple of 4096 in `convertalis` and `gff2db` (712f288) Thanks @RuoshiZhang
- Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
- Fix source number being limited to 16-bit (65k) (1d62fa0)
- `kseq` now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
- Fixed `unpackdb` to work without a `.lookup` file and added support for writing compressed files (92d8cc3, 570e3ed)
- `createindex --check-compatible` checks the k-mer threshold correctly now (bb0a1b3)
- Fixed `prefilter` excessively long result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
- Corrected handling of multiline checks in `createdb` (6b93884)
- Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
- Fixed logic in reciprocal-best-hit by removing `resAB_sort` (3bcbdba) Thanks @StephanieSKim
- Corrected handling of differently ordered parts of sequence databases in `concatdbs` (ea17d30)
- Fix `--single-step-clustering` misspelled in cluster warning (fa6c093) Thanks @valentynbez
Build and Compatibility Updates
- Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
- Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
- Updated regression testing to fix errors in MPI test (2113766)
Developer
MMseqs2 Release 14-7e284
This is a major release containing features implemented for ColabFold, Foldseek, MMseqs2 profile-profile (not published yet, and still in preview) and many bugfixes. Thanks a lot to the contributors who submitted bug fixes.
If you are using the Docker Hub based MMseqs2 containers, please switch to the new Github Container Registry based ones. The Docker Hub containers will not be maintained in the future.
Breaking
- Profile databases created by previous MMseqs2 releases won't work anymore with this release. Please recreate them from previous search results or MSAs with `result2profile` or `msa2profile`.
- Profile k-mer threshold parameters were fitted to the new pseudo-count parameters (`--pca`, `--pcb`). Previous `--k-score` parameters will have differing sensitivity. However, most users will have set `-s` instead, which was fitted to match as closely as possible.
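As a rough sketch, a profile database can be rebuilt from an existing search result with `result2profile`. The database names below (`queryDB`, `targetDB`, `resultDB`, `profileDB`) are placeholders, and the positional argument order is an assumption that should be checked against `mmseqs result2profile -h`.

```shell
# Placeholder database names; assumed order is query, target, result, output profile.
set -- mmseqs result2profile queryDB targetDB resultDB profileDB
echo "$*"
# Execute only if mmseqs is installed and the result database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e resultDB ]; then
    "$@"
fi
```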
Features
- `gff2db` now should actually work correctly after refactoring (488df86, thanks @RuoshiZhang)
- `result2msa` now supports reading from precomputed index
- Add `db2tar`: Create a tar file from a database
- Add parsable columnar tsv output to `databases` with `--tsv`
- Add taxonomic filtering during `prefilter` with `--taxon-list`
- Add `--comp-bias-corr-scale` to adjust the weight of the compositional bias correction
- Add `--mask-prob` parameter to adjust tantan's masking threshold
- Add context specific pseudo-counts to `result2profile`
- Add iterative profile-profile search workflow (thanks @haydenji0731)
- Add support for profile-profile scoring in striped Smith-Waterman algorithm (thanks @haydenji0731)
- Add support for gap-open/gap-close costs to striped Smith-Waterman algorithm (thanks @hgsommer)
- Add environment variable `MMSEQS_IGNORE_INDEX` to ignore an existing precomputed index
- `createsubdb` and view can now return results from identifiers in `.lookup` with `--id-mode 1`
- Change `compressdb` loop to `omp static` to keep order
- Improvements to nucleotide alignments and scoring (thanks @AnnSeidel)
Features built for ColabFold now available in MMseqs2
- Add `pairaln`: taxonomic pairing on sequences for MSA building (9a0df0d, 5e245d1, 3f8695e, 3e92abf, edb8223, e19df7c)
- Add A3M support to `result2msa` (`--msa-format-mode 5`)
- Add A3M support with alignment information to `result2msa` (`--msa-format-mode 6`)
- `result2profile` allows `--diff 0`
- Make taxonomy mapping mmap'able for (near) instant read-in
- Add workflow to create expandable profile (profile-profile) db from TSVs `tsv2exprofiledb`
- Enable `result2profile`/`filterresult` to read new expand alignment index
- Add support to filter MSAs in buckets `filterresult`, `result2profile`
- Add `--filter-min-enable` to enable filtering only above a minimum threshold of hits (c6d8ae0)
- Expand can filter in each target cluster before expanding (75af0c8, 85ce847)
Bugfixes
- `summarizeresult` was rejecting hits that match the coverage threshold exactly (#586, 67949d7)
- Don't use reserved filename characters in unpackdb (#467, c663497 thanks @cutecutecat)
- Fix typo (violoations -> violations) (#526, 74c3aa6, thanks @Benjamin-Lee)
- Fix potential endless loop in `rescorediagonal`
- Fix prefilter/alignment with 0-size query input #433
- Fix `unpackdb` parameter parsing issue
- Make sure `FILTER_RESULT` variable is always correctly set for exhaustive search (d4a3354)
- `tar2db` breaking with `--tar-include/exclude` (#561)
- Wrong database name printed for variadic input when creating a tmp directory
- `extractorfs` sometimes loading invalid start/stop codons on non-avx2 platforms
- Don't mask consensus sequences in profiles
- `result2msa` correctly prints X residues
- Allocate `CSProfile` only if it's going to be used (d873697)
- Taxonomy db paths are now correctly found if given a precomputed index (8ff26f2)
- Encode more strings internally as base64 if special characters are used (16b5774, d155586)
- Disable broken iterative profile searches in taxonomy (#432)
- Fixed a possible segmentation fault in `align` (thanks @rchikhi)
MMseqs2 databases
- Added VOGDB
- Updated dbCAN2 to V9 and removed `.aln` suffix from profile names
- Fix issues with ResFinder (#494, 56816b3), GTDB (#561, 678c82a), Kalamari (#531, ce7bf53), Uniref (#496, e85ceb9, thanks to @fanhuan)
Speedup
- Rework of `result2msa` to avoid allocating a lot of memory
- Improvement of speed for ungapped alignment in `prefilter`
- `TaxonomyExpression` is faster with a single tax identifier (8ff7279)
MMseqs2 subprojects
- MMseqs2-based subprojects can use `databases` too (5afd33c)
- Add `appenddbtoindex`: augment a precomputed index with other databases in sub-projects
- Allow subprojects to build their own precomputed indices (a506d67)
- Add support for external k-mer thresholds for the prefilter (fea8d20)
- Subprojects can define their own DbType validators
Developers
- Added CirrusCI to test FreeBSD and old compilers (a2e2129, 904d0c6, a09a704, 4f1996a, 482dedc, 16830a5)
- MMseqs2 Docker containers are now published in the Github Container Registry (eb203d3, 5185d3c, ba4e11f)
- Our microtar fork can write tar files again (dcd180b)
- Add URIs as allowed parameter inputs (3b9cf88)
- Additional s390x fixes (linclust might work now)
- Add support for new MultiParameter type
- Bundled SIMDe was updated (thanks @mr-c)
MMseqs2 Release 13-45111
New Taxonomy Workflow (new feature and breaking change)
We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:
The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the `--orf-filter 0` parameter to disable the new filter stage as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.
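For short reads, a call with the ORF filter disabled might look like the sketch below. The database names (`readsDB`, `refTaxDB`, `taxResult`) and the `tmp` directory are placeholders. The command is printed and only executed when `mmseqs` and the read database are available.

```shell
# Placeholder names: readsDB (short reads), refTaxDB (protein reference with taxonomy).
set -- mmseqs taxonomy readsDB refTaxDB taxResult tmp --orf-filter 0
echo "$*"
# Execute only if mmseqs is installed and the read database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e readsDB ]; then
    "$@"
fi
```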
As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (`--lca-mode 4`) as default. Approximate (see manuscript) 2bLCA is now again the default and we automatically switch to top-hit if given nucleotide-to-nucleotide input.
Breaking changes
- `--slice-search` is now called `--exhaustive-search`
- Unify `--compress`, `--summarize`, `--omit-consensus` (in `result2msa`) to `--msa-format-mode`
Features
- Add GTDB and CDD to databases downloader #410
- Add `nrtotaxmapping` to create taxonomy mapping from NR
- Add `unpackdb` to split a database into separate files #406
- Add `majoritylca` module for majority voting based taxonomy from alignment results
- Add `cpdb` and `lndb`
- Taxonomy information is stored in binary format (a single `db_taxonomy` file, instead of `db_{named,nodes,merged}.dmp,db_mapping`) to speed up read-in. Old format is still supported.
- `--exhaustive-search` is usable with ungapped alignments (`--alignment-mode 4`)
- Allow sequence/result database input in `taxonomyreport` #401/#408
- `msa2profile/result` can skip the first sequence with `--skip-query`
- `createtaxdb` can create a taxdb by mapping through `.source` in addition to `.lookup` (`--tax-mapping-mode 1`)
- `splitsequence` can create a sequence database with original headers
- `align` can return short cluster format if only identifiers are required `--alignment-output-mode`
- `tar2db` can be used multi-threaded if input allows (e.g. `.tar` containing `.gz` files)
- Encode species names in taxonomy blocklist to make sure we don't block random nodes (in e.g. GTDB)
- Split non-index parts over additional files in split index case to reduce peak memory use
- `proteinaln2nucl` can now compute scores and e-values
- `createdb` can create a sequence database from a database containing fasta files (e.g. created by `tar2db`)
- Add `MMSEQS_FORCE_MERGE` environment variable to force generating fully merged databases
- Improved many descriptions, warnings and error messages
Bugs fixed
- Fix `filterresult` off by one issue removing wrong sequences
- Fix `addtaxonomy` always crashing due to invalid check #355
- Reduce numbers of calls to `posix_memalign` to fix lock contention on macOS
- `extractorfs` doesn't flood warnings due to short sequences anymore
- `expand2profile` `--pca` is correctly set to `0`
- `msa2profile` always copies `.lookup/source` files instead of symlinking
- Clustering of clustering input would not work with set-cover or connected-component
- Short circuit `--cluster-reassign` if nothing can be reassigned
- Fix temporary files not getting removed in `linclust/cluster` with `--remove-tmp-files`
- Fix `kmermatcher` setting user k-mer pattern in auto k-mer selection and breaking
- Krona `taxonomyreport` was not working if no sequence was unclassified
- Make `Matcher::resultToBuffer` buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
- Fix multiple locations where `Util::checkAllocation` could never be called as it would have crashed before
- Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to `--sub-mat`)
- `taxonomyreport` and `addtaxonomy` parameters were not adjustable in `easy-taxonomy`
- E-value parameters are now correctly parsed as doubles instead of floats #379
- Add symlinks to `splitdb` #376
- Increase maximum number of open files in `DBReader`
- Include file size and modified date of inputs in `temporary` file hash calculation #372
- `--cov-mode 5` was not working #371
- Database downloader deals correctly with redirects now
- `result2profile` could crash if target database contained much longer sequences than query database
- Stop symlinking header database (and other ancillary files) in `filterresult`
Developer
- Add vector of predefined substitution matrices to add additional matrices in subprojects
- Don't create false `_has_{builtin,attribute}` macros (see simd-everywhere/simde#691 (comment))
- Add `USE_SYSTEM_ZSTD` cmake flag to use system provided zstd #411
- Replace texlive with tectonic for faster/prettier userguide
- Add more instructions to `simd.h`
- Add initial fixes to get MMseqs2 working on s390x (work in progress)
- Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
- Build ARM64/PPC64LE binaries by cross-compiling
- Add missing licenses and READMEs for vendored libraries #403
- Update ALP to 1.98
- Update xxhash to v0.8.0
MMseqs2 Release 12-113e3
Breaking changes
- Remove `--add-internal-id` parameter from `result2msa`
- `filterdb --shuffle` is now randomly instead of deterministically shuffled
- Taxonomy expressions in filtertax(seq)db interpret `,` as `||` now #320
- `convertalis` `pident` output field now correctly reports percentage (0-100) sequence identity instead of fraction (0.00-1.00), use `fident` to print the fraction instead
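To compare the two identity fields side by side, both can be requested in one `convertalis` call. This is a sketch with placeholder database names (`queryDB`, `targetDB`, `alnDB`) and a placeholder output file (`aln.tsv`). The field list uses the comma-separated `--format-output` syntax.

```shell
# Placeholder names: queryDB, targetDB, alnDB and aln.tsv are assumptions.
# pident is reported as a percentage (0-100), fident as a fraction (0.00-1.00).
set -- mmseqs convertalis queryDB targetDB alnDB aln.tsv --format-output query,target,fident,pident
echo "$*"
# Execute only if mmseqs is installed and the alignment database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e alnDB ]; then
    "$@"
fi
```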
Features
- Support nucleotide clustering in `cluster` and `easy-cluster`
- Support other architectures (SSE2/ARM64/POWER8/POWER9/etc) through SIMDe
- Linclust is much faster on systems with a lot of CPU cores
- Clustering update is faster, more stable and correctly deals with deleted sequences #272
- Add easy workflow for reciprocal best hit searches `easy-rbh`
- Add SILVA, Pfam-B, dbCAN2 to `databases`
- `databases` produces taxonomy information for NR
- Replace old greedy incremental clustering with new memory efficient version
- Add `result2dnamsa` module to create MSAs of nucleotide sequences
- Continued progress on profile-profile searching (`result2pp`, `expandaln`, `expand2profile`), stay tuned!
- Add multi-parameter support to overwrite sequence type specific parameters: e.g. `--gap-open "nucl:5,aa:11"`
- Add ORF information as output options to `convertalis` (`qOrfStart/qOrfEnd, dbOrfStart, dbOrfEnd`)
- Speed up sorting using ips4o
- Speed up masking through new version of tantan
- Speed up multi-threaded writing of clustering results
- Speed up reading of database indices and merging target split databases
- Add memory tracking to account for index size when computing available memory (`--split-memory-limit` should be more reliable when searching/clustering billions of sequences).
- Add `--search-type 4` (translated/translated search) to `createindex`
- Add `convertalis --format-mode 3` HTML output based on MMseqs2 app (app.mmseqs.com)
- Improve memory management in `result2msa` and `result2profile` modules
- Add `msa2result` module to create an alignment result db from MSAs
- Add `filterresult` to slim down result dbs with pairwise HHblits filtering #316
- Add `--kmers-per-sequence-scale` to `linsearch` to extract a k-mer fraction instead of a fixed count
- Add a random integer to `--local-tmp` path to avoid race conditions if multiple MMseqs2 instances run on the same machine
- Add `--max-seqs` to `ungappedprefilter`
- Add `--tax-lineage-mode 2` parameter to print numeric taxids
Bugs fixed
- `rbh` workflow was broken due to issues with `filterdb`
- Fix `-a` in RBH search to show alignments
- Fix PDB70 database creation in `databases`
- Fix aria2c download support
- Fix memory issues and MPI in kmermatcher
- Fix memory issues in `extractorfs` when using AVX2
- Fix `--cluster-reassign` to respect `--cov-mode`
- Set-cover supports up to 2^32 sequences (previously crashed with more than 2^31)
- Exit correctly if there is not enough disk space instead of crashing in the next module
- Fix `prefilter` order instability when searching very redundant databases
- Correctly parse keys from data files in `filterdb --filter-file`, this was causing instability in `linsearch`
- Allow overwriting string parameters with empty strings
- Fix ASAN issue in `extractorf` when using AVX2
- Microtar would try to seek backwards constantly resulting in horrible gzip read performance
- Avoid lookup writing to corrupt memory if an accession is too long
- Fix various inconsistencies and usability issues in `alignall`:
  - `--alignment-mode` inconsistent with `align` module
  - `--add-backtrace` did not do anything
- Fix restart of clusterings using reassignment `cluster --cluster-reassign`
- Fix createdb did not correctly read gz/bzip files with `--createdb-mode 1` #323
MMseqs2 Release 11-e1a1c
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new `databases` module helps to download and set up databases. We now have chat support at chat.mmseqs.com.
Known Issues
- `rbh` crashes due to invalid sorting mode (#290)
- Homebrew's macOS version does not use multiple cores (#289)
- `prefilter` results can be unstable between different runs for extremely redundant databases (#277)
- `linclust`/`cluster` can crash for very small input sets (#274)
Breaking Changes
- `kmermatcher` `--skip-n-repeat-kmer` parameter was replaced with `--ignore-multi-kmer`. It does not discard whole sequences anymore if a k-mer occurred too often, instead it skips the specific k-mers. Either mode is only used in Plass and not in Linclust
- `--lca-ranks` from `(easy-)taxonomy` and `lca` has to be delimited with semicolons (`;`) instead of colons (`:`)
- `--dont-shuffle` flag was renamed to `--shuffle true/false`
Features
- new `databases` workflow to list and download common databases. Supported databases:

Name | Type | Taxonomy | Url |
---|---|---|---|
UniRef100 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
UniRef90 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
UniRef50 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
UniProtKB | Aminoacid | yes | https://www.uniprot.org/help/uniprotkb |
UniProtKB/TrEMBL | Aminoacid | yes | https://www.uniprot.org/help/uniprotkb |
UniProtKB/Swiss-Prot | Aminoacid | yes | https://uniprot.org |
NR | Aminoacid | - | https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA |
NT | Nucleotide | - | https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA |
PDB | Aminoacid | - | https://www.rcsb.org |
PDB70 | Profile | - | https://github.com/soedinglab/hh-suite |
Pfam-A.full | Profile | - | https://pfam.xfam.org |
Pfam-A.seed | Profile | - | https://pfam.xfam.org |
eggNOG | Profile | - | http://eggnog5.embl.de |
Resfinder | Nucleotide | - | https://cge.cbs.dtu.dk/services/ResFinder |
Kalamari | Nucleotide | yes | https://github.com/lskatz/Kalamari |
- `(easy-)search --slice-search` is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by `--disk-space-limit`
- `createdb` and the various `easy-` workflows learned to read query input from `STDIN`
- `taxonomyreport` learned to display the summarized taxonomy result with Krona
- new `filtertaxseqdb` module for filtering sequence DBs with taxonomy information according to provided taxa
- `--taxon-list` parameter understands expressions. E.g. get all bacterial and human sequences `--taxon-list "2||9606"`
- `easy-search` and `convertalis` can now output taxonomic information using `--format-output`:
  - `taxid`: Taxonomic identifier
  - `taxname`: Taxon name
  - `taxlineage`: Taxonomic lineage
- speed up in `(easy-)cluster/linclust` by improving k-mer extraction
- MMseqs2 consistently creates `.source` and `.lookup` files to record which input file a sequence came from. E.g. after `mmseqs createdb input1.fa input2.fa seqDB`, each sequence in seqDB can tell if it came from `input1.fa` or `input2.fa`
- `createdb` learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
- `align` and `rescorediagonal` learned to align circular sequences
- `align` exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
- `reverseseq` learned to reverse profiles
- `filterdb` can filter rows with value within given percentage of first row
- new `aggregatetax` module to assign a taxonomic label to contigs according to the fragments matched on the contig
- Adjusting `--max-seq-len` is not required anymore, MMseqs2 automatically increases the length now.
- MMseqs2 on Cygwin/Windows uses `nedmalloc` as its memory allocator now and does not massively slow down due to lock contention
- new `tar2db` module to efficiently transform content of `tar` archives to MMseqs2 databases
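The `--taxon-list` expression from the feature list above can be tried with the new `filtertaxseqdb` module. This is a sketch only: `seqTaxDB` (a sequence database with taxonomy information) and `filteredDB` are placeholder names. The expression is quoted so the shell does not interpret `||` as an operator.

```shell
# Placeholder names: seqTaxDB (input with taxonomy) and filteredDB (output).
# 2 is the NCBI taxon id for Bacteria, 9606 for human.
set -- mmseqs filtertaxseqdb seqTaxDB filteredDB --taxon-list "2||9606"
echo "$*"
# Execute only if mmseqs is installed and the input database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e seqTaxDB ]; then
    "$@"
fi
```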
Bug fixes
- `createindex` would create corrupted indices for profile target databases
- `rbh` workflow would create its result DB at an unexpected (wrong) location
- `(easy)-taxonomy --lca-mode 3` (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
- `lca` (and `(easy)-taxonomy`) add empty columns for unclassified sequences to be valid TSVs
- `kmermatcher` uses xxhash for hashing now (faster)
- `kmermatcher` avoids a crash if the machine does not have enough memory to process the data at once (affects linclust/cluster)
- `kmermatcher` correctly deals with sequences longer than MAX_SHRT now
- `kmermatcher` fixed various edge cases (e.g. alignment of 1-char sequences)
- `kmermatcher` hash-shift would be ignored
- `offsetalignment` could produce wrong results in the minus-strand
- `clust` now correctly and consistently handles alignment DB input
- `clusthash` better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
- `(easy-)cluster` `--single-step-clustering` could cluster unrelated sequences due to hash collisions
- `prefilter --diag-score 0` respects `--min-ungapped-score`
- `createseqfiledb` could print empty sequence lines
- `taxonomyreport` could crash if no sequence was unclassified
- `result2flat` could crash with long sequence input
- `result2msa, result2profile, msa2profile` backport filtering fix from HHblits
- `align` could produce bad alignments if all sequence lengths in query DB were a lot shorter than in target DB
- `splitsequence` fix issues with splitsequence if combined with compressed
- `result2profile` fix Filter2 bug of HH-suite in MMseqs2
- `apply` would crash due to reading wrong entry lengths
- `filterdb --filter-expression` was not thread safe and could corrupt results
- `filterdb` `--extract-lines` and `--trim-to-one-column` are compatible with each other
Developers
- Internal representation of sequences changed from 4-byte per character to 1-byte per character
- Compilation under AppleClang + libomp works now (see `util/build_osx.sh`)
- Tools inheriting from MMseqs2 can now add their own citations
- MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed `symlinkat` call; relevant for bioconda)
MMseqs2 Release 10-6d92c
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.
Known Issues
- High sensitivity searches (higher than -s 6) with precomputed indices should fail. Pass `--db-load-mode 3` as a workaround to the MMseqs2 call.
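A sketch of the workaround for a high-sensitivity search. The sensitivity value `-s 7` and the database names are placeholders chosen for illustration, and `targetDB` is assumed to already have a precomputed index.

```shell
# Placeholder names; -s 7 is only an example of a sensitivity above 6.
set -- mmseqs search queryDB targetDB resultDB tmp -s 7 --db-load-mode 3
echo "$*"
# Execute only if mmseqs is installed and the query database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e queryDB ]; then
    "$@"
fi
```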
Breaking Changes
- Default taxonomy mode is assigning the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode can be used with `--lca-mode 3` or the non-approximated 2bLCA with `--lca-mode 2`
- MMseqs2 will refuse to compile on compilers without OpenMP support (Use `-DREQUIRE_OPENMP=0` to force a single-threaded no OpenMP build)
- The confusingly named (and probably non-functional) `--global-alignment` parameter is gone
- File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):
SIMD | Linux | macOS | Windows |
---|---|---|---|
SSE4.1 | mmseqs-linux-sse41.tar.gz | mmseqs-osx-sse41.tar.gz | mmseqs-win64.zip |
AVX2 | mmseqs-linux-avx2.tar.gz | mmseqs-osx-avx2.tar.gz | - |
Known Issues
- MMseqs2 on Windows seems to not scale well on multiple threads
- MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)
Features
- `createindex` can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
- `easy-search`, `easy-taxonomy`, `easy-linclust` and `easy-cluster` workflows can take any number of query FASTA or FASTQ files
- MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
- `kmermatcher` reports the diagonal with the most k-mer matches
- `kmermatcher` scales the number of k-mers with sequence length (`--kmer-per-seq-scale`)
- `rescorediagonal` got two new rescore modes, one for global alignment scoring and one for scoring a quasi global alignment fulfilling a local window criterion
- Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
- Parameters taking byte values support syntax with a SI suffix (e.g., `--split-memory-limit 64G`)
- Nucleotide substitution matrices should be user definable
- Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
- `cluster` workflow learned a reassignment mode `--cluster-reassign`. This mode corrects errors that occurred because of cascaded clustering
- `extractorfs` can directly translate a nucleotide ORF to an amino acid sequence
- `result2stats` can write TSV files
- `createsubdb` supports softlinks instead of always hard copying the whole file to disk
- reduced harddisk space usage for all cascaded clusterings
- `easy-taxonomy` reports the top hit alignment as a separate output file with the suffix `tophit_aln`
- `createindex` checks if an index needs to be recomputed were improved
Bug fixes
- MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
- `proteinaln2nucl` could return wrong coordinates
- `apply` would deadlock when running with multiple threads
- MPI searches are way more reliable, there were various issues around merging the separate results. MPI logic of split and merge is also integrated into the regression tests suite
- `prefilter` splits nucleotide searches if not enough memory is available
- `kmermatcher` could corrupt memory
- `rescorediagonal` could produce wrong sequence identities when aligning mixed-case sequences
- macOS builds were not actually static (still dynamically link libsystem however)
- `lca` module could corrupt memory and crash
- `createdb` does not crash on systems with only 4GB of RAM anymore
- AVX2 and SSE4.1 builds could produce slightly different results
- `summarizeresults` does not crash on empty alignments results anymore
- fix wrong tophit_report in `easy-taxonomy`
- Precompiled Windows builds were broken
- Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer
Developers
- Tools using MMseqs2 as a framework do not need to export MMseqs2 modules again anymore
- MMseqs2 uses Azure Pipelines for all platforms to run our regression tests suite and provide precompiled binaries
- MMseqs2 runs under ASan without any issues. We fixed various small memory leaks
- The regression suite is directly linked through a submodule. It can be used by running:

      git submodule update --init
      ./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR
MMseqs2 Release 9-d36de
At a glance: Improved taxonomy, add colors to user output, improve computation progress bar, small speed ups and many bug fixes
Features
- Add support for Kraken style taxonomy reports. Thanks to Florian Breitwieser
- New easy-taxonomy workflow
- New progress bar to reduce output
- Colored errors and warnings
Bugs
- Fix alignment problem in SSW library mengyao/Complete-Striped-Smith-Waterman-Library#61
- Fix iterative profile search
- Fix protein nucleotide index issues
- Fix cluster update workflow
- Fix critical multi threading bug in taxonomy workflow
MMseqs2 Release 8-fac81
At a glance: Faster searches and clustering through improved IO and better seeding. More search modes like tblastx, reciprocal best hit and linsearch. New output format SAM. Support for compressed databases to reduce hard disk and memory requirements.
Known Issues
- Iterative search only works up to 2 iterations
Breaking Changes
- MMseqs2 now saves a lot on IO by not merging result datafiles. There is still a single `.index` file, but the corresponding data files are split into multiple parts (as many as threads were used previously)
- MMseqs2 now uses the VTML80 [1,2] substitution matrix to speed up the prefiltering (changeable by `--seed-sub-mat`), the final alignment is still computed with the Blosum62 (still changeable by `--sub-mat`)
- All databases have now a `.dbtype` file
- MMseqs2 Docker image is now based on Debian instead of Alpine
- Changed Orf header format to be more space efficient. The new format is now `orignIdentifer startPos(-/+)len flag`
- `prefilter` returns ungapped-alignment scores instead of e-values
- `createindex` the file extension is now `.idx` instead of the previous `.[s]k[6,7]` format
Features
- Support for tblastx-style nucl-nucl translated searches `mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 2`
- Support for nucleotide searches `mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 3`
- `convertalis` has learned to return SAM formatted output (preview)
- Database can be compressed by applying zstd on each entry (`--compressed 1`)
  - Also added `compress` and `decompress` modules
- Added `rbh` workflow for reciprocal best hit searches
- `linclust` can now cluster nucleotide sequences on both forward and reverse strand
- Added `linsearch`, a lightning fast search for proteins and nucleotide sequences (preview; easy workflow variant `easy-linsearch` also added)
- `createlinindex` computes an index for `linsearch`
- `taxonomy` uses `--orf-start-mode 1` to annotate more sequences
- Added approx. 2bLCA to speed up computation, this is now the new default. The old mode can be turned on by `--lca-mode 2`
- `createdb` recognizes sequences containing Uracil as DNA sequences
- `createdb` is now faster through speeding up its shuffle operations
- `view` module to view single entry in an MMseqs2 database
- `align` module has learned `--min-aln-len` parameter to filter by minimal alignment length
- Alignment modules (`rescorediagonal`, `align`) can align longer sequences now (not limited to 2^15 length)
- Input sequences can now be softmasked (lower letter masking) instead of only hard masking (replacing with X) `--mask-lower-case`. The masking only applies to the prefilter stages `kmermatcher` or `prefilter` and can be combined with `--mask`
- `filterdb` has learned `--filter-expression` parameter and mode that allows filtering by simple mathematical expressions
- `alignbykmer` can be used for nucleotide searches
- MMseqs2 did-you-mean functionality gives better suggestions
- MMseqs2 does not repeat the whole parameter list for each submodule call anymore
Bugs
- Default parameters of `map` workflow are now set correctly
- Some modules were using the wrong coverage parameter
- Sliced profile search was losing high E-value hits
- Sliced profile search is now stable
- Profile-Sequence alignment E-values were slightly too high
- `result2msa` was crashing with profiles on the target side
- `result2msa` should not crash with `--allow-deletion` anymore
- Some parameters were never visible (with or without `-h`)
- Various issues with MPI were resolved
Developers
- Continuous integration enforces no compile warnings now
- Continuous integration now tries to build AArch64 builds with Docker and Qemu
- We added a first draft of our developer guide to the wiki
References
[1] Müller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7:761–76. doi: 10.1089/10665270050514918.
[2] Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13. doi: 10.1093/oxfordjournals.molbev.a003985.
MMseqs2 Release 7-4e23d
Changes since release 6-f5a1c
New features
- Simplified taxonomy. We added the tool `createtaxdb` to create a taxonomically annotated database. It is possible to filter result databases based on taxonomy with `filtertaxdb` and to append taxonomy information to result databases with `addtaxonomy`
- index (`createindex`) support for translated target database searches
- add nucleotide search (experimental)
- support NEON CPU architecture (experimental)
- improve performance of prefilter if L2 is greater than 256K
- `easy-search` automatically computes backtrace if requested by `--format-output`
- Create search-2m workflow, similar to 2bLCA but without the LCA computation
- We add a database preload mode. Database preload mode 0: auto, 1: `fread`, 2: `mmap`, 3: `mmap+touch`. The processing time per query with `fread` is 15% faster but the read in is slower. `mmap` is used for the MMseqs2 webserver, it enables instant searches if the database is already in memory. `mmap+touch` uses mmap and touches every page.
- We add a new tool `touchdb`, it loads the database in memory. This can be useful for `--db-load-mode 2`.
- add local hard disks support `--local-tmp` for MPI runs. This reduces pressure from the NFS
- Introduce `sortresult` tool to sort an unordered sequence db (e.g. from `mergeresult`)
- `prefilter` supports now indexes with k-mer ranges > 2^31
- `convertkb` can read multiple files
- speed up `mmap` memory touch function
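The preload workflow described in the feature list above could be combined as in the following sketch: `touchdb` first loads the database into memory, then the search maps it with `--db-load-mode 2`. The database names are placeholders, and the commands are only executed when `mmseqs` and the target database are present.

```shell
# Placeholder names: queryDB, targetDB, resultDB and tmp are assumptions.
DB=targetDB
# Step 1: load the target database into memory.
set -- mmseqs touchdb "$DB"
echo "$*"
if command -v mmseqs >/dev/null 2>&1 && [ -e "$DB" ]; then
    "$@"
fi
# Step 2: search with mmap so the already loaded pages are reused.
set -- mmseqs search queryDB "$DB" resultDB tmp --db-load-mode 2
echo "$*"
if command -v mmseqs >/dev/null 2>&1 && [ -e "$DB" ]; then
    "$@"
fi
```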
breaking changes
- new index version. Recomputation of old indexes is needed
- `--format-output` is now comma separated
- changed taxonomy database format, old taxonomy databases are not supported anymore
default parameter change
- `extractorfs` default is now `--orf-start-mode 1`. This is important for translated searches in organisms with introns.
Bug fixes
- Fix wrong alignment positions for translated searches
- Fix off by one error in `extractalignedregion`
- Fix bug in NcbiTaxonomy tool
- Fix e-value threshold if -e < --e-profile
Developer
- Update to newest ALP version
MMseqs2 Release 6-f5a1c
Changes since release 5-9375b
New features
- Support user defined output format in `convertalis`.
- Add parameters for gap open and gap extension costs.
- Improve substitution matrix support. Letters of alphabet can now be chosen freely.
- Add a few PAM matrices to the data folder. Choose them with the `--sub-mat` parameter.
- Support IUPAC codes in translated search.
- Add parameter to define a spaced k-mer pattern.
- Add a new module `ungappedprefilter`. It computes an optimal ungapped score using a vectorized algorithm.
Bug fixes
- Fix `easy-linclust` parameter parsing issue.
- Fix coverage filtering in `align` when the parameter `--realign` is set.
- Fix sequence identity computation in `rescorediagonal --rescore-mode 2`.
- Fix `apply` MPI support.
- Fix representative sequence output bug in `result2repseq`.
- Fix possible MPI issues in modules creating symlinks.
- Fix slightly wrong E-value computed in `alignall` module.
Known Issues
- `easy-search` output has only one column. Workaround: Add parameter `--format-output ""`.