Releases: steineggerlab/foldseek

Foldseek Release 10-941cd33

19 Jan 13:29

Foldseek Release 10

Foldseek introduces GPU support for monomer and multimer search, improved profile search and ProstT5 integration, new databases, several performance improvements and bug fixes.

Major Features

  • GPU Support for Search and Multimers
    Both easy-search and easy-multimersearch now support GPU-accelerated searches. Use --gpu 1 to enable GPU mode and CUDA_VISIBLE_DEVICES to control which and how many GPUs are used (#391, #411). GPU-enabled binaries require glibc >= 2.17, NVIDIA driver >= 525.60.13, and a Turing or newer GPU. On a single 4090 GPU, searches are 4x faster, and on eight GPUs up to 37x faster, than on a 128-core CPU using the k-mer prefilter. For more details, see our preprint.
  • Improved Structural Profile Search
    Alignment results can now be converted into position-specific scoring profiles with result2profile, enabling the creation of structural protein family representations. (#411)
  • Enhanced ProstT5 Integration
    Added multi-GPU and Apple Metal support, improved handling of large input sequences through splitting, and switched the backend to llama.cpp for better compatibility and performance. (#391)
  • New Databases
    Introduced BFVD as a new virus-specific Foldseek database. (#344).
    For more details about the database, check out the BFVD paper.
  • Improved Multimer Search Workflows
    Optimized multimer workflows for improved speed and reliability, with contributions by @Woosub-Kim. For more details, see our Foldseek-Multimer preprint.
  • Clustering Multimers (First Version)
    Introduced experimental multimer clustering (easy-multimercluster) by @sooyoung-cha and @rachelse, supporting clustering by interface LDDT, chain TM-score, and complex coverage. See filtercomplex for more details.
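
As a rough illustration of the GPU support item above, the following Python sketch assembles an easy-search call with --gpu 1 and restricts the visible devices through CUDA_VISIBLE_DEVICES. The helper name build_gpu_search, the file names, and the device list are hypothetical; the positional argument order follows the easy-search example given for release 9. The function only builds the command and environment and does not run Foldseek. Depending on the setup, the target database may need additional preparation for GPU search, which is not shown here.

```python
import os


def build_gpu_search(query, target_db, result, tmp_dir, devices):
    """Return (argv, env) for a GPU-enabled easy-search call.

    `devices` is a list of GPU indices exposed to the process via
    CUDA_VISIBLE_DEVICES; all paths are placeholders (assumptions).
    """
    argv = ["foldseek", "easy-search", query, target_db, result, tmp_dir,
            "--gpu", "1"]
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return argv, env


argv, env = build_gpu_search("query.pdb", "targetDB", "result.m8", "tmp", [0, 1])
print(" ".join(argv))
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

The returned pair can be handed to subprocess.run(argv, env=env, check=True) on a machine where a GPU-enabled Foldseek binary is installed.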

Breaking Changes

  • Results may differ because masking of letters repeated six or more times (--mask-n-repeat) is now enabled by default. Disable this option to reproduce previous results.
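
To make the idea of repeat masking concrete, here is a minimal Python sketch that masks every run of six or more identical letters. It only illustrates the concept and is not Foldseek's implementation: the function name mask_repeats, the mask character X, and the min_run parameter are assumptions made for this example.

```python
import re


def mask_repeats(seq, min_run=6, mask_char="X"):
    """Replace every run of `min_run` or more identical letters with mask_char."""
    # (.)\1{n-1,} matches one letter followed by at least n-1 copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_repeats("ACDDDDDDEF"))  # the run of six D's is masked
print(mask_repeats("ACDDDDDEF"))   # a run of five D's is left untouched
```

Masked positions keep their length, so coordinates in downstream alignments stay comparable between masked and unmasked runs.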

Other Features

  • Improved Compatibility with MMseqs2 Modules: createsubdb, makepaddedseqdb, and result2profile now work seamlessly with Foldseek databases.
  • Taxonomy Reports in easy-search: Added options to generate taxonomy reports directly within easy-search. (#389)
  • Residue Mapping Rework: Residue mapping has been reworked to combine most gemmi amino acids with previous Foldseek amino acids. (#387)

Bug Fixes

  • Fix order-dependent --format-output issue with qtmaln,ttmaln,lddt,u,t (b40729c)
  • Fix clustering of structures without Cα information (0d8d966)

Full Changelog

View the full changelog: 9-427df8a...10-941cd33.

9-427df8a

09 May 17:45
427df8a

At a glance: Foldseek release 9 features the fully benchmarked Foldseek-multimer search and structure-based sequence search using ProstT5. Both Foldseek-multimer and structure-based sequence search are also available in the Foldseek webserver.

Major Features

  • Foldseek-multimer: Fully benchmarked and integrated into this release with the easy-multimersearch and multimer workflows (Thanks @Woosub-Kim). Check out our preprint explaining the algorithm.
    Read more on how to get started in our README.
  • Search requires less memory: We optimized the memory consumption of Foldseek. It now requires significantly less memory (f629bbe)
  • Structure-based sequence search: Predict protein 3Di directly from amino acid sequences without the need for existing protein structures. This is roughly 400-4000x faster than predicting full protein structures with ColabFold. This feature uses the ProstT5 protein language model and runs by default on CPU:
foldseek databases ProstT5 weights tmp
foldseek databases PDB pdb tmp
foldseek easy-search QUERY.fasta pdb result.m8 tmp --prostt5-model weights

Fast inference using GPU/CUDA is also supported. Compile from source with cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=1 -DCUDAToolkit_ROOT=Path-To-Cuda-Toolkit and call with createdb/easy-search --prostt5-model weights --gpu 1.
(Thanks @Victor-Mihaila).

Breaking changes

  • Remove .cif/.pdb from filenames and remove _MODEL_ from identifiers in .lookup #261 (Thanks @ChaSooyoung)
  • Removed --tar-include and --tar-exclude from createdb as they were unused (15c0516)
  • Not-breaking: workflows using easy-complexsearch and complexsearch will continue to work. These are hidden modules mapping to easy-multimersearch and multimersearch internally. However, the internals have had major changes since the last release.

Other features

  • convert2pdb can output separate PDB files (346c1dd)
  • createdb learned to read a large number of input files from a .tsv file (e1394aa)
  • Force input format with createdb --input-format (852434a)
  • Compute exact TM-score with --exact-tmscore (493cefe)
  • Added CATH50 database (6893dcc)
  • Update HTML output (not fully supported for multimer yet; c7e4a37, 361c22a, 1bc8d2e; Thanks @gamcil)
  • compressca learned new input and output modes (8e68e86, 5d2724d, 284bc81)

Bug Fixes

  • Fix broken symlinks with databases PDB download (9ef6d18, fa6c530).
  • Fix AFDB Proteome and SwissProt download check (fa6c530, Thanks @TigerWindWood)
  • Fix AF3 mmCIF files crashing createdb
  • Fix convert2pdb creating broken PDB files for large structures (b6dac8a)
  • Remove ligand and alt res within chain #198 (Thanks @NatureGeorge)
  • Skip residues without C-alpha #214 (75a50f7)
  • structurerescorediagonal did not properly respect --tmscore-threshold (#205; 886021d)
  • Fallback alignment to Smith-Waterman when block-aligner produces invalid alignments (54c271c)

Developers

  • Foldseek now includes the Candle ML framework and has a further expanded Rust codebase.
  • Foldseek can be inherited from to create subprojects (e00a3dc, 7c2c08e, 9a1a087, 00d2033)

8-ef4e960

11 Sep 04:45
946841f

At a glance: Added support for cluster searches and protein-complex searches (alpha version, feedback welcome) as well as improved HTML output.

Features

  • Implement easy-complex-search to find similar complex structures in a database
  • Implement a cluster search --cluster-search 1, which speeds up searches through redundant databases. It first searches only the representatives and then expands the final alignment to all cluster members. Two downloadable DBs support this search: PDB and the Alphafold/UniProt50.
  • createclusearchdb builds a searchable cluster database (b4d7ec5)
  • convertalis HTML output updated to match search.foldseek.com output (96be67c)
  • Introduced Alphafold/UniProt50-minimal and updated cluster file downloads for regular Alphafold/UniProt50 to support cluster searches (93ad1d4, 2e9da41, daad5ab)
  • We added two modules, scorecomplex and createcomplexreport, to compute a TM-score between protein complexes and to summarize the findings (938b591, a6c75cb)
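
The two-step logic of the cluster search described above (search the representatives, then expand to all cluster members) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Foldseek's code: the function name expand_to_members and the data layout are hypothetical, and each member simply inherits the representative's score, whereas a real search would produce a separate alignment per member.

```python
def expand_to_members(rep_hits, clusters):
    """Expand hits against cluster representatives to all cluster members.

    rep_hits: list of (query, representative, score) tuples.
    clusters: dict mapping representative -> list of member ids
              (the representative itself included).
    """
    expanded = []
    for query, rep, score in rep_hits:
        # Unknown representatives are treated as singleton clusters.
        for member in clusters.get(rep, [rep]):
            expanded.append((query, member, score))
    return expanded


clusters = {"repA": ["repA", "a1", "a2"], "repB": ["repB"]}
rep_hits = [("q1", "repA", 250), ("q1", "repB", 90)]
for hit in expand_to_members(rep_hits, clusters):
    print(hit)
```

The speed-up comes from the first step: only one entry per cluster has to pass the prefilter, which matters most for highly redundant databases.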

Bug fixes

  • Foldseek correctly computes coverage again (c63725d). Coverage computation was broken since release 6 (29979fb).
  • --alignment-type 0 (3Di-only) now correctly ignores amino-acid information (f0de872)
  • createdb could miss some files when recursively looking within directories on some file systems (d1d1b86)
  • convertalis --format-output can output qca or tca if only one of the two databases has C-alpha information (311845d)
  • --lddt-thr and --tmscore-thr are ignored when --sort-by-structure-bits 0 is set (b1b4710)

Developers

  • Much smaller precomputed index: use --index-exclude 1 with --prefilter-mode 0 (exhaustive ungapped prefiltering), --index-exclude 2 with --sort-by-structure-bits 0 (no C-alpha), or --index-exclude 3 for both (8f586c0)
  • Enabled WebAssembly (WASM) compilation for Foldseek (408cfae; pending on Daniel-Liu-c0deb0t/block-aligner#26)

7-04e0ec8

15 Jun 03:29

At a glance: Downloadable pdb database can be searched with --cluster-search 1. Many createdb improvements and other bug fixes.

Features

  • createdb properly warns and exits if no protein chain can be extracted (a146142, #134)
  • createdb separates PDB/mmCIF MODEL records into different source/lookup entries (d488f4a)
  • createdb filters out structures that are not proteins (d48d389)
  • databases downloader supports cluster databases (ef768f4)
  • pdb database creation script has been updated to produce a cluster database that can be searched with --cluster-search 1 (8eb36a2)

Bug fixes

  • Fixed a bug with block-aligner where long protein sequences would error out (0627447) Thanks @Daniel-Liu-c0deb0t!
  • Foldseek can be compiled without zlib, fixed an issue with zlib linking to gemmi (0832bef, 1a038db)
  • Fixed Dockerfile to drop backports, as it is not needed with Debian bookworm (04e0ec8)

Others

  • Made compressca an expert tool, hiding it from the default view to avoid confusion. (e4fe5be)

Foldseek 6-29e2557

08 May 16:44

At a glance: Introduced block-aligner for faster alignments, added ungapped prefilter mode, added cluster search support

Major Features

  • Introduced block-aligner, a new banded-alignment algorithm that speeds up alignments by ~2x. Check out the block-aligner preprint. Thanks @Daniel-Liu-c0deb0t!
  • Added ungapped prefilter mode (--prefilter-mode 1). Similar to the HHblits prefilter, it exhaustively aligns all queries against all targets without gaps. This mode has much lower memory requirements and should scale better for searches with a single query or a few queries. However, it scales worse with many queries.
  • Added cluster search support, similar to the search introduced in ColabFold
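
To illustrate what an ungapped alignment computes, the Python sketch below scores every diagonal between two sequences and keeps the best contiguous segment. It is a toy version under stated assumptions: the function name best_ungapped_score and the match/mismatch scores are invented for this example, whereas Foldseek scores with substitution matrices and vectorized code.

```python
def best_ungapped_score(query, target, match=2, mismatch=-1):
    """Best-scoring gap-free local segment over all diagonals."""
    best = 0
    # Each offset fixes one diagonal: query[i] is compared with target[i + offset].
    for offset in range(-(len(query) - 1), len(target)):
        running = 0
        for i in range(len(query)):
            j = i + offset
            if 0 <= j < len(target):
                step = match if query[i] == target[j] else mismatch
                # Reset to zero when the running segment score drops below zero.
                running = max(0, running + step)
                best = max(best, running)
    return best


print(best_ungapped_score("ACDE", "XXACDEXX"))
print(best_ungapped_score("ACDE", "ACXE"))
```

Because every query is compared against every target, the work grows with the product of both set sizes, which is consistent with the note above that this mode scales worse with many queries.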

Features

  • Improved README
  • Added support for qtmscore and ttmscore in convertalis --format-output
  • LDDT computation is now faster

Bug fixes

  • --greedy-best-hit search mode is now correctly exposed. Thanks @Pooryamb!
  • Removed ANISOU parsing of PDB
  • Added missing Foldseek specific convertalis --format-output options to help text

Developers and Maintainers

Foldseek now requires Rust to compile. Please make sure Rust 1.68 or newer is installed, as we have observed issues with 1.64. You can pass -DIGNORE_RUST_VERSION=1 to CMake to ignore the check. Please ensure the Foldseek regression test in ./regression/run_regression.sh passes before shipping Foldseek packages. We also require at least CMake 3.15 now.

Foldseek 5-53465f0

10 Feb 06:40

At a glance: Compressed C-alpha coordinates, now enabled by default, greatly decrease the resource consumption of large databases. Otherwise, this release is mostly housekeeping.

Features

  • Compressed C-alpha coordinates are now enabled by default
  • Foldseek now deals correctly with modified amino acids and HETATMS
  • Exhaustive search mode that skips prefiltering with --exhaustive-search 1
  • TM-align speed up by replacing score_fun8_standard

Bug fixes

  • Disable gap-specific profiles for structure alignments
  • C-alpha coordinates were not correctly preloaded in the alignment stage
  • Reciprocal best hit search now disables new scoring and compositional bias correction for consistent scores in both directions
  • Fixed various bugs around compressed C-alpha coordinates
  • Computed RMSD was wrong
  • Load the DB in memory before aligning (structurealign performance issue)
  • Alignment now uses the correct --comp-bias-corr-scale
  • Fix crash with highly compositionally biased sequences
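
The reciprocal best hit fix above matters because the method compares the best hit in each search direction, so scores have to be consistent both ways. A minimal Python sketch of the selection step follows; the function name reciprocal_best_hits and the input layout are assumptions for illustration and do not mirror Foldseek's internals.

```python
def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit.

    Each argument maps a query id to a list of (target id, score) tuples.
    """
    def best(hits):
        # Keep only the highest-scoring target for every query.
        return {q: max(targets, key=lambda ts: ts[1])[0]
                for q, targets in hits.items() if targets}

    best_ab = best(a_vs_b)
    best_ba = best(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


a_vs_b = {"a1": [("b1", 100), ("b2", 80)], "a2": [("b1", 90)]}
b_vs_a = {"b1": [("a1", 100), ("a2", 90)], "b2": [("a1", 80)]}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

If the score of a pair differed between the two directions, the best hit could flip in one of them and a true pair would be lost.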

Foldseek 4-645b789

28 Dec 14:12

Release at a glance: better hit ranking, critical bug fix, structure clustering, smaller database size and updated AlphaFold Databases.

Features

  • foldseek databases now offers the AlphaFoldDB v4 databases.
  • We have improved hit ranking in Foldseek by multiplying the 3Di/AA bit-score by the geometric mean of alignment LDDT and TMscore, resulting in more accurate rankings.
  • The --format-output prob parameter now returns the probability of homology.
  • The --format-mode 5 flag generates PDB files with all Cα atoms superimposed based on the aligned coordinates onto the query structure.
  • We have added a faster computation for LDDT, available with the --format-output lddt,lddtfull flag. The lddt flag outputs the average LDDT score for all Cα, while the lddtfull flag outputs a string of LDDT scores for each Cα.
  • The --coord-store-mode 2 parameter allows lossless storage of C-alpha coordinates in compressed format.
  • TMalign mode (--alignment-type 1) now uses the 3Di/AA alignment as a prefilter to improve the precision and recall of TMalign; this also makes the TMalign mode much faster.
  • We have added support for reading in Foldcomp databases (see foldcomp.foldseek.com).
  • The database module now includes an option to download ESMAtlas30.
  • We have added support for easy-cluster, a tool to cluster structural datasets using 3Di/AA alignment, LDDT, and TMscore.
  • We have added support for profile searches as well as iterative searches using the --num-iterations flag.
  • TMalign results can now be sorted by qTM, tTM, min(qTM, tTM), max(qTM, tTM), and avg(qTM, tTM) using the --sort flag.
  • New module compressca: converts an uncompressed Cα database to compressed format.
  • New module convert2pdb: converts a Foldseek structure database to a multi-model PDB file.
  • We added our PDB100 update pipeline to util/update_webserver_pdb
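
The five TMalign sort criteria listed above (qTM, tTM, min, max, avg) differ only in the key used for ranking. The Python sketch below shows this on made-up hits; the mode names, the function sort_hits, and the example values are hypothetical and are not the values accepted by the --sort flag.

```python
# One ranking key per criterion; mode names are illustrative only.
SORT_KEYS = {
    "qtm": lambda h: h["qtm"],
    "ttm": lambda h: h["ttm"],
    "min": lambda h: min(h["qtm"], h["ttm"]),
    "max": lambda h: max(h["qtm"], h["ttm"]),
    "avg": lambda h: (h["qtm"] + h["ttm"]) / 2,
}


def sort_hits(hits, mode="avg"):
    """Return hits ordered from best to worst under the chosen criterion."""
    return sorted(hits, key=SORT_KEYS[mode], reverse=True)


hits = [
    {"target": "t1", "qtm": 0.90, "ttm": 0.40},
    {"target": "t2", "qtm": 0.60, "ttm": 0.75},
    {"target": "t3", "qtm": 0.50, "ttm": 0.95},
]
for mode in SORT_KEYS:
    print(mode, [h["target"] for h in sort_hits(hits, mode)])
```

The choice matters when query and target differ in length: a short query fully contained in a long target gives a high qTM but a low tTM, so min() favors global similarity while max() favors domain-level matches.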

Breaking Change

  • 3Di/AA score reported by Foldseek is now bit-score * sqrt(alignment LDDT * alignment TMscore)
  • The default sort order of TMalign is now avg(qTM,tTM).
  • We do not provide the "Alphafold/UniProt-NO-CA" database anymore, Cα databases are now always required.
  • AlphaFoldDB Swiss-Prot and Proteome file names have changed. Downloads for these will stop working on Foldseek versions before this one. More generally, since the Cα database format has changed and is incompatible with older Foldseek versions, none of the v4 databases will work with previous versions.
  • The default E-value is now 10.
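
The new score from the first item above, bit-score * sqrt(alignment LDDT * alignment TMscore), is easy to reproduce by hand. The Python sketch below applies it to two made-up hits; the function name ranking_score and the numbers are illustrative assumptions.

```python
import math


def ranking_score(bit_score, lddt, tm_score):
    """bit-score scaled by the geometric mean of alignment LDDT and TMscore."""
    return bit_score * math.sqrt(lddt * tm_score)


# A structurally consistent hit versus a higher-scoring but poorly superposable one.
good = ranking_score(200, 0.81, 0.64)
poor = ranking_score(300, 0.30, 0.30)
print(round(good, 2), round(poor, 2))
```

Since LDDT and TMscore both lie between 0 and 1, the factor can only lower the bit-score, and it lowers it most for hits whose alignment does not translate into a good structural superposition.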

Bug fixes

  • We have fixed an issue that resulted in the loss of high-scoring diagonals during the prefilter step.
  • The visualization has been fixed for cases where the alignment length is exactly 80.
  • We have fixed issues with tar inputs.

Foldseek 3-915ef7d

01 Aug 13:12

Features

You can choose between Alphafold/UniProt, Alphafold/UniProt-NO-CA and Alphafold/UniProt50:
Alphafold/UniProt: Contains all 214 million entries from the AlphaFold UniProt database, including C-alpha. This database is ~700GB large to download and ~950GB after extraction.
Alphafold/UniProt-NO-CA: Excludes C-alphas and is much smaller (~70GB download, ~170GB extracted). However, TM-align based alignments do not work (search --alignment-type 1, tmalign, and convertalis --format-output alntmscore,u,t).
Alphafold/UniProt50: Alphafold/UniProt clustered with MMseqs2 to 50% sequence identity and 80% bidirectional coverage (~190GB download). We offer this database in the web server at https://search.foldseek.com.
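
The clustering criterion quoted for Alphafold/UniProt50 (50% sequence identity and 80% bidirectional coverage) can be written as a small predicate. The Python sketch below is a simplification under stated assumptions: the function name same_cluster is hypothetical, and coverage is taken as the number of aligned residues divided by each sequence length, which may differ in detail from how MMseqs2 computes it.

```python
def same_cluster(identity, aligned_len, len_a, len_b,
                 min_identity=0.5, min_coverage=0.8):
    """True if a pair meets the identity and the bidirectional coverage cutoff."""
    cov_a = aligned_len / len_a
    cov_b = aligned_len / len_b
    # "Bidirectional" means the cutoff must hold for both sequences.
    return identity >= min_identity and min(cov_a, cov_b) >= min_coverage


print(same_cluster(0.60, 90, 100, 105))  # both sequences well covered
print(same_cluster(0.60, 90, 100, 200))  # second sequence only 45% covered
```

Requiring coverage in both directions keeps a short fragment from being merged with a much longer multi-domain protein.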

  • Added databases TSV output
  • createdb supports downloading structures from Google Cloud Storage. Not enabled by default, see user guide on how to compile Foldseek with GCS support
  • PDB offered through databases will be updated regularly. Thanks to @jaylee2000

Known issues

  • prefilter against large databases such as the AlphaFold Uniprot Protein Structure Database is executed with 6-mers (-k 6). This is less efficient than 7-mers. We will optimize 7-mer parameters in a future release and re-enable automatic k-mer size choice

Bug fixes

  • Fixed PDB download

Foldseek 2-8bd520

08 Jul 16:32
8bd5201

Features

  • implemented reciprocal-best-structure-hit search (rbh and easy-rbh) similar to Monzon et al. preprint
  • C-alpha only structures are supported as input (backbone is completed using pulchra)
  • convertalis can output an HTML-based result visualization (--format-mode 3)
    Example: foldseek easy-search example/d1asha_ example/ aln.html tmp --format-mode 3

  • add support to read structures from tar and tar.gz in createdb, easy-search and easy-rbh.
    Example: foldseek easy-rbh UP000005640_9606_HUMAN_v2.tar UP000001940_6239_CAEEL_v2.tar rbh tmp --tar-include '.*pdb'
  • convertalis can output C-alpha, TMscore, TM rotation matrices (--format-output qca,tca,alntmscore,u,t respectively)
foldseek easy-search example/ example/ aln tmp --format-output query,target,alntmscore,u,t
cat aln
d2gdma_ d2gdma_ 1.000E+00       1.000,-0.000,0.000,0.000,1.000,0.000,-0.000,-0.000,1.000        -0.000,-0.000,0.000
d2gdma_ d1q1fa_ 7.971E-01       0.299,-0.746,-0.595,0.952,0.192,0.237,-0.062,-0.638,0.768       94.039,-63.738,34.804
d2gdma_ d1cqxa1 6.794E-01       0.694,-0.662,0.283,0.570,0.746,0.345,-0.439,-0.078,0.895        7.534,-93.168,-12.301
  • introduce --alt-ali to compute additional sub-optimal alignments for a query-target pair #12
  • added Foldseek docker image (supports linux/amd64 and linux/arm64)
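
The u and t columns in the sample output above can be parsed and applied to coordinates as x' = U x + t. The Python sketch below does this for the second line of that output. It rests on two assumptions that should be checked against the Foldseek documentation: that the nine u values are stored row by row, and that fields are separated by whitespace. The function names parse_hit and transform are hypothetical.

```python
def parse_hit(line):
    """Parse one 'query,target,alntmscore,u,t' line into typed fields."""
    query, target, tm, u_raw, t_raw = line.split()
    u = [float(x) for x in u_raw.split(",")]
    t = [float(x) for x in t_raw.split(",")]
    rotation = [u[0:3], u[3:6], u[6:9]]  # assumed row-major layout
    return query, target, float(tm), rotation, t


def transform(point, rotation, translation):
    """Apply x' = U x + t to a single 3D point."""
    return [sum(r * p for r, p in zip(row, point)) + shift
            for row, shift in zip(rotation, translation)]


line = ("d2gdma_ d1q1fa_ 7.971E-01 "
        "0.299,-0.746,-0.595,0.952,0.192,0.237,-0.062,-0.638,0.768 "
        "94.039,-63.738,34.804")
query, target, tm, rotation, translation = parse_hit(line)
print(query, target, tm)
print(transform([0.0, 0.0, 0.0], rotation, translation))
```

Applying the transform to every C-alpha coordinate of one structure places it in the frame of the other; which of the two is moved is not stated in these notes and should be verified before use.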

Bug fixes

  • fix invalid database type check in aln2tmscore #14
  • aln2tmscore could get stuck in an infinite loop #17
  • default parameters are now -s 9.5 (sensitivity) and --max-seqs 1000

Foldseek Release 1-3c64211

09 Feb 17:31

First release of Foldseek

Foldseek enables fast and sensitive comparisons of large structure sets. It reaches sensitivities similar to state-of-the-art structural aligners while being at least 20,000 times faster.

Publications

van Kempen M, Kim S, Tumescheit C, Mirdita M, Söding J, and Steinegger M. Foldseek: fast and accurate protein structure search. bioRxiv, doi:10.1101/2022.02.07.479398 (2022)

Webserver

Search your protein structures against the AlphaFoldDB and PDB in seconds using our Foldseek webserver:

🚀search.foldseek.com