
Protein Language Modeling Course

  1. Topics
  2. Course Structure
  3. References
  4. Resources

Course Structure

| Theory | Link | Data | Models |
| --- | --- | --- | --- |
| Topics | slides | | |
| **Hands-on** | | | |
| Sequence Analysis | Seq. Analysis Notebook (Open In Colab) | Exploring protein sequence and structure data | |
| Fine-tuning a model | Model Training Notebook (Open In Colab) | Taking an existing model and tuning it for other prediction/classification tasks | Evolutionary Scale Modeling (ESM) |
| Working with Embeddings | Embeddings Notebook (Open In Colab) | Accessing protein representations (embeddings) generated by existing LMs; see the sketch after this table | ProtTrans |
| Predictions | pML Predictions Notebook (Open In Colab) | Using embeddings to predict features or classify sequences | ProtTrans |
| Protein Design | Protein Design Notebook (Open In Colab) | De novo protein design and engineering using an LLM | ProtGPT2, ESM |

References


  1. Neural Machine Translation by Jointly Learning to Align and Translate Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

  2. Attention Is All You Need Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

  3. MSA Transformer Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu, Alexander Rives

  4. Transformer-based deep learning for predicting protein properties in the life sciences Abel Chandra, Laura Tünnermann, Tommy Löfstedt, Regina Gratz

  5. BERTology Meets Biology: Interpreting Attention in Protein Language Models Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

  6. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, David Baker

  7. Large language models generate functional protein sequences across diverse families Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser & Nikhil Naik

  8. Learning functional properties of proteins with language models Serbulent Unsal, Heval Atas, Muammer Albayrak, Kemal Turhan, Aybar C. Acar & Tunca Doğan

  9. Learning the protein language: Evolution, structure, and function Tristan Bepler, Bonnie Berger

  10. The language of proteins: NLP, machine learning & protein sequences Dan Ofer, Nadav Brandes, Michal Linial

  11. Evolutionary-scale prediction of atomic-level protein structure with a language model Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, ..., Alexander Rives

  12. Generative power of a protein language model trained on multiple sequence alignments Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol

  13. How Huge Protein Language Models Could Disrupt Structural Biology

  14. Embeddings from protein language models predict conservation and variant effects Céline Marquet, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev & Burkhard Rost

  15. Collectively encoding protein properties enriches protein language models Jingmin An & Xiaogang Weng

  16. ProGen: Language Modeling for Protein Generation Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher

  17. Transformer protein language models are unsupervised structure learners Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, Alexander Rives

  18. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences Alexander Rives, Joshua Meier, Tom Sercu and Rob Fergus

  19. NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning Magnus Haraldson Høie, Erik Nicolas Kiehl, Bent Petersen, Morten Nielsen, Ole Winther, Henrik Nielsen, Jeppe Hallgren, Paolo Marcatili

  20. Modeling Protein Using Large-scale Pretrain Language Model Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang github

  21. Deciphering antibody affinity maturation with language models and weakly supervised learning Jeffrey A. Ruffolo, Jeffrey J. Gray, Jeremias Sulam github

  22. Protein embeddings improve phage-host interaction prediction Mark Edward M. Gonzales, Jennifer C. Ureta, Anish M.S. Shrestha github

  23. ProteinBERT: a universal deep-learning model of protein sequence and function Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial github

  24. ProtGPT2 is a deep unsupervised language model for protein design Noelia Ferruz, Steffen Schmidt & Birte Höcker Hugging Face

  25. Protein-Protein Interaction Prediction is Achievable with Large Language Models Logan Hallee, Jason P. Gleghorn

  26. Accurate prediction of virus-host protein-protein interactions via a Siamese neural network using deep protein sequence embeddings Sumit Madan, Victoria Demina, Marcus Stapf, Oliver Ernst, Holger Fröhlich

  27. Structure-informed Language Models Are Protein Designers Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, Quanquan Gu

  28. Graph-BERT and language model-based framework for protein–protein interaction identification Kanchan Jha, Sourav Karmakar & Sriparna Saha

  29. Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, Burkhard Rost

  30. Contrastive learning in protein language space predicts interactions between drugs and protein targets Rohit Singh, Samuel Sledzieski, Bryan Bryson, Bonnie Berger

  31. De novo design of protein structure and function with RFdiffusion Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Sergey Ovchinnikov, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek & David Baker

  32. Single-sequence protein structure prediction using a language model and deep learning Ratul Chowdhury, Nazim Bouatta, Surojit Biswas, Christina Floristean, Anant Kharkar, Koushik Roy, Charlotte Rochereau, Gustaf Ahdritz, Joanna Zhang, George M. Church, Peter K. Sorger & Mohammed AlQuraishi

  33. Before and after AlphaFold2: An overview of protein structure prediction Letícia M. F. Bertoline, Angélica N. Lima, Jose E. Krieger, and Samantha K. Teixeira

  34. Evaluating Protein Transfer Learning with TAPE Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song

  35. FLIP: Benchmark tasks in fitness landscape inference for proteins Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, Kevin K. Yang

  36. Standards, tooling and benchmarks to probe representation learning on proteins Joaquin Gomez Sanchez, Sebastian Franz, Michael Heinzinger, Burkhard Rost, Christian Dallago

  37. Learning meaningful representations of protein sequences Nicki Skafte Detlefsen, Søren Hauberg & Wouter Boomsma

  38. Language modelling for biological sequences – curated datasets and baselines Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, Henrik Nielsen

  39. Language models generalize beyond natural proteins Robert Verkuil, Ori Kabeli, Yilun Du, Basile I. M. Wicky, Lukas F. Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, Alexander Rives

  40. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

  41. SETH predicts nuances of residue disorder from protein embeddings Dagmar Ilzhöfer, Michael Heinzinger, Burkhard Rost

  42. Exploiting pretrained biochemical language models for targeted drug design Gökçe Uludoğan, Elif Ozkirimli, Kutlu O Ulgen, Nilgün Karalı, Arzucan Özgür

Other Resources