
Commit

Add Quick Start to README
mheinzinger authored Jan 31, 2023
1 parent 3f48cc2 commit b9e9bb5
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions README.md
@@ -20,6 +20,7 @@ This repository will be updated regulary with **new pre-trained models for prote
Table of Contents
=================
* [ ⌛️  News](#news)
* [ 🚀  Quick Start](#quick)
* [ ⌛️  Models Availability](#models)
* [ ⌛️  Dataset Availability](#datasets)
* [ 🚀  Usage ](#usage)
@@ -47,6 +48,49 @@ Table of Contents
## ⌛️  News
* 2022/11/18: Availability: [LambdaPP](https://embed.predictprotein.org/) offers a simple web-service to access ProtT5-based predictions and UniProt now offers to download [pre-computed ProtT5 embeddings](https://www.uniprot.org/help/embeddings) for a subset of selected organisms.

<a name="quick"></a>
## 🚀&nbsp; Quick Start
Example of how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as a [Colab notebook](https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing):
```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
```
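
To get per-protein embeddings for every sequence in the batch at once, you can average each sequence over its valid residue positions as indicated by the attention mask. A minimal sketch building on the variables above (the name `per_protein_embs` is only illustrative):
```python
# mean-pool each sequence over its valid residues
# (attention_mask is 1 for residues and the trailing special token, 0 for padding)
per_protein_embs = []
for i in range(len(sequence_examples)):
    seq_len = int(attention_mask[i].sum().item()) - 1  # drop the </s> special token
    per_protein_embs.append(embedding_repr.last_hidden_state[i, :seq_len].mean(dim=0))
per_protein_embs = torch.stack(per_protein_embs)  # shape (2 x 1024)
```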



<a name="models"></a>
## ⌛️&nbsp; Models Availability
