
Commit

Add Quick Start to README
mheinzinger authored Jan 31, 2023
1 parent 3f48cc2 commit b9e9bb5
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions README.md
@@ -20,6 +20,7 @@ This repository will be updated regulary with **new pre-trained models for prote
Table of Contents
=================
* [ ⌛️  News](#news)
* [ 🚀  Quick Start](#quick)
* [ ⌛️  Models Availability](#models)
* [ ⌛️  Dataset Availability](#datasets)
* [ 🚀  Usage ](#usage)
@@ -47,6 +48,49 @@ Table of Contents
## ⌛️  News
* 2022/11/18: Availability: [LambdaPP](https://embed.predictprotein.org/) offers a simple web-service to access ProtT5-based predictions and UniProt now offers to download [pre-computed ProtT5 embeddings](https://www.uniprot.org/help/embeddings) for a subset of selected organisms.

<a name="quick"></a>
## 🚀&nbsp; Quick Start
Example of how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as a [Colab notebook](https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing):
```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
```
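
To get per-protein embeddings for every sequence in the batch at once, you can average each sequence over its valid residue positions as indicated by the attention mask. A minimal sketch building on the variables above (the name `per_protein_embs` is only illustrative):
```python
# mean-pool each sequence over its valid residues
# (attention_mask is 1 for residues and the trailing special token, 0 for padding)
per_protein_embs = []
for i in range(len(sequence_examples)):
    seq_len = int(attention_mask[i].sum().item()) - 1  # drop the </s> special token
    per_protein_embs.append(embedding_repr.last_hidden_state[i, :seq_len].mean(dim=0))
per_protein_embs = torch.stack(per_protein_embs)  # shape (2 x 1024)
```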



<a name="models"></a>
## ⌛️&nbsp; Models Availability
