Publication version
nishanthmerwin committed Jun 7, 2019
0 parents commit 0d8b8c0
Showing 23 changed files with 2,160 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
*.pyc
.idea*
*.egg-info*
test_resources/*
213 changes: 213 additions & 0 deletions README.md
@@ -0,0 +1,213 @@

# NLPPrecursor


NLPPrecursor is a deep learning framework that analyzes protein sequences and predicts their RiPP biosynthetic family along
with a possible cleavage site, allowing for the rapid discovery of RiPPs in genomic data. NLPPrecursor works best with a tool
such as [PRODIGAL](https://github.com/hyattpd/Prodigal), but any ORF finder should suffice.

NLPPrecursor is freely available for use on the DeepRiPP website: <http://deepripp.magarveylab.ca>. This repository
demonstrates how to use NLPPrecursor programmatically on your own hardware. Be aware that this does require you to be comfortable
with Python!

To train NLPPrecursor, you'll need to perform two main steps, one for each of its two modules.

# Installation

This repo has two main requirements: PyTorch and fastai.

Install PyTorch according to your GPU/CPU preference by following the instructions at <https://pytorch.org>.

fastai is a rapidly developed library; unfortunately, that means some of the models within DeepRiPP require an older installation of fastai.
In a separate location, clone fastai, check out the specific commit below, and install the package:

```bash
git clone https://github.com/fastai/fastai.git
cd fastai
git checkout fee0e6a0b3af033d41d3468df1c138aecb134926
pip install .
```

Finally, you should be ready to install NLPPrecursor using the following command:

```bash
pip install git+https://github.com/magarveylab/nlpprecursor
```
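
To confirm the installation worked, a minimal sanity check is to import the three packages and print their versions (this assumes the pinned fastai commit exposes `fastai.__version__`, as fastai 1.x releases do):

```python
# Minimal sanity check: all three packages should import without errors.
import torch
import fastai
import nlpprecursor

print("torch:", torch.__version__)
print("fastai:", fastai.__version__)
```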


## Example usage for prediction

Current models are available through our [releases](https://github.com/magarveylab/nlpprecursor/releases).

To use them in your own analysis, run the following code:


```python

from nlpprecursor.classification.data import DatasetGenerator as CDG
from nlpprecursor.annotation.data import DatasetGenerator as ADG
from pathlib import Path
import json

models_dir = Path("../models")  # downloaded from releases!

class_model_dir = models_dir / "classification"
class_model_path = class_model_dir / "model.p"
class_vocab_path = class_model_dir / "vocab.pkl"
annot_model_dir = models_dir / "annotation"
annot_model_path = annot_model_dir / "model.p"
annot_vocab_path = annot_model_dir / "vocab.pkl"

sequences = [
    {
        "sequence": "MTYERPTLSKAGGFRKTTGLAGGTAKDLLGGHQLI",
        "name": "unique_name",
    }
]

class_predictions = CDG.predict(class_model_path, class_vocab_path, sequences)
cleavage_predictions = ADG.predict(annot_model_path, annot_vocab_path, sequences)

print("Class predictions")
print(json.dumps(class_predictions, indent=4))

print("Cleavage predictions")
print(json.dumps(cleavage_predictions, indent=4))
```

```
Class predictions
[
    {
        "class_predictions": [
            {
                "class": "LASSO_PEPTIDE",
                "score": 0.9999966621398926
            }
        ],
        "name": "unique_name"
    }
]
Cleavage predictions
[
    {
        "name": "unique_name",
        "cleavage_prediction": {
            "sequence": "LAGGTAKDLLGGHQLI",
            "start": 19,
            "stop": 35,
            "score": -19735.994140625,
            "name": "unique_name",
            "status": "success"
        }
    }
]
```
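
If your precursors come from an ORF finder such as PRODIGAL, its predicted protein FASTA can be converted into the `sequences` format above with a few lines of plain Python. A minimal sketch, assuming a hypothetical `proteins.faa` file containing the predicted protein translations:

```python
from pathlib import Path


def read_fasta(path):
    """Parse a protein FASTA file into the {"name", "sequence"} records used above."""
    records = []
    name, chunks = None, []
    with Path(path).open() as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    # Strip any trailing stop-codon asterisk from the previous record.
                    records.append({"name": name, "sequence": "".join(chunks).rstrip("*")})
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records.append({"name": name, "sequence": "".join(chunks).rstrip("*")})
    return records


sequences = read_fasta("proteins.faa")  # hypothetical ORF-finder protein output
```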


## Example training

All training data used to build the above models is available in this repo under `training_data`. Training
largely happens in two steps: first the classification model, and second the cleavage (also called annotation) model.

In both cases, the training data is randomly stratified into training, validation and test sets. During training,
you will see up-to-date stats based on the validation-set loss. At the end, the model is evaluated against
the test set and the results are written to the `data_path` directory.
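
The training JSONs ship with the repo, so a quick way to get a feel for their layout before training is to load one and preview it. A minimal sketch that makes no assumption about the exact record structure:

```python
import json
from pathlib import Path

data_path = Path("./training_data/classification")

# Preview the raw structure of one training file without assuming its schema.
with (data_path / "class_data.json").open() as fp:
    data = json.load(fp)

print(type(data))
print(json.dumps(data, indent=2)[:500])
```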


#### Classification
The classification model is simple to train, and usually takes ~8 hours on a 4-core machine with 8 GB of RAM and an NVIDIA K80 GPU.
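
Before committing to a multi-hour run, it is worth confirming that PyTorch can actually see your GPU. This small check uses PyTorch's own API and is not part of NLPPrecursor:

```python
import torch

# A long training run will be far slower if no CUDA device is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

With the hardware confirmed, the classification training itself looks like this: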

```python
from nlpprecursor.classification.data import DatasetGenerator
from pathlib import Path


def train_opt(lm_json, class_json, data_path):
    # Args: train split percent, language-model JSON, classification JSON,
    # save directory, batch size.
    dg = DatasetGenerator(0.9, lm_json, class_json, data_path, bs=2)
    print(dg.stage)
    dg._read_jsons()
    dg.stage += 1
    print(dg.stage)

    dg.tokenize()
    dg.stage += 1
    print(dg.stage)

    dg.split_class_data()
    dg.stage += 1
    print(dg.stage)

    dg.train_lm(epochs=1)
    dg.stage += 1
    print(dg.stage)

    dg.train_class(epochs=1)

    dg.test_class()


if __name__ == "__main__":
    data_path = Path("./training_data/classification")
    lm_json = data_path / "lm_data.json"
    class_json = data_path / "class_data.json"
    train_opt(lm_json, class_json, data_path)

```



#### Cleavage (annotation)

To train the cleavage model, use the following logic to train a new model or to update an existing one.

```python

from nlpprecursor.annotation.data import DatasetGenerator
from pathlib import Path
import json


def train(data_path, json_path):
    # Args: train split percent, training data path, save directory, batch size.
    dg = DatasetGenerator(0.9, json_path, data_path, bs=5)
    dg.run(1)  # Number of epochs


def test(data_path, raw_data_path):
    data_path = Path(data_path)
    model_path = data_path / "model.p"
    vocab_path = data_path / "vocab.pkl"
    datasplit_path = data_path / "datasplit.json"

    # Evaluate the trained model against the held-out test split and save the results.
    results = DatasetGenerator.evaluate_later(model_path, vocab_path, datasplit_path, raw_data_path)
    outpath = data_path / "tested.json"
    with outpath.open("w") as fp:
        json.dump(results, fp)


def predict(data_path, sequences):
    data_path = Path(data_path)
    model_path = data_path / "model.p"
    vocab_path = data_path / "vocab.pkl"
    return DatasetGenerator.predict(model_path, vocab_path, sequences)


if __name__ == "__main__":

    data_path = "./training_data/annotation"
    json_path = data_path + "/all_props.json"

    train(data_path, json_path)
    test(data_path, json_path)

```
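
Once both models are trained, they can be run on new sequences exactly like the released models shown earlier. A minimal sketch, assuming each training run leaves a `model.p` and `vocab.pkl` in its `data_path`, as the annotation `test()` helper above and the released model layout suggest:

```python
from pathlib import Path
import json

from nlpprecursor.classification.data import DatasetGenerator as CDG
from nlpprecursor.annotation.data import DatasetGenerator as ADG

class_dir = Path("./training_data/classification")
annot_dir = Path("./training_data/annotation")

sequences = [{"sequence": "MTYERPTLSKAGGFRKTTGLAGGTAKDLLGGHQLI", "name": "unique_name"}]

# Same predict() calls as in the prediction example above, pointed at the newly trained models.
class_predictions = CDG.predict(class_dir / "model.p", class_dir / "vocab.pkl", sequences)
cleavage_predictions = ADG.predict(annot_dir / "model.p", annot_dir / "vocab.pkl", sequences)

print(json.dumps(class_predictions, indent=4))
print(json.dumps(cleavage_predictions, indent=4))
```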







Empty file added nlpprecursor/__init__.py
Empty file.