A Python library for working with ProteinNet text data, allowing you to easily load, stream and filter data, map functions across records and produce TensorFlow datasets. For details of the dataset see the ProteinNet Bioinformatics paper. Documentation for all functions of the module is available here.
pip install proteinnetpy
Or install the development version from GitHub:
pip install git+https://github.com/allydunham/proteinnetpy
- Python 3
- Numpy
- Biopython
- TensorFlow (if using the datasets module)
The main object used in ProteinNetPy is the ProteinNetRecord, which allows access to the various record fields and methods for common manipulations, such as calculating a one-hot sequence representation or residue distance matrix.
It also supports most applicable built-in operations, such as len and str.
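To make the two manipulations mentioned above concrete, here is a minimal pure-Python sketch of a one-hot sequence encoding and a residue distance matrix. It is an illustration of the calculations only, not the ProteinNetRecord implementation: the function names, the residue ordering in AMINO_ACIDS and the row-per-residue layout are all assumptions of the sketch, and the library may differ on each.

```python
import math

# Assumed residue ordering, for illustration only; the library may use another.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_sequence(sequence):
    """Return one row per residue, each a 20-long vector containing a single 1."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


def distance_matrix(coords):
    """Return the symmetric matrix of Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


encoded = one_hot_sequence("ACD")
distances = distance_matrix([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)])
```

Here encoded has three rows of length 20 and distances is a 3 x 3 matrix with zeros on the diagonal.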
While the parser module contains a generator to parse files, it is generally easier to use the ProteinNetDataset class from the data module:
from proteinnetpy.data import ProteinNetDataset
data = ProteinNetDataset(path="path/to/proteinnet")
This class includes a preload argument, which determines whether the dataset is loaded into memory or streamed.
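The trade-off behind preloading versus streaming can be sketched with a small stand-in class. TinyDataset and iter_records are hypothetical names invented for this illustration and do not reflect how ProteinNetDataset is implemented; each non-empty line plays the role of a parsed record.

```python
import io


def iter_records(handle):
    """Yield each non-empty line, standing in for one parsed record."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


class TinyDataset:
    """Hypothetical stand-in showing preload versus streaming."""

    def __init__(self, opener, preload=True):
        # opener is a zero-argument callable returning a fresh file-like object
        self.opener = opener
        self.records = list(iter_records(opener())) if preload else None

    def __iter__(self):
        if self.records is not None:
            return iter(self.records)  # served from memory
        return iter_records(self.opener())  # source is re-read on every pass


source = "rec1\nrec2\n\nrec3\n"
preloaded = TinyDataset(lambda: io.StringIO(source), preload=True)
streamed = TinyDataset(lambda: io.StringIO(source), preload=False)
```

Preloading costs memory but makes repeated passes cheap, while streaming keeps memory use low at the price of re-reading the source each time the dataset is iterated.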
ProteinNetDataset also supports filtering through the filter_func argument, which takes a function that returns a truthy value for each record that should be kept in the dataset.
A range of common filters is included in the data module, as well as combine_filters(), which applies all passed filters to each record.
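As a rough picture of how filter functions compose, the sketch below builds two filters and combines them. It assumes that a record is kept only when every filter returns a truthy value, which is one reading of the description above rather than confirmed behaviour of combine_filters(). The names make_length_range_filter, no_unknown_residues and combine_all, and the dictionary records, are hypothetical stand-ins for this illustration.

```python
def make_length_range_filter(min_length, max_length):
    """Hypothetical factory: keep sequences with min_length <= length <= max_length."""
    def length_filter(record):
        return min_length <= len(record["primary"]) <= max_length
    return length_filter


def no_unknown_residues(record):
    """Hypothetical filter: reject sequences containing the unknown residue X."""
    return "X" not in record["primary"]


def combine_all(*filters):
    """Return a filter that passes only when every given filter passes (assumed AND)."""
    def combined(record):
        return all(f(record) for f in filters)
    return combined


records = [
    {"id": "a", "primary": "ACDEFG"},
    {"id": "b", "primary": "AC"},
    {"id": "c", "primary": "ACXEFG"},
]
keep = combine_all(make_length_range_filter(3, 10), no_unknown_residues)
kept_ids = [r["id"] for r in records if keep(r)]
```

Only record a passes both filters: b is too short and c contains an unknown residue.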
Once a dataset has been loaded it can be iterated over to process data.
The ProteinNetMap class creates map objects that map a function over the dataset, with options to stream the map on each iteration or pre-calculate results.
These map objects have a generate method that creates a generator object yielding the output of the function.
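The difference between a streamed and a pre-calculated map can be sketched as follows. TinyMap is a hypothetical stand-in written for this illustration, not the ProteinNetMap implementation; it borrows only the static flag and the generate method name from the description above, and uses plain strings in place of records.

```python
class TinyMap:
    """Hypothetical stand-in for a map over a dataset."""

    def __init__(self, dataset, func, static=False):
        self.dataset = dataset
        self.func = func
        # static=True computes every result once up front;
        # otherwise results are recomputed on each pass
        self.cache = [func(record) for record in dataset] if static else None

    def generate(self):
        """Return a generator yielding func(record) for each record."""
        if self.cache is not None:
            yield from self.cache
        else:
            for record in self.dataset:
                yield self.func(record)


calls = []


def sequence_length(record):
    calls.append(record)  # track how often the function runs
    return len(record)


streamed = TinyMap(["ACD", "EFGH"], sequence_length, static=False)
first_pass = list(streamed.generate())
second_pass = list(streamed.generate())
```

In the streamed case the function runs again on every pass, which suits functions with random output such as mutation, whereas a static map trades memory for not repeating the work.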
The LabeledFunction class is provided to create functions annotated with output types and shapes, which are used to create TensorFlow datasets automatically.
The mutation module provides some example functions that return mutated records.
The following example code shows a typical simple usage, creating a streamed TensorFlow dataset from ProteinNet data:
from proteinnetpy import data
from proteinnetpy import tfdataset

class MapFunction(data.LabeledFunction):
    """
    Example ProteinNetMap function outputting a one-hot sequence and contact graph input data
    and multiple alignment PSSM labels
    """
    def __init__(self):
        self.output_shapes = (([None, 20], [None, None]), [None, 20])
        self.output_types = (('float32', 'float32'), 'int32')

    def __call__(self, record):
        return (record.get_one_hot_sequence().T, record.distance_matrix()), record.evolutionary.T

# Keep only records with a sequence length between 32 and 2000
filter_func = data.make_length_filter(min_length=32, max_length=2000)
# Stream records from disk, keeping only those that pass the filter
dataset = data.ProteinNetDataset(path="path/to/proteinnet", preload=False, filter_func=filter_func)
pn_map = data.ProteinNetMap(dataset, map=MapFunction(), static=False, filter_errors=True)
tf_dataset = tfdataset.proteinnet_tf_dataset(pn_map, batch_size=100, prefetch=400, shuffle_buffer=200)
Many more functions, arguments and uses are available, with detailed descriptions currently found in docstrings. Full documentation will be generated from these for a future release.
The package also provides convenience scripts for processing ProteinNet datasets:
- add_angles_to_proteinnet - Add extra fields to a ProteinNet file with φ, ψ and χ backbone/torsion angles
- proteinnet_to_fasta - Extract a fasta file with the sequences from a ProteinNet file
- filter_proteinnet - Filter a ProteinNet file to include/exclude records from a list of IDs
Detailed usage instructions for each can be found using the -h argument.
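For a sense of what a conversion such as proteinnet_to_fasta involves, the sketch below turns ProteinNet-style text into FASTA. It is not the script's implementation: the function name proteinnet_text_to_fasta is hypothetical, the record IDs and sequences are made up, and it assumes the text format marks each field with a tag line such as [ID] or [PRIMARY] followed by the value on the next line, ignoring all other fields.

```python
def proteinnet_text_to_fasta(text):
    """Pair the line after each [ID] tag with the line after the next [PRIMARY] tag."""
    lines = [line.strip() for line in text.splitlines()]
    entries = []
    current_id = None
    # Walk over consecutive (tag, value) line pairs
    for tag, value in zip(lines, lines[1:]):
        if tag == "[ID]":
            current_id = value
        elif tag == "[PRIMARY]" and current_id is not None:
            entries.append(f">{current_id}\n{value}")
            current_id = None
    return "\n".join(entries) + ("\n" if entries else "")


# Made-up records for illustration
example = """[ID]
1ABC_1_A
[PRIMARY]
ACDEFG

[ID]
2XYZ_1_B
[PRIMARY]
MKV
"""
fasta = proteinnet_text_to_fasta(example)
```

The result holds one header line and one sequence line per record, in the order the records appear in the input.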