allydunham/proteinnetpy

ProteinNetPy 1.0.1

A Python library for working with ProteinNet text data, allowing you to easily load, stream and filter data, map functions across records and produce TensorFlow datasets. For details of the dataset, see the ProteinNet Bioinformatics paper. Documentation for all functions of the module is available in the online documentation.

Install

pip install proteinnetpy

Or install the development version from GitHub:

pip install git+https://github.com/allydunham/proteinnetpy

Requirements

  • Python 3
  • NumPy
  • Biopython
  • TensorFlow (if using the datasets module)

Basic Usage

The main object used in ProteinNetPy is the ProteinNetRecord, which gives access to the various record fields and provides methods for common manipulations, such as calculating a one-hot sequence representation or a residue distance matrix. It also supports standard Python operations where applicable, such as len and str. While the parser module contains a generator to parse files, it is generally easier to use the ProteinNetDataset class from the data module:

from proteinnetpy.data import ProteinNetDataset
data = ProteinNetDataset(path="path/to/proteinnet")
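To make the one-hot sequence representation mentioned above concrete, here is a minimal pure-Python sketch of the idea. It is not ProteinNetPy's implementation: the one_hot helper and the AMINO_ACIDS ordering are assumptions made for this illustration, and the library's own method may order residues or orient the matrix differently.

```python
# Illustrative sketch only: not proteinnetpy's implementation.
# The amino acid ordering below is an assumption made for this example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return a 20 x len(sequence) matrix (list of rows) with a 1 marking each residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(sequence) for _ in AMINO_ACIDS]
    for position, residue in enumerate(sequence):
        matrix[index[residue]][position] = 1
    return matrix


encoded = one_hot("ACD")
```

Each column contains exactly one 1, in the row of the residue found at that position.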

This class includes a preload argument, which determines whether the dataset is loaded into memory or streamed. It also supports filtering through the filter_func argument, which takes a function that returns a truthy value for each record that should be kept in the dataset. A range of common filters is included in the data module, as well as combine_filters(), which applies all passed filters to each record.
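The filtering contract described above (a callable that returns a truthy value for records to keep) can be sketched without the library. Everything in this block is a hypothetical stand-in: FakeRecord, length_between, all_of and no_unknown_residues are not ProteinNetPy names, and requiring every filter to pass is an assumption about how combined filters behave.

```python
# Stand-in record type for illustration; real records are ProteinNetRecord objects.
class FakeRecord:
    def __init__(self, record_id, sequence):
        self.id = record_id
        self.sequence = sequence

    def __len__(self):
        return len(self.sequence)


def length_between(min_length, max_length):
    """Hypothetical filter factory: keep records whose length is within the bounds."""
    def length_filter(record):
        return min_length <= len(record) <= max_length
    return length_filter


def no_unknown_residues(record):
    """Hypothetical filter: reject sequences containing the unknown residue code X."""
    return "X" not in record.sequence


def all_of(*filters):
    """Hypothetical combinator: keep a record only if every filter is truthy."""
    def combined(record):
        return all(f(record) for f in filters)
    return combined


keep = all_of(length_between(3, 10), no_unknown_residues)
records = [FakeRecord("a", "ACDE"), FakeRecord("b", "AC"), FakeRecord("c", "ACXE")]
kept_ids = [r.id for r in records if keep(r)]
```

A function with the same shape as keep is what would be passed as filter_func.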

Once a dataset has been loaded it can be iterated over to process data. The ProteinNetMap class creates map objects that map a function over the dataset, including options to stream the map on each iteration or pre-calculate results. They have a generate method that creates a generator object yielding the output of the function. The LabeledFunction class is provided to create functions annotated with output types and shapes, used for automatically creating TensorFlow datasets. The mutation module provides some example functions that return mutated records.
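The relationship between an annotated function and a streamed map can be sketched as follows. This is a simplified illustration, not the library's code: Labeled and StreamedMap are hypothetical stand-ins for LabeledFunction and ProteinNetMap, and plain strings stand in for records.

```python
class Labeled:
    """Hypothetical stand-in for a function annotated with output types and shapes."""
    output_types = "int32"
    output_shapes = []  # scalar output

    def __call__(self, record):
        return len(record)


class StreamedMap:
    """Hypothetical stand-in for a map object that re-applies func on every iteration."""
    def __init__(self, data, func):
        self.data = data
        self.func = func

    def generate(self):
        # Yield results lazily, one record at a time
        for record in self.data:
            yield self.func(record)

    def __iter__(self):
        return self.generate()


sequences = ["ACDE", "MK", "GGGSA"]
lengths = list(StreamedMap(sequences, Labeled()))
```

Because the function carries its own output type and shape annotations, code that builds a dataset from the generator can read them from the function instead of asking for them separately.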

The following example code shows a typical simple usage, creating a streamed TensorFlow dataset from ProteinNet data:

from proteinnetpy import data
from proteinnetpy import tfdataset

class MapFunction(data.LabeledFunction):
    """
    Example ProteinNetMap function outputting a one-hot sequence and contact graph input data
    and multiple alignment PSSM labels
    """
    def __init__(self):
        self.output_shapes = (([None, 20], [None, None]), [None, 20])
        self.output_types = (('float32', 'float32'), 'int32')

    def __call__(self, record):
        return (record.get_one_hot_sequence().T, record.distance_matrix()), record.evolutionary.T

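# Keep only records with between 32 and 2000 residues
filter_func = data.make_length_filter(min_length=32, max_length=2000)
# Use a separate name for the dataset so the data module is not shadowed
dataset = data.ProteinNetDataset(path="path/to/proteinnet", preload=False, filter_func=filter_func)
pn_map = data.ProteinNetMap(dataset, map=MapFunction(), static=False, filter_errors=True)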

tf_dataset = tfdataset.proteinnet_tf_dataset(pn_map, batch_size=100, prefetch=400, shuffle_buffer=200)

Many more functions, arguments and uses are available, with detailed descriptions currently found in docstrings. Full documentation will be generated from these for a future release.

Scripts

The package also provides convenience scripts for processing ProteinNet datasets:

  • add_angles_to_proteinnet - Add extra fields to a ProteinNet file with φ, ψ and χ backbone/torsion angles
  • proteinnet_to_fasta - Extract a FASTA file containing the sequences from a ProteinNet file
  • filter_proteinnet - Filter a ProteinNet file to include/exclude records from a list of IDs

Detailed usage instructions for each can be found using the -h argument.
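As a rough illustration of the kind of conversion proteinnet_to_fasta performs, the sketch below extracts FASTA lines from ProteinNet-style text. It is not the script's implementation: proteinnet_to_fasta_lines is a hypothetical helper, the record IDs are made up, and the code assumes each record has [ID] and [PRIMARY] header lines that are each followed by a single value line.

```python
import io


def proteinnet_to_fasta_lines(handle):
    """Yield FASTA lines from text with [ID] and [PRIMARY] headers followed by value lines."""
    record_id = None
    previous = None
    for line in handle:
        line = line.strip()
        if previous == "[ID]":
            record_id = line
        elif previous == "[PRIMARY]" and record_id is not None:
            yield f">{record_id}"
            yield line
            record_id = None
        previous = line


# Two made-up records in the assumed layout
example = io.StringIO("[ID]\n1ABC_1_A\n[PRIMARY]\nMKV\n\n[ID]\n2XYZ_1_B\n[PRIMARY]\nGA\n")
fasta = list(proteinnet_to_fasta_lines(example))
```

For real files, use the provided script, which handles the full record format.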