ChargeNet

Predicting protein variant properties using electrostatic representations

Table of Contents

About The Project
Getting Started
- Prerequisites
- Installation
Usage
License
Contact

About The Project

This repository provides the full implementation of the ChargeNet pipeline described in the preprint Predicting protein variant properties with electrostatic representations. While the original experiments were executed on a proprietary commercial AWS SageMaker account and therefore cannot be shared directly, all code and configuration necessary to reproduce the experiments independently are included here.

Getting Started

Prerequisites

ChargeNet relies on two important tools:

- A mutagenesis tool. This can be either foldx or pymol.
- apbs, software for continuum electrostatics computation.

foldx provides free academic licenses, and pymol can be installed through the schrodinger anaconda channel with:

```
conda install -c conda-forge -c schrodinger pymol-bundle
```

or built from source by following the installation guide.

If you are using foldx, ensure that there is an executable named foldx added to your PATH environment variable and that you have an environment variable named ROTABASE_LOCATION storing a path pointing to a rotabase file named rotabase.txt.

Precompiled binaries of apbs can be found at https://github.com/Electrostatics/apbs/releases. Ensure that you also have apbs added to PATH.
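
The prerequisites above can be checked before starting a long run. The following is a minimal sketch, not part of ChargeNet: the function name `check_prerequisites` is hypothetical, it reads `ROTABASE_LOCATION` as a path to the rotabase file itself, and it assumes that pymol is used as an importable Python module.

```python
import importlib.util
import os
import shutil


def check_prerequisites(mutagenesis_tool="foldx"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # apbs must be on PATH regardless of the mutagenesis tool.
    if shutil.which("apbs") is None:
        problems.append("apbs executable not found on PATH")
    if mutagenesis_tool == "foldx":
        if shutil.which("foldx") is None:
            problems.append("foldx executable not found on PATH")
        rotabase = os.environ.get("ROTABASE_LOCATION")
        if not rotabase:
            problems.append("ROTABASE_LOCATION is not set")
        elif not os.path.isfile(rotabase):
            problems.append(f"ROTABASE_LOCATION does not point to a file: {rotabase}")
    elif mutagenesis_tool == "pymol":
        # Assumption: pymol is available as a Python module in this environment.
        if importlib.util.find_spec("pymol") is None:
            problems.append("pymol Python module not importable")
    else:
        problems.append(f"unknown mutagenesis tool: {mutagenesis_tool}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("foldx"):
        print("missing prerequisite:", problem)
```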

Installation

Clone the repo

```
git clone https://github.com/florisvdf/chargenet.git
```

Install
```
cd chargenet
```
Using uv
```
uv sync
```
Using pip
```
pip install .
```

Usage

```
import pandas as pd
from chargenet.pipelines import ChargeNet

# ChargeNet assumes a column named "split" is present with column values
# "train", "valid" and "test", as well as a column storing the target to predict.
df = pd.read_csv("path/to/my_dataframe.csv")
pipeline = ChargeNet(
    pdb_file_path="path/to/my_structure.pdb",
    reference_sequence="MYREFERENCESEQWENCE",
    target="my_target",
)
# Fit ChargeNet
pipeline.run(df)
# Make predictions
predictions = pipeline.predict(df)
```
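
If your dataframe does not yet have a "split" column, one way to add it is a seeded random assignment. This is only a sketch: the helper name `add_random_split`, the 80/10/10 fractions and the seed are assumptions, and a random split may not suit every benchmark.

```python
import numpy as np
import pandas as pd


def add_random_split(df, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return a copy of df with a "split" column of "train", "valid" or "test"."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))  # shuffled row positions
    n_train = int(round(fractions[0] * len(df)))
    n_valid = int(round(fractions[1] * len(df)))
    labels = np.empty(len(df), dtype=object)
    labels[order[:n_train]] = "train"
    labels[order[n_train:n_train + n_valid]] = "valid"
    labels[order[n_train + n_valid:]] = "test"  # whatever remains
    out = df.copy()
    out["split"] = labels
    return out


if __name__ == "__main__":
    # Toy dataframe; "my_target" mirrors the placeholder used above.
    toy = pd.DataFrame({"my_target": np.linspace(0.0, 1.0, 10)})
    print(add_random_split(toy)["split"].value_counts())
```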

Important

ChargeNet stores electrostatic representations in an ElectrostaticDataset object so that they are not recomputed during inference. As a result, the dataframe passed to predict must be identical to the one passed to run.

Tip

Computing electrostatics is expensive, so it is recommended to run ChargeNet with as many cores as possible by setting the n_cores parameter. For reference, computing electrostatics for a dataset of 3000 variants of 400 residues takes around an hour on 60 cores of an AWS SageMaker ml.g5 instance. If you plan to run multiple training experiments on the same dataset with the same electrostatics configuration, consider saving the computed electrostatics by setting write_electrostatics_path. You can then load them in later runs by setting electrostatics_path, avoiding recomputation.
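
The write-once, load-later pattern for electrostatics can be sketched with a small helper that picks the right keyword argument depending on whether the `.h5` file already exists. The helper name `electrostatics_cache_kwargs` is hypothetical and not part of ChargeNet. The cache is only valid while the dataframe and the electrostatics configuration stay the same.

```python
from pathlib import Path


def electrostatics_cache_kwargs(cache_file):
    """Return the ChargeNet keyword argument for reading or writing a cache.

    If cache_file already exists it is passed as electrostatics_path so the
    stored electrostatics are loaded; otherwise it is passed as
    write_electrostatics_path so they are computed once and saved.
    """
    cache_file = Path(cache_file)
    if cache_file.exists():
        return {"electrostatics_path": str(cache_file)}
    return {"write_electrostatics_path": str(cache_file)}


# Intended use (commented out so the helper runs without chargenet installed):
# from chargenet.pipelines import ChargeNet
# pipeline = ChargeNet(
#     pdb_file_path="path/to/my_structure.pdb",
#     reference_sequence="MYREFERENCESEQWENCE",
#     target="my_target",
#     n_cores=60,
#     **electrostatics_cache_kwargs("path/to/electrostatics.h5"),
# )
```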

ChargeNet Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdb_file_path | str | na | Path to a `.pdb` file containing a structure matching the reference sequence. |
| reference_sequence | str | na | Amino acid sequence of the reference as a single string of single letter amino acid codes. |
| target | str | na | Name of the column in the training data dataframe that stores the target values. |
| mutagenesis_tool | str | foldx | Mutagenesis tool to use. Options: foldx, pymol. |
| use_ph | int | 0 | Compute electrostatics at a pH value stored in an optional column named "ph". 0: no, 1: yes. |
| use_temp | int | 0 | Compute electrostatics at a temperature in Kelvin stored in an optional column named "temperature". 0: no, 1: yes. |
| channel_configuration | str | all | Which electrostatic quantities to compute as input channels to the 3D CNN. Options: charge, charge_density, potential, charge_and_charge_density, charge_and_potential, charge_density_and_potential, all. |
| resolution | float | 1.5 | Resolution in angstrom at which the electrostatic quantities are computed. The inputs to the 3D CNN will have voxel edge lengths of this value. |
| out_channels | int | 13 | Number of 3D Conv layer output channels. |
| n_blocks | int | 1 | Number of 3D Conv residual blocks. |
| kernel_edge_length | int | 3 | Length of all edges of the 3D Conv layer kernel. |
| pooling_edge_length | int | 3 | Length of all edges of the 3D max pooling layer. |
| dropout_rate | float | 0.30 | Dropout value of the dropout layer prior to the final dense layer of the 3D CNN. |
| batch_size | int | 20 | Number of samples passed to the 3D CNN in a single forward pass. |
| epochs | int | 300 | Number of training iterations. |
| learning_rate | float | 1e-4 | Learning rate passed to the optimizer. |
| loss | str | mse | Loss function for training the 3D CNN. Options: mse, mae, huber. |
| optimizer | str | adam | 3D CNN optimizer. Options: adam, adamw. |
| weight_decay | float | 5e-6 | Optimizer weight decay. |
| patience | int | 20 | Early stopping criterion. If the loss doesn't improve after this many iterations, training will stop and the best weights will be restored. |
| device | str | cpu | Device for training and inference. Options: cpu, gpu. |
| n_cores | int | 4 | Number of cores for performing mutagenesis and electrostatics computation. |
| intermediate_data_path | str | None | Data path for storing intermediate outputs (mutated structure, pqr files, etc). By default, a temporary directory will be created. |
| electrostatics_path | str | None | Path to precomputed electrostatics matching the index of the samples in the dataframe. |
| write_electrostatics_path | str | None | Path to which computed electrostatics should be written as a `.h5` file. |
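
Because electrostatics computation is slow, it can help to catch a mistyped option or a missing column before starting a run. The sketch below is a standalone pre-flight check built only from the options listed in the table. The names `validate_options` and `required_columns` are hypothetical, and ChargeNet may perform its own validation.

```python
# Option values copied from the parameter table above.
ALLOWED = {
    "mutagenesis_tool": {"foldx", "pymol"},
    "channel_configuration": {
        "charge", "charge_density", "potential",
        "charge_and_charge_density", "charge_and_potential",
        "charge_density_and_potential", "all",
    },
    "loss": {"mse", "mae", "huber"},
    "optimizer": {"adam", "adamw"},
    "device": {"cpu", "gpu"},
}


def validate_options(**kwargs):
    """Return a list of error messages for option values not listed in the table."""
    errors = []
    for name, value in kwargs.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            errors.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    for flag in ("use_ph", "use_temp"):
        if flag in kwargs and kwargs[flag] not in (0, 1):
            errors.append(f"{flag} must be 0 or 1, got {kwargs[flag]!r}")
    return errors


def required_columns(target, use_ph=0, use_temp=0):
    """List the dataframe columns implied by the chosen parameters."""
    columns = ["split", target]
    if use_ph:
        columns.append("ph")
    if use_temp:
        columns.append("temperature")
    return columns


if __name__ == "__main__":
    print(validate_options(loss="rmse", device="gpu"))
    print(required_columns("my_target", use_ph=1))
```

Before calling run, the result of `required_columns` can be compared against `df.columns` to confirm that nothing is missing.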

License

All original software developed for this project is licensed under the MIT License; you may not use this file except in compliance with the MIT license. You may obtain a copy of the MIT license at: https://mit-license.org

This repository includes third party software, with source code, binary distribution and their corresponding license files located in THIRD_PARTY_SOFTWARE and THIRD_PARTY_LICENSES, respectively.

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the MIT license are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Contact

Floris van der Flier - @florisvdf - floris.vanderflier@wur.nl
