Table of Contents
This repository provides the full implementation of the ChargeNet pipeline described in the preprint Predicting protein variant properties with electrostatic representations. While the original experiments were executed on a proprietary commercial AWS SageMaker account and therefore cannot be shared directly, all code and configuration necessary to reproduce the experiments independently are included here.
ChargeNet relies on two important tools:
- A mutagenesis tool. This can be either
foldx
orpymol
. apbs
, software for continuum electrostatics computation.
foldx
provides free academic licenses, and pymol
can be installed through the schrodinger anaconda channel with:
conda install -c conda-forge -c schrodinger pymol-bundle
or built from source by following the installation guide.
If you are using foldx
, ensure that there is an executable named foldx
added to your PATH
environment variable and that you have an environment variable named ROTABASE_LOCATION
storing a path pointing to a rotabase file named rotabase.txt
.
Precompiled binaries of apbs
can be found at https://github.com/Electrostatics/apbs/releases. Ensure that you also have apbs
added to PATH
.
-
Clone the repo
git clone https://github.com/florisvdf/chargenet.git
-
Install
cd chargenet
Using uv
uv sync
Using pip
pip install .
import pandas as pd
from chargenet.pipelines import ChargeNet
# ChargeNet assumes a column named "split" is present with column values "train", "valid" and "test", as well as a column storing the target to predict.
df = pd.read_csv("path/to/my_dataframe.csv")
pipeline = ChargeNet(
pdb_file_path="path/to/my_structure.pdb",
reference_sequence="MYREFERENCESEQWENCE",
target="my_target",
)
# Fit ChargeNet
pipeline.run(df)
# Make predictions
predictions = pipeline.predict(df)
Important
ChargeNet stores electrostatic representations in an ElectrostaticDataset
object in order to not recompute these representations during inference. This means that one must pass a dataframe to predict
that is identical to the one passed to run
.
Tip
It is recommended to run ChargeNet with as many cores as possible by setting the n_cores
parameter. This is because the computation of electrostatics is expensive. Computing electrostatics for a dataset with 3000 variants of 400 residues
takes around an hour when using 60 cores of an AWS sagemaker ml.g5 instance. If you plan on running multiple training experiments on the same dataset with the same configuration of computing electrostatics, consider saving precomputed electrostatics
of your dataset by setting write_electrostatics_path
. This way, you can later load these again by setting electrostatics_path
, avoiding recomputation.
Parameter | Type | Default | Description |
---|---|---|---|
pdb_file_path | str | na | Path to a .pdb file containing a structure matching the reference sequence. |
reference_sequence | str | na | Amino acid sequence of the reference as a single string of single letter amino acid codes. |
target | str | na | Name of the column in the training data dataframe that stores the target values. |
mutagenesis_tool | str | foldx | Mutagenesis tool to use. Options: foldx, pymol. |
use_ph | int | 0 | Compute electrostatics at a pH value stored in a optional column named "ph". 0: no, 1: yes. |
use_temp | int | 0 | Compute electrostatics at a temperate in Kelvin stored in a optional column named "temperature". 0: no, 1: yes. |
channel_configuration | str | all | Which electrostatic quantities to compute as input channels to the 3D CNN. Options: charge, charge_density, potential, charge_and_charge_density, charge_and_potential, charge_density_and_potential, all. |
resolution | float | 1.5 | Resolution in angstrom at which the electrostatic quantities are computed. The inputs to the 3D CNN will have voxel edge lengths of this value. |
out_channels | int | 13 | Number of 3D Conv layer output channels. |
n_blocks | int | 1 | Number of 3D Conv residual blocks. |
kernel_edge_length | int | 3 | Length of all edges of the 3D Conv layer kernel. |
pooling_edge_length | int | 3 | Length of all edges of the 3D max pooling layer. |
dropout_rate | float | 0.30 | Dropout value of the dropout layer prior to the final dense layer of the 3D CNN. |
batch_size | int | 20 | Number of samples passed to the 3D CNN in a single forward pass. |
epochs | int | 300 | Number of training iterations. |
learning_rate | float | 1e-4 | Learning rate passed to the optimizer. |
loss | str | mse | Loss function for training the 3D CNN. Options: mse, mae, huber. |
optimizer | str | adam | 3D CNN optimizer. Options: adam, adamw. |
weight_decay | float | 5e-6 | Optimizer weight decay. |
patience | int | 20 | Early stopping criterion. If loss doesn't improve after this many iterations, training will stop and best weights will be restored. |
device | str | cpu | Device for training and inference. Options: cpu, gpu. |
n_cores | int | 4 | Number of cores for performing mutagenesis and electrostatics computation. |
intermediate_data_path | str | None | Data path for storing intermediate outputs (mutated structure, pqr files, etc). By default, a temporary directory will be created. |
electrostatics_path | str | None | Path to precomputed electrostatics matching the index of the samples in the dataframe. |
write_electrostatics_path | str | None | Path to which computed electrostatics should be written as a .h5 file. |
Copyright 2025 International Flavors and Fragrances, Wageningen University & Research
All original software developed for this project is licensed under the MIT License; you may not use this file except in compliance with the MIT license. You may obtain a copy of the MIT license at: https://mit-license.org
This repository includes third party software, with source code, binary distribution and their corresponding license files located in THIRD_PARTY_SOFTWARE
and THIRD_PARTY_LICENSES
, respectively.
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the MIT license are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Floris van der Flier - @florisvdf - floris.vanderflier@wur.nl