Biotrainer supports many different configuration options. This file gives an overview of all of them and explains which protocol needs which input files and which options are mandatory. (| denotes an exclusive OR: choose exactly one of the available options.)
Choose a protocol that determines how your provided data is interpreted:
protocol: residue_to_class | residues_to_class | sequence_to_class | sequence_to_value
To use interactions for your sequences, you need to indicate how embeddings of interacting sequences should be combined:
interaction: multiply | concat # Default: None
multiply does an element-wise multiplication of the two embeddings, so torch.mul(emb1, emb2). concat concatenates both embeddings, so torch.concat([emb1, emb2]).
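As a minimal sketch of how the two options combine a pair of embeddings (the embedding dimension of 1024 is only illustrative):

import torch

emb1 = torch.rand(1024)  # embedding of the first interactor
emb2 = torch.rand(1024)  # embedding of the second interactor

multiplied = torch.mul(emb1, emb2)         # element-wise product keeps the dimension: 1024
concatenated = torch.concat([emb1, emb2])  # concatenation doubles the dimension: 2048

print(multiplied.shape, concatenated.shape)  # torch.Size([1024]) torch.Size([2048])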
The seed can also be changed (any positive integer):
seed: 1234 # Default: 42
You can also manually select a device for training:
device: cpu | cuda # Default: Uses cuda if cuda is available, otherwise cpu
The sequence ids associated with the generated or annotated splits can be saved in the output file.
Furthermore, after all training is done, the (best) model is evaluated on the test set.
The predictions it makes are stored in the output file if the following flag is set to True. This behaviour is disabled by default, because the file can get very long for large datasets.
save_split_ids: True | False # Default: False
Sometimes, your sequence or embeddings file might contain more or fewer sequences than your corresponding labels file for a residue-level prediction task. If this is intended, you can tell biotrainer to simply ignore those redundant sequences:
ignore_file_inconsistencies: True | False # Default: False
If True, biotrainer will also not check if you provided at least one sample for each split set.
A concrete output directory can be specified:
output_dir: path/to/output_dir # Default: path/to/config/output
Biotrainer automatically performs sanity checks on your test results. The checks are handled by the validations module. For example, a warning will be logged if the model predicts only one unique value for every entry in the test set. The module also automatically calculates baselines (e.g. predicting only 1 or 0 for all entries in the test set) with the respective metrics for suitable protocols. All baselines are also stored in the output file.
The behaviour is enabled by default but can be switched off:
sanity_check: True | False # Default: True
Keep in mind that this can only provide a first insight into the predictive capabilities of your model; the absence of warnings in the logs does not imply that the results make (biological) sense!
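As a conceptual illustration of what such a baseline means (this is not biotrainer's implementation, and the test labels are hypothetical), predicting a constant class on a binary test set yields the following baseline accuracies:

import numpy as np

# Conceptual illustration of a constant-prediction baseline (not biotrainer's code).
y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1])  # hypothetical binary test labels
all_zeros = np.zeros_like(y_test)            # "predict 0 for everything" baseline
all_ones = np.ones_like(y_test)              # "predict 1 for everything" baseline

print((all_zeros == y_test).mean())          # 0.625: accuracy of the all-zeros baseline
print((all_ones == y_test).mean())           # 0.375: accuracy of the all-ones baseline

A trained model should clearly outperform such trivial baselines on the reported metrics.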
Depending on the protocol that fits your dataset, you have to provide different input files. For the data standards of these files, please refer to the data standardization document.
A sequence file has to be provided for every protocol:
sequences_file: path/to/sequences.fasta
For the residue_to_class protocol, a separate labels file also has to be provided:
labels_file: path/to/labels.fasta
The residue_to_class and residues_to_class protocols also support a mask file, if some residues should be masked during training:
mask_file: path/to/mask.fasta
The mask file must provide a value of 0 or 1 for each amino acid of every sequence. Unresolved residues are not included in class weight calculation.
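For illustration only: the files are matched by sequence id, and each labels or mask entry must have the same length as its sequence. The ids below are hypothetical, and the exact header attributes (e.g. SET=train, VALIDATION=True) are defined in the data standardization document.

sequences_file entry:
>Seq1
SEQWENCE

labels_file entry (one class label per residue):
>Seq1
CDEVVCDD

mask_file entry (1 = include the residue, 0 = mask it):
>Seq1
11111101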
In biotrainer, it is possible to calculate embeddings automatically using bio_embeddings. To do this, only a valid embedder name has to be provided:
embedder_name: prottrans_t5_xl_u50 | esm | esm1b | seqvec | fasttext | word2vec | one_hot_encoding | ...
Take a look at bio_embeddings to find out about all the available embedding methods. A list of all current embedding config options can be found, for example, in this file.
If you want to use your own embedder directly in biotrainer, you can provide it as a python script.
You only need to extend the CustomEmbedder class from the trainers module. Please take a look at the example implementations of ankh and esm-2 in the custom_embedder example. Then, use the embedder_name option to declare the path to your script:
embedder_name: your_script.py # The script path must be relative to the configuration file.
Security measures are in place to prevent the execution of arbitrary code. Still, this might introduce a security risk when running biotrainer as a remote service. Downloading of any custom_embedder source file during execution is therefore disabled. Please only run scripts from sources you trust.
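The sketch below only illustrates the general structure of such a script; the actual interface is defined by the CustomEmbedder class in the trainers module and demonstrated in the ankh and esm-2 examples, so the import path, attribute names and the embed_many signature used here are assumptions that must be adapted accordingly.

# your_script.py -- illustrative sketch only; check the custom_embedder example for the real interface.
import numpy as np
from biotrainer.trainers import CustomEmbedder  # import path assumed ("the trainers module")

class ToyOneHotEmbedder(CustomEmbedder):
    # Attribute and method names below are assumptions about the base class.
    name: str = "toy_one_hot_embedder"
    embedding_dimension: int = 21

    def embed_many(self, sequences, device=None):
        """Yield one per-residue embedding matrix of shape (L, 21) per input sequence."""
        alphabet = "ACDEFGHIKLMNPQRSTVWYX"
        for seq in sequences:
            one_hot = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
            for i, aa in enumerate(seq):
                idx = alphabet.find(aa)
                one_hot[i, idx if idx >= 0 else alphabet.index("X")] = 1.0
            yield one_hot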
It is also possible to provide a custom embeddings file in h5 format (take a look at the examples folder for more information). Please also have a look at the data standardization for the specification requirements of your embeddings.
Either provide a local file:
embeddings_file: path/to/embeddings.h5
You can also download your embeddings directly from a URL:
embeddings_file: ftp://examples.com/embeddings.h5 # Supports http, https, ftp
The file will be downloaded and stored in the path of your config file with prefix "downloaded_".
Note that embedder_name and embeddings_file are mutually exclusive. In case you provide your own embeddings, the experiment directory will be called custom_embeddings.
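As a minimal sketch of how such a file could be produced with h5py (the dataset keys and the original_id attribute used here follow common bio_embeddings-style conventions and are assumptions; verify them against the data standardization document):

import h5py
import numpy as np

# Illustrative sketch only: writing per-residue embeddings (shape L x d) to an h5 file.
embeddings = {
    "Seq1": np.random.rand(8, 1024).astype(np.float32),
    "Seq2": np.random.rand(12, 1024).astype(np.float32),
}

with h5py.File("embeddings.h5", "w") as embeddings_file:
    for index, (sequence_id, embedding) in enumerate(embeddings.items()):
        dataset = embeddings_file.create_dataset(str(index), data=embedding)
        dataset.attrs["original_id"] = sequence_id  # attribute name assumed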
There are multiple options available to specify the model you want to train.
At first, choose the model architecture to be used:
model_choice: FNN | CNN | LogReg | LightAttention # Default: CNN
The available models depend on your chosen protocol.
'residue_to_class': {CNN, FNN, LogReg},
'residues_to_class': {LightAttention},
'sequence_to_class': {FNN, LogReg},
'sequence_to_value': {FNN, LogReg}
Specify an optimizer:
optimizer_choice: adam # Default: adam
and the learning rate (any positive float):
learning_rate: 1e-4 # Default: 1e-3
Specify the loss:
loss_choice: cross_entropy_loss | mean_squared_error
Note that mean_squared_error can only be applied to regression tasks, i.e. x_to_value.
For classification tasks, the loss can also be calculated with class weights:
use_class_weights: True | False # Default: False
Class weights are calculated from the training data only. Masked residues, denoted in a mask file, are likewise excluded from the class weight calculation.
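Putting the model options above together, the model part of a config could for example look like this (the values are illustrative, not recommendations):

model_choice: CNN
optimizer_choice: adam
learning_rate: 1e-4
loss_choice: cross_entropy_loss
use_class_weights: True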
This section describes the available training and dataloader parameters.
You should declare the maximum number of epochs you want to train your model (any positive integer):
num_epochs: 20 # Default: 200
The built-in solver has an early-stop mechanism that prevents the model from over-fitting if the validation loss stops improving for too long. It can be controlled by providing a patience value (any positive integer):
patience: 20 # Default: 10
To provide a threshold that indicates whether the patience should decrease, an epsilon value can be specified (any positive float):
epsilon: 1e-3 # Default: 0.001
Early stop mechanism explanation:
If the current loss is smaller than the previous minimum loss minus the epsilon threshold, a stop count is reset to the patience value indicated above. Otherwise, if the stop count is already 0, early stop is triggered and the best previous model is loaded; if it is still above 0, the stop count is decreased by one. Expressed in code:
def _early_stop(self, current_loss: float, epoch: int) -> bool:
    if current_loss < (self._min_loss - self.epsilon):
        self._min_loss = current_loss
        self._stop_count = self.patience
        # Save best model (overwrite if necessary)
        self._save_checkpoint(epoch)
        return False
    else:
        if self._stop_count == 0:
            # Reload best model
            self.load_checkpoint()
            return True
        else:
            self._stop_count = self._stop_count - 1
            return False
It is also possible to change the batch size (any positive integer):
batch_size: 32 # Default: 128
The dataloader can also shuffle the dataset on each epoch:
shuffle: True | False # Default: True
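Taken together, the training and dataloader part of a config file could for example look like this (values are illustrative):

num_epochs: 100
patience: 15
epsilon: 1e-3
batch_size: 64
shuffle: True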
By default, biotrainer will use the set annotations provided with your sequences. Separating the data this way into train, validation and test sets is commonly known as hold-out cross validation.
cross_validation_config:
method: hold_out # Default: hold_out
This is the default option and specifying it in the config file is optional.
Additionally, biotrainer supports a set of other cross validation methods, which can be configured in the configuration file, as shown in the next sections. We assume knowledge of the methods and only provide the available configuration options. This article provides a more in-depth description of the implemented cross validation methods.
When using k-fold cross validation, sequences annotated with either "SET=train" or "VALIDATION=True" are combined to create the k splits. Biotrainer supports repeated, stratified and nested k-fold cross validation (and combinations of these techniques).
cross_validation_config:
method: k_fold
k: 2 # k >= 2
stratified: True # Default: False
repeat: 3 # Default: 1
The distribution of train to validation samples in the splits is (k-1):1, so three times more train than validation samples for k: 4.
The stratified option can also be used for regression tasks. In this case, the continuous values are converted to bins to calculate the stratified splits.
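As a conceptual sketch of this binning (using numpy and scikit-learn for illustration; this is not biotrainer's exact implementation, and the quartile-based bin edges are an assumption):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustration only: discretize continuous targets into bins so that a
# stratified split can be computed. Bin count and edges are arbitrary here.
targets = np.random.rand(100)  # hypothetical continuous regression targets
bins = np.digitize(targets, np.quantile(targets, [0.25, 0.5, 0.75]))

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(np.zeros_like(targets), bins):
    print(len(train_idx), len(val_idx))  # each fold roughly preserves the bin proportions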
To use nested k-fold cross validation, the hyperparameters that should be optimized with the nested splits must be specified in the config file. This can be done with explicit lists, pythonic list comprehensions, or a range expression. The latter two have to be provided as string literals. The following example shows all three options:
use_class_weights: [True, False] # Explicit list
learning_rate: "[10**-x for x in [2, 3, 4]]" # List comprehension
batch_size: "range(8, 132, 4)" # Range expression
All parameters that change during training can be configured this way to be included in the hyperparameter optimization search. As search methods, random search and grid search have been implemented. A complete nested k-fold cross validation config could look like this:
cross_validation_config:
method: k_fold
k: 3 # k >= 2
stratified: True # Default: False
repeat: 1 # Default: 1
nested: True
nested_k: 2 # nested_k >= 2
search_method: random_search
n_max_evaluations_random: 3 # n_max_evaluations_random >= 2
Note that the total number of trained models will be k * nested_k * n_max_evaluations_random for random_search and k * nested_k * len(possible_grid_combinations) for grid_search, so the training process can get very resource-heavy! The example config above, for instance, already trains 3 * 2 * 3 = 18 models.
Overview of all config options for k-fold cross validation:
cross_validation_config:
method: k_fold
k: 5 # k >= 2
stratified: True | False # Default: False
repeat: 1 # repeat >= 1, Default: 1
nested: True | False # Default: False
nested_k: 3 # nested_k >= 2, only nested
search_method: random_search | grid_search # No default, but must be configured for nested k-fold cv, only nested
n_max_evaluations_random: 3 # n_max_evaluations_random >= 2, only for random_search, only nested
This edge case of k-fold cross validation with (usually) very small validation sets is also implemented in biotrainer. Using it is quite simple:
cross_validation_config:
method: leave_p_out
p: 5 # p >= 1
Note that this might create a very high number of splits and is only recommended for small training and validation sets!
For both cross validation methods, you can also declare which metric to use to choose the best model (or the best hyperparameter combination for nested k-fold cross validation):
cross_validation_config:
method: leave_p_out
choose_by: loss | accuracy | precision | recall | rmse | ... # Default: loss
p: 5 # p >= 1
On clusters, for example, training can get interrupted for numerous reasons. The implemented auto_resume mode makes it possible to re-submit your job without changing anything in the configuration. It will automatically search for the latest available checkpoint in the default directory path. This behaviour is deactivated by default, but you can switch it on if necessary:
auto_resume: True | False # Default: False
This works for any cross validation method. Due to randomness in the computation (especially when using GPU or random layers in the network like dropout), results might differ between originally trained checkpoints and resumed training!
If you are using an already pretrained model and want to continue to train it for more epochs, you can use the following option:
pretrained_model: path/to/model_checkpoint.pt
Note that pretrained_model only works in combination with hold_out cross validation! Biotrainer will then run until early stop is triggered or until num_epochs - num_pretrained_epochs (taken from the model state dict) is reached.
Note that the pretrained_model and auto_resume options are incompatible: use auto_resume if you need to restart the training job multiple times, and pretrained_model if you want to continue training a specific model.
Sometimes it might be useful to check your setup and architecture on only a small sub-sample of your total dataset. Biotrainer can do this automatically for you:
limited_sample_size: 100 # Default: -1, must be > 0 to be applied
Note that this value is applied only to the training dataset, and that embeddings are currently still calculated for all sequences!