Releases: CDDLeiden/QSPRpred
Version 3.2.1
Change Log
From v3.2.0 to v3.2.1
Fixes
- Add variable version to papyrus_filter for consistent version use.
Changes
None.
New Features
None.
Removed Features
None.
Version 3.2.0
Change Log
From v3.1.1 to v3.2.0
Fixes
- Fixed a bug in `ChempropModel` that caused it not to work with missing values in the target column.
Changes
- `calibration_score` is now implemented under the `Metric` class as `CalibrationScore`.
New Features
- Added a range of new metrics: `BEDROC`, `EnrichmentFactor`, `RobustInitialEnhancement`, `Prevalence`, `Sensitivity`, `Specificity`, `PositivePredictivity`, `NegativePredictivity`, `CohenKappa`, `BalancedPositivePredictivity`, `BalancedNegativePredictivity`, `BalancedMatthewsCorrcoeff`, `BalancedCohenKappa`, `KSlope`, `R20`, `KPrimeSlope`, `RPrime20`, `Pearson`, `Spearman`, `Kendall`, `AverageFoldError`, `AbsoluteAverageFoldError`, `PercentageWithinFoldError`.
- Added `MaskedMetric`, which can be wrapped around any metric to mask data points when a target value is missing.
- Added a tutorial on model and data serialization.
- `ApplicabilityDomain` now has a `transform` method that can be used to transform a dataset to a continuous applicability domain score, such as the distance to the nearest neighbor in the training set (an example was added to the tutorials).
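The idea of wrapping a metric so that data points with a missing target are skipped can be sketched in plain Python. This is only an illustration of the masking concept, not the QSPRpred `MaskedMetric` class: the names `masked` and `mean_absolute_error` are hypothetical, and treating both `None` and NaN as "missing" is an assumption.

```python
import math
from typing import Callable, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain metric that knows nothing about missing values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def masked(metric: Callable[[Sequence[float], Sequence[float]], float]):
    """Wrap `metric` so that pairs with a missing target (None or NaN) are skipped."""

    def wrapper(y_true, y_pred) -> float:
        kept = [
            (t, p)
            for t, p in zip(y_true, y_pred)
            if t is not None and not math.isnan(t)
        ]
        if not kept:
            raise ValueError("no data points with a target value")
        true_kept, pred_kept = zip(*kept)
        return metric(true_kept, pred_kept)

    return wrapper


# Hypothetical usage: the second and fourth targets are missing and are ignored.
masked_mae = masked(mean_absolute_error)
score = masked_mae([1.0, float("nan"), 3.0, None], [1.5, 9.9, 2.5, 0.0])
```

Only the two pairs with a known target contribute to `score`, so the predictions made for unlabelled points cannot distort the result.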
Removed Features
None.
Version 3.1.1
Change Log
From v3.0.2 to v3.1.1
Fixes
- Fixed a bug in `QSPRDataset` where property transformations were not applied.
- Fixed a bug where an attached standardizer would be refit when calling `QSPRModel.predictMols` with `use_applicability_domain=True`.
- Fixed random seed not set in `FoldsFromDataSplit.iterFolds` for `ClusterSplit`.
Changes
- Renamed `PandasDataTable.transform` to `PandasDataTable.transformProperties`.
- Moved `imputeProperties`, `dropEmptyProperties` and `hasProperty` from `MoleculeTable` to `PandasDataTable`.
- Removed `getProperties`, `addProperty` and `removeProperty`; use the `PandasDataTable` methods directly.
- Since the way descriptors are saved has changed, this release is incompatible with previous data sets and models. However, these can be easily converted to the new format by adding a prefix with the descriptor set name to the old descriptor tables. Feel free to contact us if you require assistance with this.
- Due to some changes in `rdkit-2023.9.6`, the `add_rdkit` option for molecule tables temporarily might not work. This also affects the current ChemProp integration, which was not adapted to 2.0.0 yet. In order to prevent these issues, QSPRpred now forces rdkit version `rdkit-2023.9.5`, but we will be working on resolving these.
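As a rough illustration of the conversion mentioned above, the sketch below prefixes a list of old descriptor column names with a descriptor set name. The exact prefix format used by QSPRpred is not stated in these notes, so the separator (`_`), the function name and the example set name are all assumptions.

```python
def add_descriptor_set_prefix(columns, set_name, sep="_"):
    """Return column names prefixed with `set_name`; already prefixed names are kept.

    `sep` is an assumed separator; check the format expected by your
    QSPRpred version before converting real descriptor tables.
    """
    prefix = f"{set_name}{sep}"
    return [c if c.startswith(prefix) else prefix + c for c in columns]


# Hypothetical old descriptor table header and descriptor set name.
old_columns = ["0", "1", "2"]
new_columns = add_descriptor_set_prefix(old_columns, "MorganFP")
```

Skipping names that already carry the prefix makes the conversion safe to run more than once on the same table.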
New Features
- Descriptors are now saved with prefixes to indicate the descriptor sets. This reduces the chance of name collisions when using multiple descriptor sets.
- Added new methods to `MoleculeTable` and `QSARDataset` for more fine-grained control of clearing, dropping and restoring of descriptor sets calculated for the dataset.
  - `dropDescriptorSets` will drop descriptors associated with the given descriptor sets.
  - `dropDescriptors` will drop individual descriptors associated with the given descriptor sets and properties.
  - All drop actions are restorable with `restoreDescriptorSets` unless explicitly cleared from the data set with the `clear` parameter of `dropDescriptorSets`.
- Added a proper API for parallelization backend selection and configuration (see documentation of `ParallelGenerator` and `JITParallelGenerator` for more information).
- Clusters can now be added to a `MoleculeTable` with `addClusters` and retrieved with `getClusters`, similar to scaffolds.
Removed Features
- removed support for PyBoost since the project was abandoned by the original developers and is no longer maintained
Version 3.0.2
Change Log
From v3.0.1 to v3.0.2
Fixes
- Fixed a bug where an attached standardizer would be refit when calling `QSPRModel.predictMols` with `use_applicability_domain=True`.
- Fixed a bug with `use_applicability_domain=True` in `QSPRModel.predictMols` where an error would be raised if there were invalid molecules in the input.
- Fixed a bug where dataset type was not properly set to numeric in `MlChemADWrapper.contains`.
- Fixed a bug in `QSPRDataset` where property transformations were not applied.
- Fixed random seed not set in `FoldsFromDataSplit.iterFolds` for `ClusterSplit`.
- Fixed a bug where class ratios were shuffled in the `RatioDistributionAlgorithm`.
Changes
- The module containing the sole model base class (`QSPRModel`) was renamed from `models` to `model`.
- Restrictions on `numpy` versions were removed to allow for more flexibility in package installations. However, the `BorutaFilter` feature selection method does not function with `numpy` versions 1.24.0 and above. Therefore, this functionality now requires a downgrade to `numpy` version 1.23.0 or lower. This was reflected in the documentation and `numpy` itself outputs a reasonable error message if the version is incompatible.
- Data type in `MlChemADWrapper` is now set to `float64` by default, instead of `float32`.
- Saving of models after hyperparameter optimization was improved to ensure parameters are always propagated to the underlying estimator as well.
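The `numpy` constraint for `BorutaFilter` can be checked up front with a small version comparison. The helper below is a hypothetical sketch that works on version strings only; it treats any version below 1.24.0 as usable, which follows the "1.24.0 and above" statement and is an assumption with respect to patch releases between 1.23.0 and 1.24.0.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading numeric components, e.g. '1.24.0rc1' -> (1, 24, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def boruta_supported(numpy_version: str) -> bool:
    """True if the given numpy version is below 1.24.0."""
    return version_tuple(numpy_version) < (1, 24, 0)
```

In practice the string would come from `numpy.__version__`; it is passed in explicitly here so the check itself has no dependency on `numpy`.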
New Features
- The `DataFrameDescriptorSet` class was extended to allow more flexibility when joining custom descriptor sets.
- Added the `prepMols` method to `DescriptorSet` to allow separated customization of molecule preparation before descriptor calculation.
- The package can now be installed from the PyPI repository 🐍📦.
- New argument (`refit_optimal`) was added to the `HyperparameterOptimization.optimize()` method to make refitting of the model with optimal parameters easier.
Removed Features
None.
v3.0.1
Change Log
From v3.0.0 to v3.0.1
Fixes
- Fixed a bug in `QSPRDataset` where property transformations were not applied.
Changes
- Renamed `PandasDataTable.transform` to `PandasDataTable.transformProperties`.
- Moved `imputeProperties`, `dropEmptyProperties` and `hasProperty` from `MoleculeTable` to `PandasDataTable`.
- Removed `getProperties`, `addProperty` and `removeProperty`; use the `PandasDataTable` methods directly.
New Features
Removed Features
v3.0.0
Change Log
From v2.1.1 to v3.0.0
Fixes
- Fixed random seeds to give reproducible results. Each dataset is initialized with a single random state (either from the constructor or a random number generator) which is used in all subsequent random operations. Each model is initialized with a single random state as well: it uses the random state from the dataset, unless it's overridden in the constructor. When a dataset is saved to a file so is its random state, which is used by the dataset when the dataset is reloaded.
- Fixed error with serialization of the `DNNModel.params` attribute, when no parameters are set.
- Fixed bug with saving predictions from classification model when `ModelAssessor.useProba` is set to `False`.
- Added missing implementation of `QSPRDataset.removeProperty`.
- Improved behavior of the Papyrus data source (does not attempt to connect to the internet if the data set already exists).
- It is now possible to define new descriptor sets outside the package without errors.
- Basic consistency of models is also checked in the unit test suite, including in the `qsprpred.extra` package.
- Fixed a problem with feature standardizer being retrained on prediction data when a prediction from SMILES was invoked. This affected all versions of the package higher or equal to `v2.1.0`.
- Fixes to the `fromMolTable` method in various data set implementations, in particular in copying of the feature standardizer and other settings.
- Fixed not working `cluster` split and `--imputation` from `data_CLI.py`.
- Fixed a problem with `ProteinDescriptorSet.getDescriptors` returning descriptors in wrong order with `Pandas <v2.2.0`.
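The random state handling described in the first fix can be illustrated with a toy example. `ToyDataset` and `ToyModel` are hypothetical stand-ins written for this sketch, not QSPRpred classes; they only show a single seed being stored on the data set, reused for every random operation, and inherited by the model unless overridden.

```python
import random
from typing import Optional


class ToyDataset:
    def __init__(self, random_state: Optional[int] = None):
        # Use the given seed, or draw one so that it can be stored and reused.
        if random_state is None:
            random_state = random.randrange(2**32)
        self.random_state = random_state

    def shuffled_indices(self, n: int) -> list[int]:
        # Every random operation is derived from the stored state.
        rng = random.Random(self.random_state)
        indices = list(range(n))
        rng.shuffle(indices)
        return indices


class ToyModel:
    def __init__(self, dataset: ToyDataset, random_state: Optional[int] = None):
        # Inherit the data set's state unless one is passed explicitly.
        self.random_state = (
            dataset.random_state if random_state is None else random_state
        )


dataset = ToyDataset(random_state=42)
model = ToyModel(dataset)
overridden = ToyModel(dataset, random_state=7)
```

Because the seed is an attribute of the data set, saving it alongside the data is enough to reproduce the same shuffles after reloading.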
Changes
- The model is now independent of data sets. This means that the model no longer contains a reference to the data set it was trained on.
  - The `fitAttached` method was replaced with `fitDataset`, which takes the data set as an argument.
  - Assessors now also accept a data set as a second argument. Therefore, the same assessor can be used to assess different data sets with the same model settings.
  - The monitoring API was also slightly modified to reflect this change.
  - If a model requires initialization of some settings from data, this can be done in its `initFromDataset` method, which takes the data set as an argument. This method is called automatically before fitting, model assessment, and hyperparameter optimization.
- The whole package was refactored to simplify certain commonly used imports. The tutorial code was adjusted to reflect that.
- The jupyter notebooks in the tutorial now pass a random state to ensure consistent results.
- The default parameter values for `STFullyConnected` have changed from `n_epochs` = 1000 to `n_epochs` = 100, from `neurons_h1` = 4000 to `neurons_h1` = 256 and from `neurons_hx` = 1000 to `neurons_hx` = 128.
- Renamed `HyperParameterOptimization` to `HyperparameterOptimization`.
- `TargetProperty.fromList` and `TargetProperty.fromDict` now accept both a string and a `TargetTask` as the `task` argument, without having to set the `task_from_str` argument, which is now deprecated.
- Made `EarlyStopping.mode` flexible for `QSPRModel.fitDataset`.
- `save_params` argument added to `OptunaOptimization` to save the best hyperparameters to the model (default: `True`).
- We now use `jsonpickle` for object serialization, which is more flexible than the non-standard approach before, but it also means previous models will not be compatible with this version.
- `SklearnMetric` was renamed to `SklearnMetrics`; it now also accepts a scikit-learn scorer name as input.
- `QSPRModel.fitDataset` now accepts a `save_model` (default: `True`) and a `save_dataset` (default: `False`) argument to save the model and dataset to a file after fitting.
- Tutorials were completely rewritten and expanded. They can now be found in the `tutorials` folder instead of the `tutorial` folder.
- `MetricsPlot` now supports multi-class and multi-task classification models.
- `CorrelationPlot` now supports multi-task regression models.
- The behaviour of `QSPRDataset` was changed with regards to target properties. It now remembers the original state of any target property and all changes are performed in place on the original property column (e.g. conversion to multi-class classification). This is to always maintain the same property name and always have the option to reset it to the raw original state (e.g. if we switch to regression or want to repeat a transformation).
- The default log level for the package was changed from `INFO` to `WARNING`. A new tutorial was added to explain how to change the log level.
- `RepeatsFilter` argument `year_name` renamed to `time_col` and argument `additional_cols` added.
- The `perc` argument of `BorutaPy` can now be set from the CLI.
- Descriptor calculators (previously used to aggregate and manage descriptor sets) were completely removed from the API and descriptor sets can now be added directly to the molecule tables.
- The rdkit-like descriptor and fingerprint retrieval functions were removed from the API because they complicated implementation of customized descriptors.
- The `apply` method was simplified and a new API was clearly defined for parallel processing of properties over data sets. To improve molecule processing, a `processMols` method was added to `MoleculeTable`.
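For the log level change, the standard library `logging` module is usually enough to restore more verbose output. The snippet below is a generic sketch: the logger name `"qsprpred"` is an assumption (packages commonly name their loggers after the package), and the tutorial mentioned above remains the reference for the supported way to do this.

```python
import logging

# Assumed logger name; verify it against the log level tutorial.
logger = logging.getLogger("qsprpred")

# Restore the previous, more verbose default level.
logger.setLevel(logging.INFO)
```

Setting the level on a named logger leaves the root logger and other libraries untouched.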
New Features
- The `qsprpred.benchmarks` module was added, which contains functions to easily benchmark models on datasets.
- Most unit tests now have a variant that checks whether using a fixed random seed gives reproducible results.
- The build pipeline now contains a check that the jupyter notebooks give the same results as ones that were observed before.
- Added `FitMonitor`, `AssessorMonitor`, and `HyperparameterOptimizationMonitor` base classes to monitor the progress of fitting, assessing, and hyperparameter optimization, respectively.
- Added `BaseMonitor` class to internally keep track of the progress of a fitting, assessing, or hyperparameter optimization process.
- Added `FileMonitor` class to save the progress of a fitting, assessing, or hyperparameter optimization process to files.
- Added `WandBMonitor` class to save the progress of a fitting, assessing, or hyperparameter optimization process to Weights & Biases.
- Added `NullMonitor` class to ignore the progress of a fitting, assessing, or hyperparameter optimization process.
- Added `ListMonitor` class to combine multiple monitors.
- Cross-validation, testing, hyperparameter optimization and early-stopping were made more flexible by allowing custom splitting and fold generation strategies. A tutorial showcasing these features was created.
- Added a `reset` method to `QSPRDataset`, which resets splits and loads all descriptors into the training set matrix again.
- Added `ConfusionMatrixPlot` to plot confusion matrices.
- Added `searchWithIndex`, `searchOnProperty`, `searchWithSMARTS` and `sample` to `MoleculeTable` to facilitate more advanced sampling from data.
- Assessors now have the `split_multitask_scores` flag that can be used to evaluate each task separately with single-task metrics.
- `MoleculeDataSet`s now have the `smiles` property to easily get smiles.
- A Docker-based runner in `testing/runner` can now be used to test GPU-enabled features and run the full CI pipeline.
- It is now possible to save `PandasDataTable`s to a CSV file instead of the default pickle format (slower, but more human-readable).
- New `RegressionPlot` class `WilliamsPlot` added to plot Williams plots.
- Data sets can now be optionally stored in the `csv` format and not just as a pickle file. This makes it easier to debug and share data sets, but it is slower to load and save.
- Added `ApplicabilityDomain` class to calculate applicability domain and filter outliers from test sets.
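Combining several monitors into one, as `ListMonitor` does, is an instance of the composite pattern. The sketch below uses hypothetical classes and a hypothetical `on_event` method; it does not reproduce the QSPRpred monitor interface and only shows how a single object can forward each notification to many monitors.

```python
class RecordingMonitor:
    """Hypothetical monitor that stores every event it receives."""

    def __init__(self):
        self.events = []

    def on_event(self, name, **info):
        self.events.append((name, info))


class SilentMonitor:
    """Hypothetical monitor that ignores all events."""

    def on_event(self, name, **info):
        pass


class CompositeMonitor:
    """Forwards each event to all wrapped monitors, in order."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def on_event(self, name, **info):
        for monitor in self.monitors:
            monitor.on_event(name, **info)


first, second = RecordingMonitor(), RecordingMonitor()
combined = CompositeMonitor([first, SilentMonitor(), second])
combined.on_event("fit_start", fold=0)
combined.on_event("fit_end", fold=0, score=0.5)
```

Code that reports progress only needs to know about one monitor object, regardless of how many destinations (files, dashboards, nothing at all) are attached.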
Removed Features
- The `Metric` interface has been simplified in order to make it easier to implement custom metrics. The `Metric` interface now only requires the implementation of the `__call__` method, which takes predictions and returns a `float`. The `Metric` interface no longer requires the implementation of `needsDiscreteToScore`, `needsProbaToScore` and `supportsTask`. However, this means the base functionality of `checkMetricCompatibility`, `isClassificationMetric` and `isRegressionMetric` is no longer available.
- The default hyperparameter search space file is no longer available.
v2.1.1
Change Log
From v2.1.0 to v2.1.1
Fixes
- ⚠️ Important! ⚠️ Fixed bug in `predictMols` where the `feature_standardizer` was not being applied to the calculated features. This bug was introduced in v2.1.0. Models trained with v2.1.0 are compatible with v2.1.1; make sure to update QSPRpred to v2.1.1 to ensure that the `feature_standardizer` is applied when predicting on new molecules.
Changes
New Features
Removed Features
v2.1.0
Change Log
From v2.0.1 to v2.1.0.a2
Fixes
- Fixed error with serialization of the `DataFrameDescriptorSet` (#63).
- Papyrus descriptors are not fetched by default anymore from the `Papyrus` adapter, which caused fetching of unnecessary data.
- A potential bug in new version of pandas broke scaffold generation so a workaround was implemented.
Changes
- `QSPRModel.evaluate` moved to a separate class `EvaluationMethod` in `qsprpred.models.interfaces`, with subclasses for cross-validation and making predictions on a test set in `qsprpred.models.evaluation_methods` (`CrossValidation` and `EvaluateTestSetPerformance` respectively).
- `QSPRModel` attribute `scoreFunc` is removed.
- 'qspr/models' is no longer added to the output path of `QSPRModel.save`, allowing for complete control over the output path.
- `SKlearnMetrics.supportsTask` now uses a dictionary like `dict[ModelTasks, list[str]]` to map tasks to supported metric names. (#53)
- `GBMTRandomSplit` and `ScaffoldSplit` now use the `GBMTDataSplit` to create balanced splits. `RandomSplit` still functions the same way as a completely random test split.
- `PCMSplit` replaces `StratifiedPerTarget` and is compatible with `RandomSplit`, `ScaffoldSplit` and `ClusterSplit`.
- `DuplicatesFilter` refactored to `RepeatsFilter`, as it also captures scenarios where triplicates/quadruplicates are found in the dataset. These scenarios are now also covered by the respective UnitTest.
- The versioning scheme of development snapshots has changed from `devX` to `alphaX`/`betaX`, where `X` is an integer that increments with each release.
- The following model classes have been renamed and moved:
  - `models.models.QSPRsklearn` > `models.sklearn.SklearnModel`
  - `deep.models.QSPRDNN` > `extra.gpu.models.dnn.DNNModel`
  - `extra.models.pcm.ModelPCM` > `extra.models.pcm.PCMModel`
  - `extra.models.pcm.QSPRsklearnPCM` > `extra.models.pcm.SklearnPCMModel`
- The command line interface modules now use input and output file paths instead of automatically placing all files in a subfolder `qspr`, allowing for more control over the output and input paths.
New Features
- `GBMTDataSplit` - parent class to create globally balanced splits with the gbmt-split package.
- `ClusterSplit` - splits data based on clustering of molecular fingerprints (uses `GBMTDataSplit`).
- Raise error if search space for optuna optimization is missing search space type annotation or if type not in list.
- When installing package with pip, the commit hash and date of the installation is saved into `qsprpred._version`.
- `HyperParameterOptimization` classes now accept an `evaluation_method` argument, which is an instance of `EvaluationMethod` (see above). This allows for hyperparameter optimization to be performed on a test set, or on a cross-validation set. (#11)
- `HyperParameterOptimization` now accepts a `score_aggregation` argument, which is a function that takes a list of scores and returns a single score. This allows for the use of different aggregation functions, such as `np.mean` or `np.median`, to combine scores from different folds. (#45)
- A new tutorial `adding_new_components.ipynb` has been added to the `tutorials` folder, which demonstrates how to add a new model to QSPRpred.
- A new function `Metrics.checkMetricCompatibility` has been added, which checks if a metric is compatible with a given task and a given prediction method (i.e. `predict` or `predictProba`).
- In `EvaluationMethod` (see above), an attribute `use_proba` has been added, which determines whether the `predict` or `predictProba` method is used to make predictions (#56).
- Add new descriptorset `SmilesDesc` to use the smiles strings as a descriptor.
- New module `early_stopping` with classes `EarlyStopping` and `EarlyStoppingMode` has been added. This module allows for more control over early stopping in models that support it.
- Refactoring of the test suite under `qsprpred.data` and improvement of temporary file handling (!114).
- `PyBoostModel` - QSPRpred wrapper for py-boost models. Requires optional `pyboost` dependencies.
- `ChempropModel` - QSPRpred wrapper for Chemprop models. Requires optional `deep` dependencies.
- The `data_CLI` argument `--log_transform` (`-lt`) has been changed to `--transform_data` (`-t`), which now accepts a number of transformations to apply to the target data. Available transformations are `log`, `log10`, `log2`, `sqrt`, `cbrt`, `exp`, `exp2`, `exp10`, `square`, `cube`, `reciprocal`.
- New `data_CLI`, `model_CLI` and `predict_CLI` argument `--skip_backup` (`-sb`) to skip the backup of the output files. WARNING: This will overwrite existing files.
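The `score_aggregation` idea (a function that maps a list of per-fold scores to one score) can be sketched without any QSPRpred code. Here `statistics.mean` and `statistics.median` from the standard library stand in for `np.mean` and `np.median`; the `aggregate` helper and the fold scores are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Sequence


def aggregate(
    scores: Sequence[float],
    score_aggregation: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Collapse per-fold scores into a single value with the given function."""
    return score_aggregation(scores)


# Hypothetical scores from five cross-validation folds, one of them an outlier.
fold_scores = [0.5, 0.75, 1.0, 0.25, 0.5]
mean_score = aggregate(fold_scores)
median_score = aggregate(fold_scores, score_aggregation=median)
```

The median is less sensitive to a single unusually good or bad fold, which is one reason to make the aggregation function configurable.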
Removed Features
- `StratifiedPerTarget` is replaced by `PCMSplit`.
v2.0.1
Change Log
From v2.0.0 to v2.0.1
Fixes
- The required Python version in pyproject.toml was updated to 3.10, as older versions of Python don't support the type hinting used in the code.
- Corrected type hinting for `QSPRModel.handleInvalidsInPredictions`, which resulted in an error when importing the package in Google Colab.
- The `predictMols` method returned random predictions in v2.0.0 due to unpatched shuffling code. This has now been fixed.
Changes
New Features
- raise error if search space for optuna optimization is missing search space type annotation or if type not in list
v2.0.0
Change Log
From v1.3.1 to v2.0.0
Fixes
- More robust error handling of invalid molecules in `MoleculeTable`.
- Not all scorers in `supported_scoring` were actually working in the multi-class case; the scorer support is now divided by single and multiclass support (moved to `metrics.py`, see also New Features).
- Instead of all smiles, only invalid smiles are now printed to the log when they are removed.
- Problems with PaDEL descriptors and fingerprints on Linux were fixed.
- Fixed serialization issues with `DataFrameDescriptorSet` and saving and loading of MSA for PCM descriptor calculations.
- The Papyrus adapter was fixed so that the quality and data set filtering options work properly (before only high quality Papyrus++ data was fetched no matter the options).
- Previously, in some cases cross-validation splits might not have been shuffled during hyperparameter optimization and evaluation on cross-validation folds (this might have resulted in suboptimal cross-validation performance and bad choices of hyperparameters); a fix was made in b029e78.
- score_func can now be set in `QSPRModel`.
Changes
- Hyperparameter optimization moved to a separate class from `QSPRModel.bayesOptimization` and `QSPRModel.gridSearch` to `OptunaOptimization` and `GridSearchOptimization` in the new module `qsprpred.models.param_optimzation` with a base class `HyperParameterOptimization` in `qsprpred.models.interfaces`.
- ⚠️ Important! ⚠️ `QSPRModel` attribute `model` is now called `estimator`, which is always an instance of `alg`, while `alg` may no longer be an instance but only a Type.
- Converting input data for `qsprpred.models.neural_network.Base` to dataloaders is now executed in the `fit` and `predict` functions instead of in the `qsprpred.deep.models.QSPRDNN` class.
- `MoleculeTable` now uses a custom index. When a `MoleculeTable` is created a new column (`QSPRID`) is added (overwritten if already present), which is then used as the index of the underlying data frame.
  - It is possible to override this with a custom index by passing `index_cols` to the `MoleculeTable` constructor. These columns will be then used as index or a multi-index if more than one column is passed.
  - Due to this change, `scaffoldsplit` now uses these IDs instead of unreliable SMILES strings (see documentation for the new API).
- If there are invalid molecules in `MoleculeTable`, `addDescriptors` now fails by default. You can disable this by passing `fail_on_invalid=False` to the method.
- To support multitask modelling, the representation of the target in the `QSPRdataset` has changed to a list of `TargetProperty`s (see New Features). These can be automatically initialized from dictionaries in the `QSPRdataset` init.
- A `fill_value` argument was also added to the `predict_CLI` script to allow for filling missing values in the prediction data set as well.
- ⚠️ Important! ⚠️ `setup.py` and `setup.cfg` were substituted with `pyproject.toml` and `MANIFEST.in`. A lighter version of the package is now the default installation option!!!
  - Installation options for the optional dependencies are described in README.md
  - CI scripts were modified to test the package on the full version. See changes in `.gitlab-ci.yml`.
  - Features using the extra dependencies were moved to `qsprpred.extra` and `qsprpred.deep` subpackages. The structure of the subpackages is the same as of the main package, so you just need to remember to use `qsprpred.extra` or `qsprpred.deep` instead of just `qsprpred` in your imports if you were using these features from the main package before.
- The way descriptors are stored in `MoleculeTable` was changed. They now reside in their own `DescriptorTable` instances that are linked to the original `MoleculeTable`.
  - This change was made to allow several types of descriptors to be calculated and used efficiently (facilitated by the `DescriptorsCalculators` interface).
  - Unfortunately, this change is not backwards compatible, so previously pickled `MoleculeTable` instances will not work with this version. There were also changes to how models handle multiple descriptor types, which also makes them incompatible with previous versions. However, this can be fixed by modifying the old JSON files as illustrated in commits 7d3f863 and 6564f02.
- `LowVarianceFilter` now includes the boundary in the filtered features, e.g. if the threshold is 0.1, features that have a variance of exactly 0.1 will also be removed.
- Added the ExtendedValenceSignature molecular descriptor based on Jean-Loup Faulon's work.
- Removed the default parameter setting `max_iter` = 10000 for scikit-learn SVC and SVR.
- Added `matthews_corrcoef` to the supported metrics for binary classification.
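The inclusive boundary of the low-variance filter described above can be shown with a few columns of toy data. This is a minimal sketch rather than the `LowVarianceFilter` implementation: the function name is hypothetical, and using the population variance (`statistics.pvariance`) is an assumption.

```python
from statistics import pvariance


def low_variance_features(table: dict[str, list[float]], threshold: float) -> list[str]:
    """Names of features whose variance is at or below `threshold` (boundary included)."""
    return [name for name, values in table.items() if pvariance(values) <= threshold]


# Toy feature table: variances are 0.25, 1.0 and 0.0 respectively.
features = {
    "on_boundary": [0.0, 1.0, 0.0, 1.0],
    "informative": [0.0, 2.0, 0.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
}
to_remove = low_variance_features(features, threshold=0.25)
```

With a strict comparison (`<`) the `on_boundary` feature would be kept; the `<=` comparison is what makes the boundary value part of the removed set.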
New Features
- New feature split `ManualSplit` for splitting data by a user-defined column.
- The index of the `MoleculeTable` can now be used to relate cross-validation and test outputs to the original molecules. Therefore, the index is now also saved in the model training outputs.
- The `Papyrus.getData()` method now accepts an `activity_types` parameter to select a list of activity types to get.
- Added the `checkMols` method to `MoleculeTable` to use for indication of invalid molecules in the data.
- Support for Sklearn Multitask modelling.
- New abstract base class `Metric`, which allows for creating custom scorers.
- Subclass `SklearnMetric` of the `Metric` class. This class wraps the sklearn metrics to allow for checking the compatibility of each Sklearn scoring function with the `QSPRSklearn` model type.
- New class `TargetProperty`. To allow for multitask modelling, a `QSPRdataset` has to have the option of multiple target properties. To support this, a target property is now defined separately from the dataset as a `TargetProperty` instance, which holds the information on name, `TargetTask` (see also Changes) and threshold of the property.
- Support for protein descriptors and PCM modeling was added.
  - The `PCMDataSet` class was introduced that extends `QSPRDataset` and adds the `addProteinDescriptors` method, which can be used to calculate protein descriptors by linking information from the table with sequencing data.
- Support for precalculated descriptors was added with the `addCustomDescriptors` method of `MoleculeTable`.
  - It allows for adding precalculated descriptors to the `MoleculeTable` by linking the information from the table with external precalculated descriptors.
- The tutorial was improved with more detailed sections on data preparation and PCM modelling added.
- We agreed on and adopted a style guide for contributions to the package. This is described and exemplified in `docs/style_guide.py`. This is also supported by several development tools that were configured to check and automatically format the code. Instructions are included in the example file as well.
- Style guide implemented. As a consequence, some classes, methods, and attributes were renamed to comply with the style guide. Some changes were:
  - Fingerprint functions from RDKit are now also implemented. For checking the available fingerprints in qsprpred, the user can now access AVAIL_FPS through `from qsprpred.data.utils.descriptor_utils.fingerprints import AVAIL_FPS`.
  - `Fingerprint` abstract base class now moved to `qsprpred.data.utils.descriptor_utils.interfaces`.
  - Instance attributes are now written in camelCase, and method arguments are snake_case. As an example of this, the old parameter `descsets` from `MoleculeDescriptorsCalculator` is now renamed as `desc_sets`, stored as the attribute `self.descSets`. Functions are still written in snake_case.
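Splitting by a user-defined column, the idea behind `ManualSplit`, amounts to reading each row's label and sorting its index into the matching subset. The sketch below is not the `ManualSplit` API: the function name, the column name `split` and the labels `train`/`test` are hypothetical.

```python
def split_by_column(rows, column, train_value="train", test_value="test"):
    """Return (train_indices, test_indices) based on the label stored in `column`."""
    train = [i for i, row in enumerate(rows) if row[column] == train_value]
    test = [i for i, row in enumerate(rows) if row[column] == test_value]
    return train, test


# Hypothetical data set in which the user has pre-assigned each molecule to a subset.
molecules = [
    {"smiles": "CCO", "split": "train"},
    {"smiles": "c1ccccc1", "split": "test"},
    {"smiles": "CC(=O)O", "split": "train"},
    {"smiles": "CCN", "split": "test"},
]
train_idx, test_idx = split_by_column(molecules, "split")
```

Rows whose label matches neither value are left out of both subsets, which keeps accidental labels from silently ending up in the training data.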