v3.0.0
Change Log
From v2.1.1 to v3.0.0
Fixes
- Fixed random seeds to give reproducible results. Each dataset is initialized with a
single random state (either from the constructor or a random number generator) which
is used in all subsequent random operations. Each model is initialized with a single
random state as well: it uses the random state from the dataset, unless it's overridden
in the constructor. When a dataset is saved to a file so is its random state, which is
used by the dataset when the dataset is reloaded.
- Fixed an error with serialization of the DNNModel.params attribute when no parameters
are set.
- Fixed a bug with saving predictions from classification models when
ModelAssessor.useProba is set to False.
- Added the missing implementation of QSPRDataset.removeProperty.
- Improved behavior of the Papyrus data source (does not attempt to connect to the
internet if the data set already exists).
- It is now possible to define new descriptor sets outside the package without errors.
- Basic consistency of models is also checked in the unit test suite, including in
the qsprpred.extra package.
- Fixed a problem with the feature standardizer being retrained on prediction data when a
prediction from SMILES was invoked. This affected all versions of the package higher
than or equal to v2.1.0.
- Fixes to the fromMolTable method in various data set implementations, in particular
in copying of the feature standardizer and other settings.
- Fixed the non-working cluster split and --imputation in data_CLI.py.
- Fixed a problem with ProteinDescriptorSet.getDescriptors returning descriptors in
the wrong order with Pandas < v2.2.0.
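The seeding rule from the first fix can be illustrated with a minimal, self-contained sketch. It does not use the package itself: ToyDataset, ToyModel, shuffled, to_json and from_json are hypothetical names invented for this example, and the JSON round trip only stands in for whatever file format the package actually uses.

```python
import json
import random


class ToyDataset:
    """Hypothetical stand-in for a data set that owns a single random state."""

    def __init__(self, values, random_state=None):
        # Take the seed from the constructor, or draw one from a generator.
        if random_state is None:
            random_state = random.SystemRandom().randrange(2**32)
        self.random_state = random_state
        self.values = list(values)

    def shuffled(self):
        # Every random operation is derived from the stored state.
        rng = random.Random(self.random_state)
        result = self.values[:]
        rng.shuffle(result)
        return result

    def to_json(self):
        # The random state is saved together with the data.
        return json.dumps(
            {"values": self.values, "random_state": self.random_state}
        )

    @classmethod
    def from_json(cls, text):
        # Reloading restores the saved random state.
        payload = json.loads(text)
        return cls(payload["values"], random_state=payload["random_state"])


class ToyModel:
    """Hypothetical model that inherits the data set seed unless overridden."""

    def __init__(self, dataset, random_state=None):
        self.random_state = (
            dataset.random_state if random_state is None else random_state
        )
```

In this sketch a reloaded data set shuffles exactly like the original, and a model only departs from the data set seed when one is passed explicitly.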
Changes
- The model is now independent of data sets. This means that the model no longer
contains a reference to the data set it was trained on.
  - The fitAttached method was replaced with fitDataset, which takes the data set as
    an argument.
  - Assessors now also accept a data set as a second argument. Therefore, the same
    assessor can be used to assess different data sets with the same model settings.
  - The monitoring API was also slightly modified to reflect this change.
  - If a model requires initialization of some settings from data, this can be done in
    its initFromDataset method, which takes the data set as an argument. This method
    is called automatically before fitting, model assessment, and hyperparameter
    optimization.
- The whole package was refactored to simplify certain commonly used imports. The
tutorial code was adjusted to reflect that.
- The Jupyter notebooks in the tutorial now pass a random state to ensure consistent
results.
- The default parameter values for STFullyConnected have changed from n_epochs = 1000
to n_epochs = 100, from neurons_h1 = 4000 to neurons_h1 = 256 and from
neurons_hx = 1000 to neurons_hx = 128.
- Renamed HyperParameterOptimization to HyperparameterOptimization.
- TargetProperty.fromList and TargetProperty.fromDict now accept both a string and
a TargetTask as the task argument, without having to set the task_from_str argument,
which is now deprecated.
- Made EarlyStopping.mode flexible for QSPRModel.fitDataset.
- Added the save_params argument to OptunaOptimization to save the best hyperparameters
to the model (default: True).
- We now use jsonpickle for object serialization, which is more flexible than the
non-standard approach before, but it also means previous models will not be compatible
with this version.
- SklearnMetric was renamed to SklearnMetrics; it now also accepts a scikit-learn
scorer name as input.
- QSPRModel.fitDataset now accepts a save_model (default: True) and a save_dataset
(default: False) argument to save the model and dataset to a file after fitting.
- Tutorials were completely rewritten and expanded. They can now be found in
the tutorials folder instead of the tutorial folder.
- MetricsPlot now supports multi-class and multi-task classification models.
- CorrelationPlot now supports multi-task regression models.
- The behaviour of QSPRDataset was changed with regards to target properties. It now
remembers the original state of any target property and all changes are performed in
place on the original property column (e.g. conversion to multi-class classification).
This is to always maintain the same property name and always have the option to reset
it to the raw original state (e.g. if we switch to regression or want to repeat a
transformation).
- The default log level for the package was changed from INFO to WARNING. A new
tutorial was added to explain how to change the log level.
- The RepeatsFilter argument year_name was renamed to time_col and the argument
additional_cols was added.
- The perc argument of BorutaPy can now be set from the CLI.
- Descriptor calculators (previously used to aggregate and manage descriptor sets) were
completely removed from the API and descriptor sets can now be added directly to the
molecule tables.
- The rdkit-like descriptor and fingerprint retrieval functions were removed from the
API because they complicated implementation of customized descriptors.
- The apply method was simplified and a new API was clearly defined for parallel
processing of properties over data sets. To improve molecule processing,
a processMols method was added to MoleculeTable.
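The decoupling of models from data sets described in the first change can be pictured with a small toy sketch. It only mirrors the method names mentioned above (fitDataset and initFromDataset) and is not the package's implementation: ToyDataset, ToyMeanModel and ToyAssessor are hypothetical classes, and the mean-absolute-error score is an arbitrary choice made for the example.

```python
class ToyDataset:
    """Hypothetical data set holding features X and targets y."""

    def __init__(self, X, y):
        self.X = X
        self.y = y


class ToyMeanModel:
    """Hypothetical model that predicts the mean target.

    It stores no reference to any data set; data is passed in on each call.
    """

    def __init__(self):
        self.n_features = None
        self.mean_ = None

    def initFromDataset(self, dataset):
        # Derive data-dependent settings, here the number of input features.
        self.n_features = len(dataset.X[0])

    def fitDataset(self, dataset):
        # Settings are initialized from the data automatically before fitting.
        self.initFromDataset(dataset)
        self.mean_ = sum(dataset.y) / len(dataset.y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ToyAssessor:
    """Hypothetical assessor taking the model first and the data set second."""

    def __call__(self, model, dataset):
        model.fitDataset(dataset)
        predictions = model.predict(dataset.X)
        errors = [abs(p - t) for p, t in zip(predictions, dataset.y)]
        return sum(errors) / len(errors)
```

Because the data set is an argument rather than an attribute, one assessor and one model configuration can be reused across several data sets in this sketch.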
New Features
- The qsprpred.benchmarks module was added, which contains functions to easily
benchmark models on datasets.
- Most unit tests now have a variant that checks whether using a fixed random seed gives
reproducible results.
- The build pipeline now contains a check that the Jupyter notebooks give the same
results as ones that were observed before.
- Added FitMonitor, AssessorMonitor, and HyperparameterOptimizationMonitor base
classes to monitor the progress of fitting, assessing, and hyperparameter
optimization, respectively.
- Added BaseMonitor class to internally keep track of the progress of a fitting,
assessing, or hyperparameter optimization process.
- Added FileMonitor class to save the progress of a fitting, assessing, or
hyperparameter optimization process to files.
- Added WandBMonitor class to save the progress of a fitting, assessing, or
hyperparameter optimization process to Weights & Biases.
- Added NullMonitor class to ignore the progress of a fitting, assessing, or
hyperparameter optimization process.
- Added ListMonitor class to combine multiple monitors.
- Cross-validation, testing, hyperparameter optimization and early-stopping were made
more flexible by allowing custom splitting and fold generation strategies. A tutorial
showcasing these features was created.
- Added a reset method to QSPRDataset, which resets splits and loads all descriptors
into the training set matrix again.
- Added ConfusionMatrixPlot to plot confusion matrices.
- Added the searchWithIndex, searchOnProperty, searchWithSMARTS and sample methods
to MoleculeTable to facilitate more advanced sampling from data.
- Assessors now have the split_multitask_scores flag that can be used to evaluate each
task separately with single-task metrics.
- MoleculeDataSet instances now have the smiles property to easily get SMILES strings.
- A Docker-based runner in testing/runner can now be used to test GPU-enabled features
and run the full CI pipeline.
- It is now possible to save PandasDataTable instances to a CSV file instead of the
default pickle format (slower, but more human-readable).
- New RegressionPlot class WilliamsPlot added to plot Williams plots.
- Data sets can now be optionally stored in the csv format and not just as a pickle
file. This makes it easier to debug and share data sets, but it is slower to load and
save.
- Added ApplicabilityDomain class to calculate applicability domain and filter
outliers from test sets.
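The relationship between the monitor classes listed above (a base class, a monitor that ignores progress, and a monitor that combines others) can be sketched generically. Everything in this example is hypothetical: the class names, the hook names on_start, on_step and on_end, and the run_fit driver are invented for illustration and are not the package's monitoring API.

```python
class ToyMonitor:
    """Hypothetical base class: every hook is a no-op by default."""

    def on_start(self, name):
        pass

    def on_step(self, step, score):
        pass

    def on_end(self):
        pass


class ToyNullMonitor(ToyMonitor):
    """Ignores all progress events by inheriting the no-op hooks."""


class ToyRecordingMonitor(ToyMonitor):
    """Keeps track of progress events in memory."""

    def __init__(self):
        self.events = []

    def on_start(self, name):
        self.events.append(("start", name))

    def on_step(self, step, score):
        self.events.append(("step", step, score))

    def on_end(self):
        self.events.append(("end",))


class ToyListMonitor(ToyMonitor):
    """Combines several monitors by forwarding each event to all of them."""

    def __init__(self, monitors):
        self.monitors = list(monitors)

    def on_start(self, name):
        for monitor in self.monitors:
            monitor.on_start(name)

    def on_step(self, step, score):
        for monitor in self.monitors:
            monitor.on_step(step, score)

    def on_end(self):
        for monitor in self.monitors:
            monitor.on_end()


def run_fit(n_steps, monitor):
    # Stand-in for a fitting loop that reports a made-up score per step.
    monitor.on_start("fit")
    for step in range(n_steps):
        monitor.on_step(step, 1.0 / (step + 1))
    monitor.on_end()
```

The fitting loop in this sketch only talks to one monitor object, so swapping a recording monitor for a null or combined one needs no change to the loop itself.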
Removed Features
- The Metric interface has been simplified in order to make it easier to implement
custom metrics. The Metric interface now only requires the implementation of
the __call__ method, which takes predictions and returns a float. The Metric
interface no longer requires the implementation of needsDiscreteToScore,
needsProbaToScore and supportsTask. However, this means the base functionality of
checkMetricCompatibility, isClassificationMetric and isRegressionMetric is no longer
available.
- The default hyperparameter search space file is no longer available.