Merge branch 'dev' into refactor/gpu_models
martin-sicho committed Mar 20, 2024
2 parents a9e8833 + 7f63e85 commit 733a106
Showing 60 changed files with 13,855 additions and 12,411 deletions.
140 changes: 8 additions & 132 deletions CHANGELOG.md
@@ -1,146 +1,22 @@
# Change Log

From v2.1.1 to v3.0.0
From v3.0.2 to v3.0.3

## Fixes

- Fixed random seeds to give reproducible results. Each dataset is initialized with a
single random state (either from the constructor or a random number generator) which
is used in all subsequent random operations. Each model is initialized with a single
random state as well: it uses the random state from the dataset, unless it's overridden
in the constructor. When a dataset is saved to a file so is its random state, which is
used by the dataset when the dataset is reloaded.
- Fixed an error with serialization of the `DNNModel.params` attribute when no
parameters are set.
- Fixed a bug with saving predictions from a classification model
when `ModelAssessor.useProba` is set to `False`.
- Added the missing implementation of `QSPRDataset.removeProperty`.
- Improved behavior of the Papyrus data source (does not attempt to connect to the
internet if the data set already exists).
- It is now possible to define new descriptor sets outside the package without errors.
- Basic consistency of models is also checked in the unit test suite, including in
the `qsprpred.extra` package.
- Fixed a problem with feature standardizer being retrained on prediction data when a
prediction from SMILES was invoked. This affected all versions of the package higher
or equal to `v2.1.0`.
- Fixes to the `fromMolTable` method in various data set implementations, in particular
in copying of the feature standardizer and other settings.
- Fixed not working `cluster` split and `--imputation` from `data_CLI.py`.
- Fixed a problem with `ProteinDescriptorSet.getDescriptors` returning descriptors in
wrong order with `Pandas <v2.2.0`.
- Fixed a bug where an attached standardizer would be refit when calling
`QSPRModel.predictMols` with `use_applicability_domain=True`.
- Fixed random seed not set in `FoldsFromDataSplit.iterFolds` for `ClusterSplit`.

## Changes

- The model is now independent of data sets. This means that the model no longer
contains a reference to the data set it was trained on.
- The `fitAttached` method was replaced with `fitDataset`, which takes the data
set as an argument.
- Assessors now also accept a data set as a second argument. Therefore, the same
assessor can be used to assess different data sets with the same model settings.
- The monitoring API was also slightly modified to reflect this change.
- If a model requires initialization of some settings from data, this can be done in
its `initFromDataset` method, which takes the data set as an argument. This method
is called automatically before fitting, model assessment, and hyperparameter
optimization.
- The whole package was refactored to simplify certain commonly used imports. The
tutorial code was adjusted to reflect that.
- The Jupyter notebooks in the tutorial now pass a random state to ensure consistent
results.
- The default parameter values for `STFullyConnected` have changed from `n_epochs` =
1000 to `n_epochs` = 100, from `neurons_h1` = 4000 to `neurons_h1` = 256
and `neurons_hx` = 1000 to `neurons_hx` = 128.
- Rename `HyperParameterOptimization` to `HyperparameterOptimization`.
- `TargetProperty.fromList` and `TargetProperty.fromDict` now accept both a string
and a `TargetTask` as the `task` argument, without having to set
the `task_from_str` argument, which is now deprecated.
- Make `EarlyStopping.mode` flexible for `QSPRModel.fitDataset`.
- `save_params` argument added to `OptunaOptimization` to save the best hyperparameters
to the model (default: `True`).
- We now use `jsonpickle` for object serialization, which is more flexible than the
non-standard approach before, but it also means previous models will not be compatible
with this version.
- `SklearnMetric` was renamed to `SklearnMetrics`; it now also accepts a
scikit-learn scorer name as input.
- `QSPRModel.fitDataset` now accepts a `save_model` (default: `True`)
and `save_dataset` (default: `False`) argument to save the model and dataset to a file
after fitting.
- Tutorials were completely rewritten and expanded. They can now be found in
the `tutorials` folder instead of the `tutorial` folder.
- `MetricsPlot` now supports multi-class and multi-task classification models.
- `CorrelationPlot` now supports multi-task regression models.
- The behaviour of `QSPRDataset` was changed with regards to target properties. It now
remembers the original state of any target property and all changes are performed in
place on the original property column (i.e. conversion to multi-class classification).
This is to always maintain the same property name and always have the option to reset
it to the raw original state (i.e. if we switch to regression or want to repeat a
transformation).
- The default log level for the package was changed from `INFO` to `WARNING`. A
new tutorial was added to explain how to change the log level.
- `RepeatsFilter` argument `year_name` renamed to `time_col` and
argument `additional_cols` added.
- The `perc` argument of `BorutaPy` can now be set from the CLI.
- Descriptor calculators (previously used to aggregate and manage descriptor sets) were
completely removed from the API and descriptor sets can now be added directly to the
molecule tables.
- The rdkit-like descriptor and fingerprint retrieval functions were removed from the
API because they complicated implementation of customized descriptors.
- The `apply` method was simplified and a new API was clearly defined for parallel
processing of properties over data sets. To improve molecule processing,
a `processMols` method was added to `MoleculeTable`.
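The decoupling of models and data sets described above can be sketched with a toy mean-predictor. The hook and method names (`fitDataset`, `initFromDataset`, an assessor called with the data set as a second argument) mirror the changelog entries, but the bodies and the dict-based data set are illustrative assumptions, not QSPRpred code.

```python
class ToyModel:
    """Holds no reference to any data set; data is passed in per call."""
    def __init__(self):
        self.mean_ = None

    def initFromDataset(self, dataset: dict) -> None:
        # In the described design, this hook runs automatically before
        # fitting, model assessment, and hyperparameter optimization.
        self.n_features_ = len(dataset["X"][0])

    def fitDataset(self, dataset: dict) -> "ToyModel":
        self.initFromDataset(dataset)
        self.mean_ = sum(dataset["y"]) / len(dataset["y"])
        return self

    def predict(self, X: list) -> list:
        return [self.mean_ for _ in X]

class MSEAssessor:
    """Takes the data set as a second argument, so one assessor instance can
    score the same model configuration on different data sets."""
    def __call__(self, model: ToyModel, dataset: dict) -> float:
        model.fitDataset(dataset)
        preds = model.predict(dataset["X"])
        return sum((p - y) ** 2 for p, y in zip(preds, dataset["y"])) / len(preds)
```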

## New Features

- The `qsprpred.benchmarks` module was added, which contains functions to easily
benchmark models on datasets.
- Most unit tests now have a variant that checks whether using a fixed random seed gives
reproducible results.
- The build pipeline now contains a check that the Jupyter notebooks give the
same results as previously observed.
- Added `FitMonitor`, `AssessorMonitor`, and `HyperparameterOptimizationMonitor` base
classes to monitor the progress of fitting, assessing, and hyperparameter
optimization, respectively.
- Added `BaseMonitor` class to internally keep track of the progress of a fitting,
assessing, or hyperparameter optimization process.
- Added `FileMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to files.
- Added `WandBMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to [Weights & Biases](https://wandb.ai/).
- Added `NullMonitor` class to ignore the progress of a fitting, assessing, or
hyperparameter optimization process.
- Added `ListMonitor` class to combine multiple monitors.
- Cross-validation, testing, hyperparameter optimization and early-stopping were made
more flexible by allowing custom splitting and fold generation strategies. A tutorial
showcasing these features was created.
- Added a `reset` method to `QSPRDataset`, which resets splits and loads all descriptors
into the training set matrix again.
- Added `ConfusionMatrixPlot` to plot confusion matrices.
- Added the `searchWithIndex`, `searchOnProperty`, `searchWithSMARTS` and `sample`
methods to `MoleculeTable` to facilitate more advanced sampling from data.
- Assessors now have the `split_multitask_scores` flag that can be used to evaluate
each task separately with single-task metrics.
- `MoleculeDataSet`s now have the `smiles` property to easily get SMILES strings.
- A Docker-based runner in `testing/runner` can now be used to test GPU-enabled features
and run the full CI pipeline.
- It is now possible to save `PandasDataTable`s to a CSV file instead of the default
pickle format (slower, but more human-readable).
- New `RegressionPlot` class `WilliamsPlot` added to plot Williams plots.
- Data sets can now be optionally stored in the `csv` format and not just as a pickle
file. This makes it easier to debug and share data sets, but it is slower to load and
save.
- Added `ApplicabilityDomain` class to calculate applicability domain and filter
outliers from test sets.
- Added the `prepMols` method to `DescriptorSet` to allow separated customization of
molecule preparation before descriptor calculation.
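The monitor hierarchy added above can be sketched with a minimal stand-in. The callback name `onEpochEnd` and the `RecordingMonitor` class are hypothetical; only the idea — base monitors, a do-nothing default, and a `ListMonitor`-style combiner — comes from the changelog, with the real interfaces living in `qsprpred.models.monitors`.

```python
class FitMonitor:
    """Minimal stand-in for a fit-monitor base class."""
    def onEpochEnd(self, epoch: int, loss: float) -> None:
        pass  # a NullMonitor-style default: ignore all progress

class RecordingMonitor(FitMonitor):
    """Keeps fitting progress in memory, e.g. for later plotting."""
    def __init__(self):
        self.history: list[tuple[int, float]] = []

    def onEpochEnd(self, epoch: int, loss: float) -> None:
        self.history.append((epoch, loss))

class ListMonitor(FitMonitor):
    """Combines several monitors by fanning each callback out to all of them."""
    def __init__(self, monitors: list[FitMonitor]):
        self.monitors = list(monitors)

    def onEpochEnd(self, epoch: int, loss: float) -> None:
        for m in self.monitors:
            m.onEpochEnd(epoch, loss)
```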

## Removed Features

- The `Metric` interface has been simplified in order to make it easier to implement
custom metrics. The `Metric` interface now only requires the implementation of
the `__call__` method, which takes predictions and returns a `float`. The `Metric`
interface no longer requires the implementation
of `needsDiscreteToScore`, `needsProbaToScore` and `supportsTask`. However, this
means the base functionality of `checkMetricCompatibility`, `isClassificationMetric`
and `isRegressionMetric` is no longer available.
- The default hyperparameter search space file is no longer available.
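The simplified `Metric` interface described above — only `__call__` is required — can be sketched like this. The `MeanAbsoluteError` metric is a hypothetical example of a user-defined metric, not one shipped with the package.

```python
class Metric:
    """Simplified interface: implementing __call__ is all that is required."""
    def __call__(self, y_true: list, y_pred: list) -> float:
        raise NotImplementedError

class MeanAbsoluteError(Metric):
    """Hypothetical custom metric built on the simplified interface."""
    def __call__(self, y_true: list, y_pred: list) -> float:
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```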
2 changes: 1 addition & 1 deletion README.md
@@ -89,7 +89,7 @@ custom implementation of the `MSAProvider` class will have to be made.

After installation, you will have access to various command line features and you can
use the Python API directly (
see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you
see [Documentation](https://cddleiden.github.io/QSPRpred/docs/)). For a quick start, you
can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which
document the use of the Python API to build different types of models. The tutorials as
well as the documentation are still work in progress, and we will be happy for any
5 changes: 1 addition & 4 deletions docs/cli_usage.rst
@@ -14,10 +14,7 @@ e.g. the help message for the :code:`QSPRpred.data_CLI` script can be shown as f
A simple command-line workflow to prepare your dataset and train QSPR models is given below (see :ref:`CLI Example`).

If you want more control over the inputs and outputs or want to customize QSPRpred for your purpose,
you can also use the Python API directly (see `source code <https://github.com/CDDLeiden/QSPRpred/tree/main/tutorials>`_).
Here you can find a tutorial with a Jupyter notebook illustrating some common use cases in the project source code.
Make sure to download the tutorial folder to follow the examples in this CLI tutorial.

you can also use the Python API directly (see `tutorials <https://github.com/CDDLeiden/QSPRpred/tree/main/tutorials>`_).

CLI Example
***********
55 changes: 51 additions & 4 deletions docs/features.rst
@@ -52,7 +52,7 @@ Overview of available features
* :class:`~qsprpred.data.descriptors.fingerprints.AtomPairFP`: AtomPairFP
* :class:`~qsprpred.data.descriptors.fingerprints.AvalonFP`: AvalonFP
* :class:`~qsprpred.data.descriptors.fingerprints.LayeredFP`: LayeredFP
* :class:`~qsprpred.data.descriptors.fingerprints.MACCsFP`: MACCsFP
* :class:`~qsprpred.data.descriptors.fingerprints.MaccsFP`: MaccsFP
* :class:`~qsprpred.data.descriptors.fingerprints.MorganFP`: MorganFP
* :class:`~qsprpred.data.descriptors.fingerprints.PatternFP`: PatternFP
* :class:`~qsprpred.data.descriptors.fingerprints.RDKitFP`: RDKitFP
@@ -161,7 +161,7 @@ Overview of available features

.. dropdown:: Model Assessors

:class:`~qsprpred.models.assessment_methods.ModelAssessor`: Base class for model assessors.
:class:`~qsprpred.models.assessment.methods.ModelAssessor`: Base class for model assessors.

Model assessors are used to assess the performance of models.
More information can be found in the `model assessment tutorial <https://github.com/CDDLeiden/QSPRpred/blob/main/tutorials/basics/modelling/model_assessment.ipynb>`_.
@@ -170,8 +170,8 @@

.. tab-item:: Core

* :class:`~qsprpred.models.assessment_methods.CrossValAssessor`: CrossValAssessor
* :class:`~qsprpred.models.assessment_methods.TestSetAssessor`: TestSetAssessor
* :class:`~qsprpred.models.assessment.methods.CrossValAssessor`: CrossValAssessor
* :class:`~qsprpred.models.assessment.methods.TestSetAssessor`: TestSetAssessor

.. dropdown:: Hyperparameter Optimizers

@@ -209,3 +209,50 @@ Overview of available features
* :class:`~qsprpred.plotting.classification.MetricsPlot`: MetricsPlot
* :class:`~qsprpred.plotting.classification.ConfusionMatrixPlot`: ConfusionMatrixPlot

.. dropdown:: Monitors

* :class:`~qsprpred.models.monitors.FitMonitor`: Base class for monitoring model fitting
* :class:`~qsprpred.models.monitors.AssessorMonitor`: Base class for monitoring model assessment (subclass of :class:`~qsprpred.models.monitors.FitMonitor`)
* :class:`~qsprpred.models.monitors.HyperparameterOptimizationMonitor`: Base class for monitoring hyperparameter optimization (subclass of :class:`~qsprpred.models.monitors.AssessorMonitor`)

Monitors are used to monitor the training of models.
More information can be found in the `model monitoring tutorial <https://github.com/CDDLeiden/QSPRpred/blob/main/tutorials/advanced/modelling/monitoring.ipynb>`_.

.. tab-set::

.. tab-item:: Core

* :class:`~qsprpred.models.monitors.NullMonitor`: NullMonitor
* :class:`~qsprpred.models.monitors.ListMonitor`: ListMonitor
* :class:`~qsprpred.models.monitors.BaseMonitor`: BaseMonitor
* :class:`~qsprpred.models.monitors.FileMonitor`: FileMonitor
* :class:`~qsprpred.models.monitors.WandBMonitor`: WandBMonitor

.. dropdown:: Scaffolds

:class:`~qsprpred.data.chem.scaffolds.Scaffold`: Base class for scaffolds.

Class for calculating molecular scaffolds of different kinds

.. tab-set::

.. tab-item:: Core

* :class:`~qsprpred.data.chem.scaffolds.Murcko`: Murcko
* :class:`~qsprpred.data.chem.scaffolds.BemisMurcko`: BemisMurcko

.. dropdown:: Clustering

:class:`~qsprpred.data.chem.clustering.MoleculeClusters`: Base class for clustering molecules.

Classes for clustering molecules

.. tab-set::

.. tab-item:: Core

* :class:`~qsprpred.data.chem.clustering.RandomClusters`: RandomClusters
* :class:`~qsprpred.data.chem.clustering.ScaffoldClusters`: ScaffoldClusters
* :class:`~qsprpred.data.chem.clustering.FPSimilarityClusters`: FPSimilarityClusters
* :class:`~qsprpred.data.chem.clustering.FPSimilarityMaxMinClusters`: FPSimilarityMaxMinClusters
* :class:`~qsprpred.data.chem.clustering.FPSimilarityLeaderPickerClusters`: FPSimilarityLeaderPickerClusters
4 changes: 2 additions & 2 deletions qsprpred/benchmarks/runner.py
@@ -261,7 +261,7 @@ def getSeedList(self, seed: int | None = None) -> list[int]:
"""
Get a list of seeds for the replicas from one 'master' randomSeed.
The list of seeds is generated by sampling from the range of
possible seeds (0 to 2^32 - 1) with the given randomSeed as the random
possible seeds (0 to 2**31-1) with the given randomSeed as the random
randomSeed for the random module. This means that the list of seeds
will be the same for each run of the benchmarking experiment
with the same 'master' randomSeed. This is useful for reproducibility,
@@ -284,7 +284,7 @@ def getSeedList(self, seed: int | None = None) -> list[int]:
"""
seed = seed or self.settings.random_seed
random.seed(seed)
return random.sample(range(2**32 - 1), self.nRuns)
return random.sample(range(2**31), self.nRuns)
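The seed-list logic changed in this hunk can be reproduced standalone. The function name below is illustrative; only the seeding-then-sampling pattern mirrors `getSeedList`.

```python
import random

def get_seed_list(master_seed: int, n_runs: int) -> list[int]:
    """Derive per-replica seeds from one 'master' seed, as in the hunk above."""
    # Seeding the random module first makes the sampled list deterministic,
    # so repeated benchmark runs with the same master seed get the same replicas.
    random.seed(master_seed)
    # sample() draws without replacement, so all replica seeds are distinct.
    return random.sample(range(2**31), n_runs)
```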

def iterReplicas(self) -> Generator[Replica, None, None]:
"""Generator that yields `Replica` objects for each benchmarking run.
