[WIP] Test suite to detect changes that break loading of models (Issue #458) #519
This PR contains a draft for a test suite that enables testing model saving and loading between different commits (see #458). The problem is somewhat complex, so I'll detail my perceived requirements below. As always, there are multiple ways to tackle this. If you have ideas for improvement, or prefer a different design, please comment below. Suggestions regarding naming are also highly welcome.
Goals & Requirements
Goal:
Requirements - Usability:
Requirements - Maintainability:
Solutions
We want to ensure that after loading, the model produces the same result (`log_prob`, `samples`, ...) as before. We can use the following intuitive approach to (approximately) ensure this is the case:

- run functions (e.g., `log_prob`) of the model that depend on its state, and store the inputs and outputs to disk
- after loading, run the same functions on the stored inputs and compare -> any deviations and errors indicate breaking changes

We have to do this for every class that contains configuration and state that can be stored to and loaded from disk. We can ignore utility functions, diagnostics, workflows (as long as they are not serializable), ...
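To make this concrete, here is a minimal sketch of the save-then-compare idea, assuming a Keras-serializable approximator with a `log_prob` method. The helper names and file layout are made up for illustration and are not part of this PR:

```python
# Minimal sketch (not the PR's implementation): helper names, the log_prob
# signature, and the file layout are assumptions for illustration.
import numpy as np
import keras


def save_reference(approximator, data, directory):
    """Old version: store the model plus inputs and outputs of a state-dependent function."""
    keras.saving.save_model(approximator, str(directory / "model.keras"))
    log_prob = keras.ops.convert_to_numpy(approximator.log_prob(data))
    np.savez(directory / "reference.npz", log_prob=log_prob, **data)


def check_reference(directory):
    """New version: reload everything and compare against the stored outputs."""
    approximator = keras.saving.load_model(str(directory / "model.keras"))
    stored = np.load(directory / "reference.npz")
    data = {key: stored[key] for key in stored.files if key != "log_prob"}
    log_prob = keras.ops.convert_to_numpy(approximator.log_prob(data))
    # Any deviation, or an error while loading, indicates a breaking change.
    np.testing.assert_allclose(log_prob, stored["log_prob"], rtol=1e-5)
```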
Handling virtual environments
tox and nox are commonly used tools for handling virtual environments for test sessions. We already use tox in the project, but I didn't succeed in setting up the required environments dynamically using only configuration files. I therefore opted for the more flexible nox, which allows setting up environments via a Python API. This allows for flexible pre-processing of the inputs. If anyone sees how we can achieve the same in tox, there would be no reason against switching.
The code can be found in `noxfile.py`. It converts the provided git commits/branches/tags to revisions, installs the necessary bayesflow version in a virtual environment, and launches the test session.
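For orientation, here is a stripped-down sketch of what such a nox session could look like. The session name, argument handling, and install strategy are assumptions for illustration; the actual logic lives in `noxfile.py`.

```python
# noxfile.py (sketch only) -- session name, argument handling, and install
# strategy are illustrative assumptions, not the contents of the PR's noxfile.
import pathlib

import nox

REPO = pathlib.Path(__file__).parent.resolve()


@nox.session
def compatibility(session):
    """Install a given bayesflow revision into a fresh env and run the compatibility tests.

    Assumed invocation: nox -s compatibility -- <commit|branch|tag|.>
    """
    revision = session.posargs[0] if session.posargs else "."
    if revision == ".":
        # use the local, possibly uncommitted state
        session.install("-e", ".[test]")
    else:
        # install the requested revision directly from the local git history
        session.install(f"bayesflow @ git+file://{REPO}@{revision}", "pytest")
    session.run("pytest", "tests/test_compatibility")
```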
Tests

The tests are (more or less) ordinary pytest tests that can also run as part of the ordinary test suite (currently in the folder `tests/test_compatibility`). In this case, they will just store the output to a temporary directory and perform a (same-version) save and load test. To avoid repetition and keep the tests readable, I would propose the following structure (a sketch follows this list):

- a `SaveLoadTest` class, which provides basic file handling utilities to convert filenames to unique, fixed paths
- a `setup` method (fixture) stores everything to disk (save) or loads everything from disk (load)
- a `test` method consumes the loaded model and data from the setup, runs the inputs through the functions, and compares the output
- as `setup` and `test` run the same functions to produce the same output, it makes sense to create an `evaluate` function which contains the computations
- as `setup` is a fixture, it can be parameterized as usual with `pytest.mark.parametrize` and consume other fixtures
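A rough, self-contained sketch of that structure. It uses a plain Keras model and pytest's built-in `tmp_path` fixture as stand-ins for the real bayesflow objects and the fixed output directory; the class and test names are illustrative only.

```python
# Sketch of the proposed structure; the PR's version handles the save/load
# mode switch and the fixed output directory more carefully.
import numpy as np
import pytest
import keras


class SaveLoadTest:
    """Base class: derive a unique, fixed path for each test's artifacts."""

    def path(self, root, name):
        directory = root / type(self).__name__
        directory.mkdir(parents=True, exist_ok=True)
        return directory / name


class TestDenseModel(SaveLoadTest):
    def evaluate(self, model, inputs):
        # single place for the state-dependent computations, used by setup and test
        return keras.ops.convert_to_numpy(model(inputs))

    @pytest.fixture
    def setup(self, tmp_path):
        model = keras.Sequential([keras.layers.Dense(2)])
        inputs = np.random.default_rng(0).normal(size=(4, 3)).astype("float32")
        path = self.path(tmp_path, "model.keras")
        # "save" step: store the model and the reference outputs
        reference = self.evaluate(model, inputs)
        keras.saving.save_model(model, str(path))
        # "load" step: reload and hand everything to the test
        loaded = keras.saving.load_model(str(path))
        return loaded, inputs, reference

    def test_outputs_unchanged(self, setup):
        loaded, inputs, reference = setup
        np.testing.assert_allclose(self.evaluate(loaded, inputs), reference, rtol=1e-5)
```

Keeping all computations in `evaluate` ensures the stored reference and the comparison in the test cannot drift apart.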
Parametrization

So far, we have relied on parametrizing fixtures in most places of the test suite, which are then applied in all possible permutations. This can be limiting when not all values of one fixture are compatible with all values of a different fixture.
`@pytest.mark.parametrize` allows specifying only the desired combinations (if used in one decorator), as well as full permutations (when using multiple decorators).

I have also decided not to use the `request.getfixturevalue(request.param)` pattern, as it only works reliably with fixtures without inputs. Instead, I have opted for a `match ... case` which takes a `name` and `kwargs`, similar to the dispatch functions. In some cases, we can even simply call the respective `find_...` dispatch function. Passing `kwargs` has the advantage that any parameter combination can be specified without creating a large number of fixtures. The drawback is that they cannot be parametrized as flexibly. If this is required, we might need to write or include tooling for this.

Another change was to remove the "session"-wide fixtures, to make fixtures overridable in the more specific `conftest.py`.

Please take a look at the tests in `tests/test_compatibility` for examples.
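For illustration, here is a condensed sketch of the `(name, kwargs)` parametrization pattern. The fixture and builder names are placeholders; in the actual tests, the `match` would construct bayesflow components or delegate to a `find_...` dispatch function.

```python
# Sketch of the (name, kwargs) parametrization pattern; all names are placeholders.
import pytest


def build_network(name, **kwargs):
    # stands in for a match on bayesflow components (or a call to a find_... function)
    match name:
        case "mlp":
            return {"type": "mlp", "widths": kwargs.get("widths", (64, 64))}
        case "deep_set":
            return {"type": "deep_set", "depth": kwargs.get("depth", 2)}
        case _:
            raise ValueError(f"unknown network: {name}")


@pytest.fixture
def network(request):
    name, kwargs = request.param
    return build_network(name, **kwargs)


# only the listed combinations run, instead of all permutations of separate fixtures
@pytest.mark.parametrize(
    "network",
    [("mlp", {"widths": (32, 32)}), ("deep_set", {"depth": 3})],
    indirect=True,
)
def test_network_builds(network):
    assert network["type"] in ("mlp", "deep_set")
```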
Usage

First, install nox via `pip install -e .[test]`. To save models and outputs from one version (e.g., v2.0.4), run the corresponding nox session (see `noxfile.py` for the exact invocation). This installs version 2.0.4 into a virtual env, clones the currently checked out `tests` folder into a temporary directory, and runs pytest on the `tests/test_compatibility` folder. The result is stored in the folder `_compatibility_data`.

To load the data with the `dev` branch, you can then run the corresponding load invocation. To use the local state instead of a commit/branch/tag, just use `.`.
In the end, the workflow between releases would be to first save the reference outputs with the old release and then run the load tests against the new release (or the local state).
We can do `git checkout <new_release>` as well, but will then see some tests fail if they test functionality not present in `<old_release>`.
Downsides

The main downside of the current approach is the duplication between the "ordinary" tests and the compatibility tests. As they have somewhat different constraints (the ordinary test suite already takes a long time, so we cannot test as many variants), this might be necessary, but we could also think about a closer integration of the two. I'm not sure yet what is best here...
There are still some things missing, mainly documentation; I will add those when/if we have decided on an implementation.
This has become quite a long explanation, but the whole topic is a bit unwieldy. I'm looking forward to your thoughts and opinions. This is not an urgent PR, so just take a look when you find the time...