[WIP] Test suite to detect changes that break loading of models (Issue #458) #519


Draft
wants to merge 12 commits into base: dev

Conversation

@vpratz (Collaborator) commented Jun 21, 2025

This PR contains a draft for a test suite that enables testing model saving and loading between different commits (see #458). The problem is somewhat complex, so I'll detail my perceived requirements below. There are, as always, multiple ways to tackle this: if you have ideas for improvement or prefer a different design, please comment below. Suggestions regarding naming are also highly welcome.

Goals & Requirements

Goal:

  • detect unintentional breaking changes that prevent models saved with an older version from being loaded in a newer version
  • we do not care about differences in training etc., only whether the model still produces the correct output

Requirements - Usability:

  • compare compatibility of arbitrary revisions (after the introduction of the test suite)
  • clean installs in virtual environments

Requirements - Maintainability:

  • easy specification of many variants, so that each class can be tested with different settings
  • semi-automatic file management

Solutions

We want to ensure that after loading, the model produces the same result (log_prob, samples, ...) as before. We can use the following intuitive approach to (approximately) ensure this is the case:

  1. Saving (in the older revision)
    • create, build and potentially train the model, so that all state that we expect to change is modified
    • save the model to disk
    • make a call to deterministic functions (e.g., log_prob) of the model that depend on the state, and store the inputs and outputs to disk
  2. Loading (in the newer revision)
    • load the model, the function inputs and the outputs from disk
    • try to reproduce the outputs as in 1. and compare the results
      -> any deviations or errors indicate a breaking change

We have to do this for every class that contains configuration and state that can be stored and loaded from disk. We can ignore utility functions, diagnostics, workflows (as long as they are not serializable), ...
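To make this concrete, the roundtrip boils down to something like the sketch below. The helper functions, the use of keras.saving, the log_prob call, and the tolerance are illustrative assumptions, not the exact code in this PR:

```python
import keras
import numpy as np


def save_reference(model, inputs, model_path, data_path):
    """Run with the older revision: store the model plus one input/output pair."""
    model.save(model_path)
    # log_prob stands in for any deterministic, state-dependent function of the model
    np.savez(data_path, **inputs, outputs=np.asarray(model.log_prob(**inputs)))


def check_reference(model_path, data_path, atol=1e-6):
    """Run with the newer revision: reload everything and compare the outputs."""
    loaded = keras.saving.load_model(model_path)
    data = np.load(data_path)
    inputs = {key: data[key] for key in data.files if key != "outputs"}
    # any exception above, or a mismatch here, indicates a breaking change
    np.testing.assert_allclose(
        np.asarray(loaded.log_prob(**inputs)), data["outputs"], atol=atol
    )
```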

Handling virtual environments

tox and nox are commonly used tools for handling virtual environments for test sessions. We already use tox in the project, but I didn't succeed in setting up the required environments dynamically using only configuration files. I therefore opted for the more flexible nox, which allows setting up environments via a Python API and thus allows for flexible pre-processing of the inputs. If anyone sees how we can achieve the same in tox, there would be no reason not to switch back.

The code can be found in noxfile.py. It converts the provided git commits/branches/tags to revisions, installs the necessary bayesflow version in a virtual environment and launches the test session.
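For orientation, such a session might look roughly like the sketch below; the session name, the repository URL, and the environment variable are illustrative assumptions, and the actual noxfile.py in this PR handles the arguments in more detail:

```python
# noxfile.py (simplified sketch)
import nox


@nox.session
def compatibility(session):
    """Invoked as `nox -- save <revision>` or `nox -- load <revision> --from <revision>`."""
    mode, revision, *rest = session.posargs

    # Install the requested bayesflow revision into a fresh virtual environment;
    # "." would install the local checkout instead of a git revision.
    if revision == ".":
        session.install("-e", ".[test]")
    else:
        session.install(f"bayesflow @ git+https://github.com/bayesflow-org/bayesflow@{revision}")
        session.install("pytest")

    # Run the compatibility tests; the environment variable name is hypothetical.
    session.run("pytest", "tests/test_compatibility", env={"COMPATIBILITY_MODE": mode})
```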

Tests

The tests are (more or less) ordinary pytest tests that can also run as part of the ordinary test suite (currently in the folder tests/test_compatibility). In that case, they just store the output to a temporary directory and perform a (same-version) save and load test. To avoid repetition and keep the tests readable, I propose the following structure:

  • class-based tests, with one class for each kind of object we want to test (summary network, inference network, adapter, ...)
  • each class inherits from a base class, SaveLoadTest, which provides basic file handling utilities to convert filenames to unique, fixed paths
  • each class is structured according to the setting above:
    • a setup method (fixture) stores everything to disk (save) or loads everything from disk (load)
    • a test method consumes the loaded model and data from the setup, runs the inputs through the functions, and compares the outputs
    • as both setup and test run the same computations, it makes sense to factor them into an evaluate function (see the sketch after this list)
  • as setup is a fixture, it can be parameterized as usual with pytest.mark.parametrize and consume other fixtures
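
A rough sketch of this layout follows; the base class, the fixture names (inference_network, random_data, mode), and the use of log_prob are placeholders for illustration and will differ from the actual code in tests/test_compatibility:

```python
import keras
import numpy as np
import pytest


# Minimal stand-in for the SaveLoadTest base class described above; the real class
# provides richer file handling and resolves paths into _compatibility_data for
# cross-version runs instead of a temporary directory.
class SaveLoadTest:
    def path(self, base_dir, name):
        # convert a filename into a unique, fixed path for this test class
        return base_dir / f"{type(self).__name__}_{name}"


class TestInferenceNetwork(SaveLoadTest):
    def evaluate(self, network, data):
        # the deterministic computation whose result must survive save/load;
        # log_prob stands in for whatever call the tested object provides
        return np.asarray(network.log_prob(data))

    @pytest.fixture
    def setup(self, tmp_path, inference_network, random_data, mode):
        # `inference_network`, `random_data`, and `mode` ("save"/"load") are
        # assumed to be fixtures defined in a conftest.py
        model_path = self.path(tmp_path, "model.keras")
        output_path = self.path(tmp_path, "outputs.npy")
        if mode == "save":
            inference_network.save(model_path)
            np.save(output_path, self.evaluate(inference_network, random_data))
        network = keras.saving.load_model(model_path)
        return network, random_data, np.load(output_path)

    def test_evaluate(self, setup):
        network, data, reference = setup
        # any exception or deviation here indicates a breaking change
        np.testing.assert_allclose(self.evaluate(network, data), reference)
```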

Parametrization

Until now, the test suite has mostly relied on parametrized fixtures, which are then applied in all possible permutations. This can be limiting when not all values of one fixture are compatible with all values of another fixture. @pytest.mark.parametrize allows specifying only the desired combinations (when used in a single decorator), as well as full permutations (when stacking multiple decorators), as the example below shows.
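
For reference, this is standard pytest behavior (the parameter names and values here are made up):

```python
import pytest


# one decorator: only the listed combinations run (2 test cases)
@pytest.mark.parametrize("network, depth", [("coupling_flow", 4), ("flow_matching", 2)])
def test_combinations(network, depth): ...


# stacked decorators: the full Cartesian product runs (2 x 2 = 4 test cases)
@pytest.mark.parametrize("network", ["coupling_flow", "flow_matching"])
@pytest.mark.parametrize("depth", [2, 4])
def test_permutations(network, depth): ...
```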

I have also decided not to use the request.getfixturevalue(request.param) pattern, as it only works reliably for fixtures without inputs. Instead, I have opted for a match ... case that takes a name and kwargs, similar to the dispatch functions; in some cases, we can even simply call the respective find_... dispatch function. Passing kwargs has the advantage that any parameter combination can be specified without creating a large number of fixtures. The drawback is that the kwargs themselves cannot be parametrized as flexibly; if this is required, we might need to write or include tooling for it.
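
A sketch of the name-plus-kwargs pattern (the fixture and the concrete network classes are just examples; some fixtures in the PR instead call the corresponding find_... dispatch function):

```python
import bayesflow as bf
import pytest


@pytest.fixture
def summary_network(request):
    # request.param is a (name, kwargs) pair supplied via indirect parametrization
    name, kwargs = request.param
    match name:
        case "deep_set":
            return bf.networks.DeepSet(**kwargs)
        case "set_transformer":
            return bf.networks.SetTransformer(**kwargs)
        case _:
            raise ValueError(f"Unknown summary network: {name}")
```

Such a fixture would then be parametrized with, e.g., @pytest.mark.parametrize("summary_network", [("deep_set", {}), ("set_transformer", {})], indirect=True).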

Another change was to remove the "session"-scoped fixtures, so that fixtures can be overridden in the more specific conftest.py files.

Please take a look at the tests in tests/test_compatibility for examples.

Usage

First, install nox via pip install -e .[test]. To save models and outputs from one version (e.g., v2.0.4), run

nox -- save v2.0.4

This installs version 2.0.4 into a virtual environment, clones the currently checked-out tests folder into a temporary directory, and runs pytest on the tests/test_compatibility folder. The results are stored in the folder _compatibility_data.

To load the data with the dev branch, you can then run:

nox -- load dev --from v2.0.4

To use the local state instead of a commit/branch/tag, just pass "." as the revision.

In the end, the workflow between releases would be:

git checkout <old_release>
nox -- save <old_release>
nox -- load <new_release> --from <old_release>

We can do git checkout <new_release> instead, but will then see some tests fail if they cover functionality not present in <old_release>.

Downsides

The main downside of the current approach is the duplication between the "ordinary" tests and the compatibility tests. As they have somewhat different constraints (the ordinary test suite takes a long time, so we cannot test as many variants), this might be necessary, but we can also think about a closer integration of the two. I'm not sure yet what is best here...

Some things are still missing, mainly documentation; I will add those once we have decided on an implementation.

This has become quite a long explanation, but the whole topic is a bit unwieldy. I'm looking forward to your thoughts and opinions. This is not an urgent PR, so just take a look when you find the time...

@vpratz added the draft (Draft Pull Request, Work in Progress) label on Jun 21, 2025

codecov bot commented Jun 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

see 10 files with indirect coverage changes
