
Remove AD backend loops in test suite #2564


Merged
penelopeysm merged 21 commits into main from py/no-repeated-adtype on May 29, 2025

Conversation

penelopeysm (Member) commented May 23, 2025

Closes #2307

Summary

Historically, Turing's test suite has run HMC (and other sampler) tests three times: once with ForwardDiff, once with ReverseDiff, and once with Mooncake. This PR removes that loop and runs sampler-based tests with ForwardDiff only.
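For illustration, the removed pattern looked roughly like this (a hypothetical sketch, not the exact test code; DEMO_MODELS is from DynamicPPL's test utilities):

    # Hypothetical sketch of the old pattern: every sampler test repeated per backend.
    using Turing, DynamicPPL
    using ADTypes: AutoForwardDiff, AutoReverseDiff, AutoMooncake
    import ReverseDiff, Mooncake  # backends must be loaded for their ADTypes to work

    model = DynamicPPL.TestUtils.DEMO_MODELS[1]
    for adbackend in (AutoForwardDiff(), AutoReverseDiff(), AutoMooncake(; config=nothing))
        sample(model, HMC(0.01, 4; adtype=adbackend), 100)
    end

    # After this PR, sampler-based tests run once, with ForwardDiff only:
    sample(model, HMC(0.01, 4; adtype=AutoForwardDiff()), 100)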

In its place, I've moved all the AD-based tests to test/ad.jl. These comprise:

ADTypeCheckContext tests

These are directly copied from the old test/test_utils/ad_utils.jl file, and basically exist to make sure that samplers and optimisers don't suddenly default to using ForwardDiff.
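The idea, in miniature, is something like the following sketch (hypothetical; this shows the mechanism being tested, not the actual ADTypeCheckContext implementation, and the names probe and SEEN_ELTYPES are made up here):

    # Sketch: record the element types the model sees at runtime, then check
    # that the requested backend's tracer type actually showed up. If the
    # sampler had silently fallen back to ForwardDiff, we would see
    # ForwardDiff.Dual here instead of ReverseDiff.TrackedReal.
    using Turing
    using ADTypes: AutoReverseDiff
    import ReverseDiff

    const SEEN_ELTYPES = Set{Type}()

    @model function probe()
        m ~ Normal()
        push!(SEEN_ELTYPES, typeof(m))  # runs on every model evaluation
        return m
    end

    sample(probe(), HMC(0.01, 4; adtype=AutoReverseDiff()), 10)
    @assert any(T -> T <: ReverseDiff.TrackedReal, SEEN_ELTYPES)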

SamplingContext tests with DynamicPPL demo models

When running sample(model, spl, N), the LogDensityFunction gets constructed with a context of SamplingContext(rng, spl), which means that model execution can in general go down a completely different path when compared to just using DefaultContext. Thus, there is still a need to test the correctness of AD on this.
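Concretely, such a test might look like this sketch (assuming the DynamicPPL LogDensityFunction and SamplingContext APIs described above; the seed and model choice are arbitrary):

    # Sketch: build a LogDensityFunction under SamplingContext, as sample()
    # does, and differentiate through it with a non-default backend.
    using Turing, DynamicPPL, LogDensityProblems
    using ADTypes: AutoReverseDiff
    using Random: Xoshiro
    import ReverseDiff

    model = DynamicPPL.TestUtils.DEMO_MODELS[1]
    spl = DynamicPPL.Sampler(NUTS())
    ctx = DynamicPPL.SamplingContext(Xoshiro(468), spl)
    vi = DynamicPPL.VarInfo(model)
    ldf = DynamicPPL.LogDensityFunction(model, vi, ctx; adtype=AutoReverseDiff())
    lp, grad = LogDensityProblems.logdensity_and_gradient(ldf, vi[:])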

GibbsContext tests with DynamicPPL demo models

The same rationale as above, but for Gibbs sampling, which contextualises the model with GibbsContext.

Are the demo models enough?

Given that we previously tested AD on a range of models in the inference tests, it is fair to ask whether we lose AD testing coverage by no longer testing all of them.

I've (rather painstakingly) gone through all of the models in the inference tests, and it turns out that most of them are not very interesting. For example, gdemo_default is used a lot, and that's already one of the DynamicPPL demo models.

Other models are specifically chosen to highlight issues with DynamicPPL internals rather than AD, so I don't really consider losing the AD testing on these to be a huge problem.

Most of the non-trivial Gibbs models have discrete parameters and so obviously aren't suitable for general AD testing.

There were three models that (in my best judgment) were worth adding some tests for, so I've already transferred these to ADTests: TuringLang/ADTests@4c34c19. (I am aware that ADTests is not really a 'test' in the truest sense, as it doesn't get run in CI.)

More generally, @yebai has made the point that for AD testing in CI, we can stick to just the DynamicPPL demo models. Although I still have slight concerns that they don't offer complete coverage, I do think that they cover a pretty wide range of DynamicPPL features, so I am happy to sign off on this PR, in the interests of not letting the perfect be the enemy of the good.

What next?

This PR roughly halves the time taken to run Gibbs tests, which are the slowest tests.

I think it would be a very quick win to try to rebalance the test suite in CI, especially if we can split the Gibbs tests up. Right now the AD tests are being run together with 'everything else', which is not optimal.

I also note that the Gibbs demo model tests take an hour to run on 1.10, but only around 15 minutes on 1.11. This accounts for much of the 90-minute runtime of the slowest test job (Gibbs on 1.10). I don't know why this is the case. In an ideal world, I'd like somebody to look into it, although we don't have much time.

penelopeysm marked this pull request as draft on May 23, 2025 11:37

Turing.jl documentation for PR #2564 is available at:
https://TuringLang.github.io/Turing.jl/previews/PR2564/

codecov bot commented May 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.86%. Comparing base (dea5d19) to head (f1263bf).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2564   +/-   ##
=======================================
  Coverage   85.86%   85.86%           
=======================================
  Files          21       21           
  Lines        1429     1429           
=======================================
  Hits         1227     1227           
  Misses        202      202           


coveralls commented May 23, 2025

Pull Request Test Coverage Report for Build 15325551548

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 85.924%

Totals Coverage Status
Change from base Build 15324242745: 0.0%
Covered Lines: 1227
Relevant Lines: 1428

💛 - Coveralls

penelopeysm changed the title from "[WIP] Remove AD backend loops" to "[WIP] Remove AD backend loops in test suite" on May 23, 2025
penelopeysm force-pushed the py/no-repeated-adtype branch from cc047e3 to 62d414f on May 23, 2025 15:50
penelopeysm force-pushed the py/no-repeated-adtype branch from ad11da9 to d581efc on May 24, 2025 15:49
penelopeysm force-pushed the py/no-repeated-adtype branch from f3e61d5 to 0975a91 on May 24, 2025 20:04
penelopeysm changed the title from "[WIP] Remove AD backend loops in test suite" to "Remove AD backend loops in test suite" on May 28, 2025
penelopeysm marked this pull request as ready for review on May 28, 2025 00:16
penelopeysm requested review from mhauru and sunxd3 on May 28, 2025 00:19
penelopeysm force-pushed the py/no-repeated-adtype branch from 739eb97 to f59a89c on May 28, 2025 00:41
mhauru (Member) left a comment

I, too, am a bit nervous about how much this reduces our test coverage (and I don't mean lines of code but functionality), but it definitely seems like the right way to do things long term.

One concern I have is that some bits of code that aren't within logdensity_and_gradient calls might still somehow fail, due to e.g. not handling tracer types. For instance, Gibbs does a bunch of resetting of sampler states, and nasty things could happen if some vector element types get messed up. run_ad wouldn't catch that, since it evaluates the LDF without calling step. Maybe in the loop over algorithms in ad.jl we could also do a call to sample with each AD backend and model, just to ensure that it doesn't crash?
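For illustration, such a smoke test might look like the following sketch (hypothetical; it only asserts that sampling runs end-to-end without erroring):

    # Hypothetical smoke test: a short sample() call per backend and model, to
    # catch failures outside logdensity_and_gradient (e.g. in state resetting).
    using Turing, DynamicPPL, MCMCChains, Test
    using ADTypes: AutoForwardDiff, AutoReverseDiff
    import ReverseDiff

    @testset "sample() smoke test" begin
        for adtype in (AutoForwardDiff(), AutoReverseDiff())
            for model in DynamicPPL.TestUtils.DEMO_MODELS
                # A handful of iterations suffices to exercise step() and state handling.
                chn = sample(model, HMC(0.01, 4; adtype=adtype), 5)
                @test chn isa MCMCChains.Chains
            end
        end
    end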

Comment on lines +605 to -634

    +        Turing.Gibbs(@varname(s) => HMC(0.01, 4), @varname(m) => MH()),
    +        Turing.Gibbs(@varname(s) => MH(), @varname(m) => HMC(0.01, 4)),
         ]
    -
    -    if !has_dot_assume(model)
    -        # Add in some MH samplers, which are not compatible with `.~`.
    -        append!(
    -            samplers,
    -            [
    -                Turing.Gibbs(@varname(s) => HMC(0.01, 4), @varname(m) => MH()),
    -                Turing.Gibbs(@varname(s) => MH(), @varname(m) => HMC(0.01, 4)),
    -            ],
    -        )
    -    end

penelopeysm (Member, Author) replied

I also removed this (outdated) check for dot assume, so now it checks with every sampler combination.

penelopeysm requested a review from mhauru on May 28, 2025 15:47
sunxd3 (Member) commented May 29, 2025

I am not very concerned about the potential coverage loss - we can add tests back when things fail.
The changes make a lot of sense to me. All looks good.

(Is this missed?)

sunxd3 (Member) commented May 29, 2025

Also, we can probably get rid of the adbackend variable and just use Turing.DEFAULT_ADTYPE? Not essential.

    adbackend = Turing.DEFAULT_ADTYPE

(the same line, linked in three places)

penelopeysm (Member, Author) commented

(Is this missed?)

To be completely honest, I have no idea what that test is doing or why it's using a different adtype.

I didn't want to touch it in this PR, but let's put it this way: if someone else were to put in a PR removing this test, I would probably approve...

Also, we can probably get rid of the adbackend and just use Turing.DEFAULT_ADTYPE?

We could, or we could just drop the adtype keyword argument entirely and let it fall back to the default (i.e. just use NUTS() rather than NUTS(; adtype=Turing.DEFAULT_ADTYPE)). I guess it's a question of whether to be explicit or not -- in this case I erred on being explicit. But honestly, it'd be a lot cleaner not to have it in there, so I'll remove them and then merge.

penelopeysm merged commit 1a70627 into main on May 29, 2025
40 of 46 checks passed
penelopeysm deleted the py/no-repeated-adtype branch on May 29, 2025 16:25
Successfully merging this pull request may close these issues: Refactoring AD Tests (#2307)

4 participants