Remove AD backend loops in test suite #2564
Turing.jl documentation for PR #2564 is available at:
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main    #2564   +/-   ##
=======================================
  Coverage   85.86%   85.86%
=======================================
  Files          21       21
  Lines        1429     1429
=======================================
  Hits         1227     1227
  Misses        202      202
```
Pull Request Test Coverage Report for Build 15325551548

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
I, too, am a bit nervous about how much this reduces our test coverage (and I don't mean lines of code but functionality), but it definitely seems like the right way to do things long term.

One concern I have is that some bits of code that aren't within `logdensity_and_gradient` calls might still somehow fail, due to e.g. not handling tracer types. For instance, Gibbs does a bunch of resetting of sampler states, and nasty things could happen if some vector element types get messed up. `run_ad` wouldn't catch that, since it evaluates the LDF without calling `step`. Maybe in the loop over algorithms in `ad.jl` we could also do a call to `sample` with each AD backend and model, just to ensure that it doesn't crash?
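Something like the following could work as that smoke test. This is a minimal sketch only: the model, chain length, and testset layout are placeholders rather than the actual contents of `ad.jl`.

```julia
using Test
using Turing
using ADTypes
import Mooncake, ReverseDiff  # backends must be loaded for AutoMooncake/AutoReverseDiff

# Placeholder model; the real tests would reuse the models already in ad.jl.
@model function smoke_demo(x)
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    x ~ Normal(m, sqrt(s))
end

adtypes = [AutoForwardDiff(), AutoReverseDiff(), AutoMooncake(; config=nothing)]

@testset "sample() does not crash with $adtype" for adtype in adtypes
    spl = Turing.Gibbs(@varname(s) => HMC(0.01, 4; adtype=adtype), @varname(m) => MH())
    # A short chain is enough to exercise `step`, state resetting, etc.,
    # which run_ad alone does not touch.
    chn = sample(smoke_demo(1.5), spl, 20)
    @test size(chn, 1) == 20
end
```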
```julia
            Turing.Gibbs(@varname(s) => HMC(0.01, 4), @varname(m) => MH()),
            Turing.Gibbs(@varname(s) => MH(), @varname(m) => HMC(0.01, 4)),
        ]

        if !has_dot_assume(model)
            # Add in some MH samplers, which are not compatible with `.~`.
            append!(
                samplers,
                [
                    Turing.Gibbs(@varname(s) => HMC(0.01, 4), @varname(m) => MH()),
                    Turing.Gibbs(@varname(s) => MH(), @varname(m) => HMC(0.01, 4)),
                ],
            )
        end
```
I also removed this (outdated) check for dot assume, so now it checks with every sampler combination.

I am not very concerned about the potential loss of coverage - we can add tests back when things fail. (Is this missed?)
Also, we can probably get rid of these:

- Line 376 in 5e7ed7d
- Line 22 in 5e7ed7d
- Turing.jl/test/mcmc/Inference.jl, line 23 in 5e7ed7d
To be completely honest, I have no idea what that test is doing or why it's using a different adtype. I didn't want to touch it in this PR, but let's put it this way: if someone else were to put in a PR removing this test, I would probably approve...
We could, or we could just drop the adtype keyword argument entirely and let it default to the default (i.e. just use ...)
Closes #2307
Summary
In Turing's test suite, we have historically run the HMC (and other sampler) tests three times: once with ForwardDiff, once with ReverseDiff, and once with Mooncake. This PR removes that loop and runs sampler-based tests with ForwardDiff only.
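For context, the pattern being removed looked roughly like this (an illustrative sketch, not the literal old test code; `gdemo_default` and `check_gdemo` are helpers from the existing test utilities):

```julia
using Test
using Turing
using ADTypes
import Mooncake, ReverseDiff

# Every sampler test was repeated once per AD backend, so HMC/NUTS/Gibbs tests
# each ran three times.
@testset "HMC on gdemo with $adbackend" for adbackend in (
    AutoForwardDiff(), AutoReverseDiff(), AutoMooncake(; config=nothing)
)
    chn = sample(gdemo_default, HMC(0.1, 7; adtype=adbackend), 1_000)
    check_gdemo(chn)  # numerical accuracy check from the test utilities
end
```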
In its place, I've moved all the AD-based tests to `test/ad.jl`. These comprise:

ADTypeCheckContext tests
These are directly copied from the old `test/test_utils/ad_utils.jl` file, and basically exist to make sure that samplers and optimisers don't suddenly default to using ForwardDiff.
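In rough terms, the idea is the following (a sketch only: the real `ADTypeCheckContext` lives in the test suite, and its constructor and exact behaviour are assumptions here):

```julia
# Wrap the model's evaluation context in a checker that inspects element types
# during tilde-evaluation; if e.g. ForwardDiff.Dual values appear while the
# declared backend is ReverseDiff, it throws.
using Turing, ADTypes, DynamicPPL
import ReverseDiff

adtype = AutoReverseDiff()
ctx = ADTypeCheckContext(adtype, DynamicPPL.DefaultContext())  # assumed constructor
checked_model = DynamicPPL.contextualize(gdemo_default, ctx)

# If HMC silently fell back to ForwardDiff, the wrapped context would see the
# wrong tracer type during sampling and error out.
sample(checked_model, HMC(0.1, 7; adtype=adtype), 100)
```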
SamplingContext tests with DynamicPPL demo models

When running `sample(model, spl, N)`, the LogDensityFunction gets constructed with a context of `SamplingContext(rng, spl)`, which means that model execution can in general go down a completely different path when compared to just using DefaultContext. Thus, there is still a need to test the correctness of AD on this.

GibbsContext tests with DynamicPPL demo models
The same rationale as above, but for Gibbs sampling, which contextualises the model with GibbsContext.
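A rough sketch of what these two groups of tests look like, assuming DynamicPPL's demo models and its `TestUtils.AD.run_ad` helper (the sampler wrapping and `run_ad` keyword defaults shown here are approximations):

```julia
# Re-contextualise each demo model the way `sample` would, then check gradients
# with each AD backend via run_ad. The GibbsContext tests follow the same
# pattern, with a GibbsContext in place of the SamplingContext.
using DynamicPPL, Turing, ADTypes, StableRNGs
import Mooncake, ReverseDiff
using DynamicPPL.TestUtils.AD: run_ad

adtypes = [AutoForwardDiff(), AutoReverseDiff(), AutoMooncake(; config=nothing)]

for model in DynamicPPL.TestUtils.DEMO_MODELS
    # Mirror how sample(model, spl, N) wraps the model before building the LDF.
    spl = DynamicPPL.Sampler(HMC(0.1, 7))
    ctx = DynamicPPL.SamplingContext(StableRNG(468), spl, DynamicPPL.DefaultContext())
    model_ctx = DynamicPPL.contextualize(model, ctx)
    for adtype in adtypes
        run_ad(model_ctx, adtype)  # errors / fails the test if gradients are wrong
    end
end
```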
Are the demo models enough?
Given that we would previously have been testing AD on a range of random models in the inference tests, it might be fair to ask the question of whether we are losing AD testing coverage by not testing on all of the models.
I've (rather painstakingly) gone through all of the models in the inference tests, and it turns out that most of them are not very interesting. For example, `gdemo_default` is used a lot, and that's already one of the DynamicPPL demo models. Other models are specifically chosen to highlight issues with DynamicPPL internals rather than AD, so I don't really consider losing the AD testing on these to be a huge problem.
Most of the non-trivial Gibbs models have discrete parameters and so obviously aren't suitable for general AD testing.
There were three models that (in my best judgment) were worth adding some tests for, so I've already transferred these to ADTests: TuringLang/ADTests@4c34c19. (I am aware that ADTests is not really a 'test' in the truest sense, as it doesn't get run in CI.)
More generally, @yebai has made the point that for AD testing in CI, we can stick to just the DynamicPPL demo models. Although I still have slight concerns that they don't offer complete coverage, I do think that they cover a pretty wide range of DynamicPPL features, so I am happy to sign off on this PR, in the interests of not letting the perfect be the enemy of the good.
What next?
This PR roughly halves the time taken to run Gibbs tests, which are the slowest tests.
I think it would be a very quick win to try to rebalance the test suite in CI, especially if we can split the Gibbs tests up. Right now the AD tests are being run together with 'everything else', which is not optimal.
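One possible shape for that split is the common GROUP environment variable pattern sketched below (group names and file paths are illustrative; Turing's actual `runtests.jl` and CI matrix may be organised differently):

```julia
# test/runtests.jl -- hypothetical grouping so that CI can run the AD tests,
# the Gibbs tests, and everything else as separate, roughly equal-length jobs.
using Test

const GROUP = get(ENV, "GROUP", "All")

if GROUP in ("All", "AD")
    @testset "AD" begin
        include("ad.jl")
    end
end

if GROUP in ("All", "Gibbs")
    @testset "Gibbs" begin
        include("mcmc/gibbs.jl")
    end
end

if GROUP in ("All", "Rest")
    @testset "Everything else" begin
        include("mcmc/Inference.jl")
        # ...remaining test files...
    end
end
```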
I also note that the Gibbs demo model tests take an hour to run on 1.10, but around 15 minutes on 1.11. This is a huge source of the 90-minute runtime for the slowest tests (Gibbs on 1.10). I don't know why this is the case. In an ideal world, I'd like somebody to look into it, although we don't have much time.