Add a generic code-inferrability test #37193
Because of this, I am not sure this is the best way to do it. By "this way" I mean setting an arbitrary cap and then hard-erroring CI when it is reached. That would be a bit like setting a hard cap on sysimage build time and failing CI when it is exceeded (assuming it were completely deterministic). When that point is reached, a lot of CI is likely to start failing and the whole thing grinds to a halt until someone with the skill goes in and pushes the number of invalidations down again. Instead, just like compilation time, I feel this should be tracked via some time-series functionality where it can be monitored; when some threshold is reached it might start blinking red or something. But hard-failing CI because someone's small fix pushes the number of backedges from 163 to 170, while the PR merged yesterday took it from 130 to 160, seems too harsh.
While true, the difficulty of fixing the issue also needs to be weighed in; in one case it is trivial and in the other very advanced.
Excellent points as usual. Maybe a separate buildbot pass, then? I'm clueless about how one goes about implementing this.
The CUDA people have a nice benchmark site based on codespeed that shows timelines for various benchmarks (https://speed.juliagpu.org/timeline/#/?exe=5,3,4&base=3+74&ben=grid&env=1&revs=50&equid=off&quarts=on&extr=on). It would be nice to hook up the daily Nanosoldier benchmarks to an instance of that, and at the same time we could maybe add some compilation-time benchmarks (JuliaCI/BaseBenchmarks.jl#252) but also some "static" info like the number of backedges, etc., so that could also be tracked over time.
That makes sense, but I worry about how many people actually look at those numbers. I don't check Julia's performance routinely. On the other hand, as long as it's checked before the release, perhaps that would be enough. Then again, if we had a separate build pass, there's no way you'd miss a change, although one could decide to go ahead and merge anyway. It does make sense to separate it out from "standard" tests, of course.
Maybe that is because we have no way of nicely visualizing it right now.
Well, that build pass is going to be green and ignored until it turns red, and then it will still mostly be ignored. And when someone (you?) sees it, then what do you do? You probably want to find the place where the bad stuff happened, but since you don't really have a pass/fail criterion, the best thing at that point is probably to start building a time series to see if there are points where things suddenly got worse. You could try a bisection, but if the regression is gradual, it won't necessarily give you something valuable.
Fair enough. I'm not sure how one gets started on that, though.
Hm, @staticfloat how long do we save the logs from the buildbot? If we add something that prints to the build log, would it be easy to build up a timeseries from that? For now, perhaps we can:
That means we can still merge the PR and the code gets regularly run so it doesn't bit-rot, and then we can think a bit more about how exactly the "holistic" measurement should be presented / checked?
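For concreteness, a minimal sketch of the kind of "print it to the build log" metric collection being discussed here, assuming the MethodAnalysis.jl package (not something this PR depends on); the helper name and the log format are made up:

```julia
# Hedged sketch only: count precompiled MethodInstances in Base and the rough
# total number of backedges they carry, and print the numbers so a CI log
# scraper could later turn them into a time series.
using MethodAnalysis   # assumed dependency; provides methodinstances()

function count_backedges(mod::Module)
    nmis = 0
    nbackedges = 0
    for mi in methodinstances(mod)
        nmis += 1
        # `backedges` is undefined for MethodInstances nothing depends on
        isdefined(mi, :backedges) && (nbackedges += length(mi.backedges))
    end
    return nmis, nbackedges
end

nmis, nbackedges = count_backedges(Base)
println("INFERENCE-METRICS Base: $nmis MethodInstances, $nbackedges backedges")
```

A scraper could then grep the build log for the `INFERENCE-METRICS` prefix to build the time series.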
Could we do a before/after comparison like is done e.g. for code coverage? "This PR increases invalidation susceptibility by 2.3%." or whatever.
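Purely as illustration of what such a bot might post, with made-up numbers and a placeholder metric:

```julia
# Toy sketch of a coverage-style before/after summary; `baseline` and
# `candidate` would come from two CI runs of whatever metric is chosen.
baseline  = 163   # e.g. at-risk backedges on master (made-up number)
candidate = 170   # same metric on the PR branch (made-up number)
change = 100 * (candidate - baseline) / baseline
println("This PR increases invalidation susceptibility by ",
        round(change; digits = 1), "%")
```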
The buildbots clean things up after 120 days. You can do this if you want to, but note that you are the second or third person in the last month who has wanted to store timeseries data for packages. I've been thinking about hosting a timeseries database like influxdb or timescaledb for the (authenticated) Julia community. Combined with a small wrapper over HTTP (like this abandoned package), it would allow us to
I do like the idea of a time series, but I think the more important one is to be notified when PRs add a lot of poorly-inferred calls. What I had in mind was best summarized by @martinholters (#37193 (comment)).
xref https://github.com/julia-actions/julia-invalidations, thanks @ianshmean! That covers packages, so we just need something like that for Julia itself (based on potential invalidations like this PR).
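For packages, that action essentially automates SnoopCompile's documented invalidation-counting workflow, roughly the following sketch (`Example` is just a stand-in for whatever package is being tested):

```julia
# Count the method invalidations triggered by loading a package, roughly
# what julia-actions/julia-invalidations reports for packages.
using SnoopCompileCore
invalidations = @snoopr using Example   # record invalidations while loading
using SnoopCompile
trees = invalidation_trees(invalidations)
println("Loading Example produced ", length(invalidations),
        " invalidation log entries across ", length(trees),
        " method-invalidation trees")
```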
Is there any hope of doing something like this for Julia 1.6? We don't seem to have regressed, except there seem to be more ways to invalidate
@timholy could you check if https://github.com/JuliaLang/julia/tree/kc/stability_preferences helps with that? |
I'm personally OK with putting in whatever you think would be helpful. I still worry a bit about a potential future PR "deadlock" when this starts erroring and new contributors get confused about what they did wrong. But we can always disable the tests if they become disruptive, so 🤷
How about we aim to eliminate vulnerabilities in paths leading to
OK, this seems to be working (the test failure seems incidental). Ready for review. I am not terribly attached to the
I am seeing the tests added in this PR failing in #38437
Did CI run last time 7 days ago? Maybe it had time to regress in that time?
Hmm, I should have re-run CI, although it seems broken right now. Indeed, something has changed. Specifically, the test I think is the most important, the impossibility of invalidating `Base.require`, now fails:

```julia
julia> m = methods(Base.require).ms[2]
require(uuidkey::Base.PkgId) in Base at loading.jl:900

julia> mis = remove_unlikely_methodinstances(atrisk_triggers(m, first.(mivulnerabilities_exported)))
2-element Vector{MethodInstance}:
 MethodInstance for convert(::Type{Nothing}, ::Any)
 MethodInstance for convert(::Type{Union{}}, ::Any)

julia> using Cthulhu

julia> ascend(mis[1])
Choose a call for analysis (q to quit):
 convert(::Type{Nothing}, ::Any)
   setindex!(::IdDict{Dict{String, Any}, Nothing}, ::Any, ::Any)
 +   push!(::IdSet{Dict{String, Any}}, ::Dict{String, Any})
 + setindex!(::IdDict{Any, Nothing}, ::Any, ::Any)
   setindex!(::IdDict{MethodInstance, Nothing}, ::Any, ::Any)
 +   push!(::IdSet{MethodInstance}, ::MethodInstance)
     push!(::IdSet{MethodInstance}, ::Any)
 >     treelist!(::FoldingTrees.Node{Cthulhu.Data{MethodInstance}}, ::IOBuffer, ::Any, ::String, ::IdSet{MethodInstance})
```

I'm showing the tree in collapsed form, folding everything below a "reasonable" call. It's basically all IdDict/IdSet-related. The offending line is line 74 at 110f125.
There's a fix in #38455.
This PR adds an overall "code quality assurance" test, checking elements of inferrability of Base and the stdlibs. It largely tests just the precompiled MethodInstances, so it does not offer comprehensive coverage, but these are connected to invalidations, so this offers a form of protection against regression of our recent progress.

There are several elements of the test:

- It checks that `Base.require` is "bullet-proof" against invalidation. Anytime `Base.require` gets invalidated, it's more than half a second before you can start loading the next package in a sequence, so invalidations here are a big problem from the standpoint of package-loading performance. Currently this is marked `@test_broken`, because we still have some string-processing code with inference problems that could invalidate these methods, but a goal will be to turn these into `@test`s.
- It counts the number of worrisome (poorly-inferred, widely-depended-upon) MethodInstances (`length(badcounts)` in the test code), as well as the average number of total dependents (backedges, and the backedges of their backedges, etc.) of these methods (`meancounts` in the test code). Currently this PR sets the bar at 1250 worrisome MethodInstances and 33 average dependents. If/when my current unmerged PRs make it in, these can drop to 1000 and 32, respectively. For reference, Julia 1.5 has 2030 such MethodInstances with an average of 38.5 dependents. Most of what's left to fix appears to be in the area of string processing, so with a bit more effort we may be able to drop these numbers quite dramatically.
- It checks for poorly-inferred calls of certain specific methods (e.g., `is*` functions, which should generally return `Bool`), and whether specific unfortunate variants of certain other methods (`==`, `<`, etc.) have been inferred. The absence of such MethodInstances indicates that all code that calls them is well-inferred. (A sketch of this kind of check appears below.)

It's worth acknowledging that this may put a new burden on contributors: not only do they need to contribute the functionality, but they need to make sure it's not dramatically worsening inference. I think this is worth having. After all, we check whitespace, and whitespace is probably not nearly as important to the overall user experience as having well-inferred, invalidation-resistant code. Nevertheless, to mitigate the extra hurdles, I've tried to put some hints near places where tests might fail, to help developers find resources to help them.
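To make the last kind of check concrete, here is a hedged sketch (not the actual test code in this PR) of how the absence of an unfortunate specialization such as `==(::Any, ::Any)` could be asserted, assuming MethodAnalysis.jl's `methodinstances`:

```julia
# If no caller was forced to compile the fully-abstract specialization
# ==(::Any, ::Any), then every compiled caller of == inferred its argument
# types, which is the property this kind of test tries to protect.
using MethodAnalysis, Test

bad = filter(methodinstances(==)) do mi
    mi.specTypes === Tuple{typeof(==), Any, Any}
end
@test isempty(bad)
```

The actual test applies this idea to a list of operators and `is*` predicates rather than a single function.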
Closes #36393