Test infra code freeze to aid in stability for the kubernetes release #907
Description
This was brought up some time ago and was discussed during this release.
The idea is as follows: the release team relies on CI to gauge the quality of Kubernetes throughout the cycle.
CI jobs themselves are dependent on the reliability of test infra (i.e., Prow, Boskos).
During 1.16, just when we were getting ready to go into code freeze, Boskos started failing in such a way that all CI jobs began to flake rather frequently for an entire day and then converted into constant run failures.
The biggest issue is that it took us days just to identify that there was indeed an issue and that Boskos was responsible for it.
This led to the idea of applying a code freeze to test infra, which would essentially result in Kubernetes being tested with the same version of Prow, Boskos, etc. for the duration of code freeze. The goal was to separate the Kubernetes signal from the test-infra signal.
This idea was brought up during a SIG Testing call (the recording has not been made available due to technical difficulties) and was subsequently pursued in various Slack threads.
The original conversation was had here https://kubernetes.slack.com/archives/C09QZ4DQB/p1570558307210700
and these were some of the responses to the idea:
I am not convinced a test-infra code freeze will achieve the goal of more stability.
The boskos update was done in an unusual way with the belief it was a noop. Updating boskos regularly with the rest of prow is more likely to have a positive impact (which we started doing after code freeze).
The other issues I'm aware of -- such as nodes going unhealthy -- tend to not be related to pushes, so not pushing things won't help here.
on-call was actually engaged more or less immediately in the boskos case, but since we were looking at just a couple of test failures we ended up dismissing it as a flake (because this is, after all, Kubernetes) until we came back around to it a day later.
Once we figured out it was a boskos problem we were more than happy to roll back, but it was difficult to figure out a) to what, and b) that it was safe
We could have responded a day faster if we had alerting on e.g. boskos returning 500 errors
another question: how often is prow updated?
Daily.
(at the time boskos was not, and was then inadvertently updated to a random old version. it now follows the rest of prow)
ah ok. wanted to see how much of a slow down a code freeze would be (they usually last ~2 weeks)
Freezing prow deployments carries some risk — development would not stop, but we would suddenly receive two weeks of deployments at once instead of one day’s worth
I think it unlikely a complete code freeze would fly, since there are a lot of non-Kubernetes developers and deployments
yeah, code freeze doesn't sound good with that in mind. though I guess the big thing right now is this: "on-call was actually engaged more or less immediately in the boskos case, but since we were looking at just a couple of test failures we ended up dismissing it as a flake (because this is, after all, Kubernetes) until we came back around to it a day later."
code freeze doesn't sound like a good solution, but how could we quicken the response time in the future?
how sane is it to assume that a failure on a prow-related component should merit debug-time (not be dismissed)?
have tests that are less flaky and run faster so we don’t assume all failures are just test flakes? 😛
it seems that the other major takeaway is "standardize boskos deployment to the rest of prow" which is done, yes?
we did that, yes
this specific failure mode is unlikely in the future
I think the easiest way to demonstrate that things have actually gotten worse is just being able to link to a bunch of failures, and perhaps some graph showing that there are now more failures
ideally nobody has to do this because we have monitoring and alerting that catches it
(currently we do not.)
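
As a rough illustration of the "alerting on e.g. boskos returning 500 errors" idea from the thread, here is a minimal sketch of a sliding-window error-rate check. It is not how Prow or Boskos is actually monitored: the class name `ErrorRateAlert`, the five-minute window, the 5% threshold, and the minimum request count are all hypothetical values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Fires when the share of HTTP 5xx responses in a sliding window is too high."""

    window_seconds: float = 300.0  # hypothetical: consider the last 5 minutes
    threshold: float = 0.05        # hypothetical: alert above 5% server errors
    min_requests: int = 20         # hypothetical: ignore windows with too little traffic
    _events: deque = field(default_factory=deque)  # (timestamp, is_error) pairs

    def record(self, timestamp: float, status_code: int) -> None:
        """Store one observed response; timestamps must be non-decreasing."""
        self._events.append((timestamp, 500 <= status_code <= 599))

    def should_alert(self, now: float) -> bool:
        """Return True if the 5xx rate inside the window exceeds the threshold."""
        # Drop observations that have aged out of the window.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

The minimum request count is there so that one or two failed requests on a quiet service do not page anyone, which is the same "is this just a flake?" judgment the on-call had to make by hand.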
For the curious, these were the two follow ups we had
- https://kubernetes.slack.com/archives/C2C40FMNF/p1571678658295300
- https://kubernetes.slack.com/archives/C2C40FMNF/p1571679294297800
Some interesting takeaways from this last conversation:
How is prow/boskos/plank/etc tested before running in production?
"it isn't"
Or more accurately, the battery of unit tests and (hopefully) some sort of local testing by the developer.
but, as of today, prow is very much tested in prod.
(specifically it's tested on prow.k8s.io, with every other deployment lagging behind. which is not ideal.)
And in general we've rehashed this conversation a million times and should write a document on why a staging environment or end to end tests are prohibitively difficult...
If we stop updating everything for two weeks:
- we are still vulnerable to infra issues (1.15) and possibly also mistakes (1.16 - we didn't think we'd touched boskos at all)
- when we update things, we suddenly drop two weeks of changes at once and it is very likely that we introduce new bugs and have a hard time finding them
Things go awry when something in the system fails. Bumping is one small trigger for a failure. The current level of metrics measurement and alerting fails to catch failure states, often.
Focusing on instrumentation for alerts instead of a slower bump cadence to me seems much more useful.
We used to email kubernetes-dev when things were busted, but that has fallen by the wayside since spiffxp went on leave
we should probably pick that up again
I think one thing I mentioned that may or may not be useful are more formal postmortem documents. Identifying root causes, why automated tooling couldn't or didn't measure the system degrading and fire off alerts.
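
To make "measure the system degrading" a little more concrete, here is a minimal sketch that compares the failure rate of recent CI runs against a baseline period, so that a real regression can be told apart from ordinary flakiness. The function names and all thresholds (`min_runs`, `ratio`, `min_increase`) are assumptions made for illustration; a real check would more likely live in a monitoring system and might use a proper statistical test.

```python
from typing import Sequence


def failure_rate(results: Sequence[bool]) -> float:
    """Fraction of runs that failed; each entry is True for a passing run."""
    if not results:
        return 0.0
    return sum(1 for passed in results if not passed) / len(results)


def looks_like_regression(
    baseline: Sequence[bool],
    recent: Sequence[bool],
    min_runs: int = 30,          # hypothetical: need enough recent runs to judge
    ratio: float = 2.0,          # hypothetical: failure rate must at least double
    min_increase: float = 0.10,  # hypothetical: and rise by 10 percentage points
) -> bool:
    """Return True when recent runs fail much more often than the baseline."""
    if len(recent) < min_runs:
        return False
    base = failure_rate(baseline)
    now = failure_rate(recent)
    return now >= base * ratio and (now - base) >= min_increase
```

Requiring both a relative and an absolute increase keeps a job that moves from 1% to 2% failures from being flagged, while a jump from 5% to 40% would be.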
I apologize for the extensive wall of text, but some areas of work that came from the discussion are the following:
- Write a document on why a staging environment or end-to-end tests are prohibitively difficult for test-infra components. Contributors should be able to understand how we guarantee the quality of our services/tooling.
- Focus on instrumentation and alerting rather than on a slower bump cadence, which seems much more useful.
- Email kubernetes-dev when things are busted, and make sure we write postmortems.
/sig release
/sig testing
/priority important-longterm
/milestone v1.18
/cc @kubernetes/release-engineering @kubernetes/release-team