Description
opened on May 19, 2024
As of this writing, our CI tests (specified in .github/workflows/ci.yml) take ~5.5m to run end-to-end during PR development and ~19.5m to run end-to-end in the merge queue. This significantly affects developer velocity, especially when developing a sequence of features which stack (i.e., one PR needs to land before the next PR can be seriously considered).
This task tracks optimizing our end-to-end CI latency. Anything is on the table!
Note that both the PR latency and the merge queue latency are in scope. The PR latency is the more important metric, since PR tests may run multiple times during PR development. However, given that GitHub has no automated way to merge a stack of PRs, we often have to actively keep an eye on the merge queue in order to know when we can kick off the next PR's merge. For this reason, merge queue latency is important as well.
Advice
As of this writing, we skip 5 out of 7 build targets and all Miri tests during PR development. Thus, the merge queue CI tests have somewhat different performance characteristics than PR CI tests.
In my own investigations, I've discovered the following:
- In the merge queue, the bottleneck seems to be the build_test job, which encompasses the primary test matrix (there are other ancillary jobs such as kani, check_fmt, etc.; these do not appear to be the bottleneck)
- Among individual matrix jobs, the distribution of times appears to be highly bimodal:
  - Most matrix jobs take ~1-2m to complete
  - Some matrix jobs take ~13m to complete
  - What distinguishes the two appears to be Miri tests, which are run only in the latter (~13m) group
- It also seems to take a few minutes just to spawn all of the ~200 jobs in the matrix (before they start executing)
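To make these observations easier to reproduce, here is a minimal Python sketch that splits a run's jobs into fast and slow groups and measures how far apart the jobs started. It assumes each job record carries `name`, `started_at`, and `completed_at` fields, as in the GitHub REST API's job listing for a workflow run. The sample job names, timestamps, and the 5-minute threshold are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Assumes timestamps of the form "2024-05-19T12:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def summarize_jobs(jobs, slow_threshold_s=300):
    """Group jobs by duration and report the spread in their start times."""
    durations = {}
    starts = []
    for job in jobs:
        start = parse_ts(job["started_at"])
        end = parse_ts(job["completed_at"])
        durations[job["name"]] = (end - start).total_seconds()
        starts.append(start)
    return {
        # Jobs under the threshold (the ~1-2m mode in the notes above).
        "fast": sorted(n for n, d in durations.items() if d < slow_threshold_s),
        # Jobs at or over the threshold (the ~13m mode).
        "slow": sorted(n for n, d in durations.items() if d >= slow_threshold_s),
        # Gap between the first and last job start, a rough proxy for the
        # time spent spawning the matrix.
        "spawn_spread_s": (max(starts) - min(starts)).total_seconds(),
        # The longest single job, which bounds the matrix's wall-clock time.
        "longest_s": max(durations.values()),
    }


# Hypothetical sample data; real records would come from the CI provider.
SAMPLE_JOBS = [
    {"name": "job-a", "started_at": "2024-05-19T12:00:00Z",
     "completed_at": "2024-05-19T12:01:30Z"},
    {"name": "job-b", "started_at": "2024-05-19T12:01:00Z",
     "completed_at": "2024-05-19T12:14:00Z"},
    {"name": "job-c", "started_at": "2024-05-19T12:03:00Z",
     "completed_at": "2024-05-19T12:05:00Z"},
]

summary = summarize_jobs(SAMPLE_JOBS)
print(summary)
```

Running this against real job records before and after a change gives a quick check of whether the slow group shrank or the spawn spread moved, rather than relying on eyeballing the Actions UI.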
We've already done some work to speed up Miri test execution (recently, #1307, #1308, and #1313). There is probably a lot more that could be done there.
There are probably also a lot of other optimization opportunities besides Miri; I just haven't taken the time to investigate in detail.
Failed attempts
I tried these, but found no speedup, or wasn't able to get them working:
- [ci] Run Miri test alias models in parallel #1311 - no measurable speedup
- [ci] Run on the ubuntu-20.04-64core runner #1309 - confusing build failures