Optimize CI

As of this writing, our CI tests (specified in `.github/workflows/ci.yml`) take [~5.5m to run end-to-end](https://github.com/google/zerocopy/actions/runs/9149665633) during PR development and [~19.5m to run end-to-end](https://github.com/google/zerocopy/actions/runs/9149612263) in the merge queue. This significantly affects developer velocity, especially when developing a sequence of features that stack (i.e., one PR needs to land before the next PR can be seriously considered).

This task tracks optimizing our end-to-end CI latency. Anything is on the table!

Note that both the PR latency and the merge queue latency are in scope. The PR latency is the more important metric, since PR tests may run multiple times during PR development. However, given that GitHub has no automated way to merge a stack of PRs, we often have to actively keep an eye on the merge queue in order to know when we can kick off the next PR's merge. For this reason, merge queue latency is important as well.

## Advice

As of this writing, we skip 5 out of 7 build targets and all Miri tests during PR development. Thus, the merge queue CI tests have somewhat different performance characteristics than PR CI tests.

In my own investigations, I've discovered the following:
- In the merge queue, the bottleneck seems to be the `build_test` job, which encompasses the primary test matrix (there are other ancillary jobs such as `kani`, `check_fmt`, etc.; these do not appear to be the bottleneck)
- Among individual matrix jobs, the distribution of times appears to be highly bimodal:
  - Most matrix jobs take ~1-2m to complete
  - Some matrix jobs take ~13m to complete
  - What distinguishes the two appears to be Miri tests, which are run only in the latter (~13m) group
- It also seems to take a few minutes just to spawn all of the ~200 jobs in the matrix (before they start executing)
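To re-check the bimodal split after a change, one option is to bucket job durations from a run's job list. The following is a minimal sketch, not something used in this repo: it assumes job records shaped like the GitHub REST API's "list jobs for a workflow run" response (`name`, `started_at`, `completed_at`), and the function names, the 5-minute threshold, and the sample data are all hypothetical.

```python
from datetime import datetime


def job_minutes(job):
    """Return the wall-clock duration of one job record, in minutes."""
    # Timestamps are assumed to be ISO 8601 with a trailing "Z" (UTC).
    start = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(job["completed_at"].replace("Z", "+00:00"))
    return (end - start).total_seconds() / 60


def split_by_duration(jobs, threshold_minutes=5.0):
    """Split job names into (fast, slow) lists around a duration threshold.

    The default threshold is an assumption chosen to sit between the
    ~1-2m and ~13m groups described above.
    """
    fast, slow = [], []
    for job in jobs:
        # Skip jobs that have not started or finished yet.
        if not job.get("started_at") or not job.get("completed_at"):
            continue
        bucket = slow if job_minutes(job) >= threshold_minutes else fast
        bucket.append(job["name"])
    return fast, slow


# Hypothetical sample records, not taken from a real run.
sample_jobs = [
    {"name": "build_test (a)", "started_at": "2024-05-19T10:00:00Z",
     "completed_at": "2024-05-19T10:01:30Z"},
    {"name": "build_test (b)", "started_at": "2024-05-19T10:00:00Z",
     "completed_at": "2024-05-19T10:13:00Z"},
    {"name": "build_test (c)", "started_at": "2024-05-19T10:00:00Z",
     "completed_at": None},
]

fast_jobs, slow_jobs = split_by_duration(sample_jobs)
```

If the slow bucket contains anything other than jobs that run Miri, that would suggest a second bottleneck worth investigating.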

We've already done some work to speed up Miri test execution (recently, #1307, #1308, and #1313). There is probably a lot more that could be done there.

There are probably also a lot of other optimization opportunities besides Miri; I just haven't taken the time to investigate in detail.

See also: #1312, #1314

## Failed attempts

I tried these, but found no speedup, or wasn't able to get them working:
- #1311 - no measurable speedup
- #1309 - confusing build failures

Optimize CI #1310

joshlf opened on May 19, 2024
