This directory contains all of our automatically triggered workflows.

# Test runner

Our top-level `test_runner.yml` is responsible for kicking off all tests, which
are represented as reusable workflows. It is carefully constructed to satisfy
the design laid out in go/protobuf-gha-protected-resources (see below), and
duplicating it across every workflow file would be difficult to maintain. As an
added bonus, we can manually dispatch our full test suite with a single button
and monitor the progress of all the tests simultaneously in GitHub's actions UI.

There are five ways our test suite can be triggered:

- **Post-submit tests** (`push`): These are run over newly submitted code
that we can assume has been thoroughly reviewed. There are no additional
security concerns here, and these jobs can be given highly privileged access to
our internal resources and caches.

- **Pre-submit tests from a branch** (`pull_request`): These are run over
every PR as changes are made. Since they come from branches in our
repository, they have secret access by default and can also be given highly
privileged access. However, we expect *many* of these events per change,
and likely many from abandoned or exploratory changes. Given the much higher
frequency, we restrict the ability to *write* to our more expensive caches.

- **Pre-submit tests from a fork** (`pull_request_target`): These are run
over every PR from a forked repository as changes are made. These have much
more restricted access, since they could be coming from anywhere. To protect
our secret keys and our resources, tests will not run until a commit has been
labeled `safe to submit`. Further commits will require further approvals to
run our test suite. Once marked as safe, we will provide read-only access to
our caches and Docker images, but will generally disallow any writes to shared
resources.

- **Continuous tests** (`schedule`): These are run on a fixed schedule. We
currently have them set up to run daily. They can help identify non-hermetic
issues in tests that don't run often (for example, due to test caching) or
that surface during slow periods like weekends and holidays. Similar to
post-submit tests, these are run over submitted code and are highly privileged
in the resources they can use.

- **Manual testing** (`workflow_dispatch`): Our test runner can be triggered
manually over any branch. This is treated similarly to pre-submit tests, but
can be highly privileged since only the protobuf team can trigger it.

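The five triggers above map directly onto a workflow's `on:` block. A minimal
sketch follows; the event names are real GitHub Actions triggers, but the
branch filters, label types, and cron line are illustrative, not our actual
configuration:

```yaml
# Illustrative sketch of a test_runner.yml trigger block; filter values
# here are assumptions, not our actual configuration.
on:
  push:
    branches: [main]          # post-submit: fully privileged
  pull_request:               # pre-submit from a branch in this repo
    branches: [main]
  pull_request_target:        # pre-submit from a fork; gated on labels
    types: [labeled, opened, synchronize]
  schedule:
    - cron: '0 8 * * *'       # continuous: daily run
  workflow_dispatch:          # manual runs by the protobuf team
```
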
# Staleness handling

While Bazel handles code generation seamlessly, we do support build systems
that don't. There are a handful of cases where we need to check in generated
files that can become stale over time. In order to provide a good developer
experience, we've implemented a system to make this more manageable.

- Stale files should have a corresponding `staleness_test` Bazel target. This
should be marked `manual` to avoid getting picked up in CI, but will fail if
files become stale. It also provides a `--fix` flag to update the stale files.

- Bazel tests will never depend on the checked-in versions, and will generate
new ones on-the-fly during the build.

- Non-Bazel tests will always regenerate necessary files before starting. This
is done using our `bash` and `docker` actions, which should be used for any
non-Bazel tests. This way, no tests will fail due to stale files.

- A post-submit job will immediately regenerate any stale files and commit them
if they've changed.

- A scheduled job will run late at night every day to make sure the post-submit
is working as expected (that is, it will run all the staleness tests).

The `regenerate_stale_files.sh` script is the central script responsible for
regenerating all stale files.

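The check-or-fix pattern behind a `staleness_test` can be sketched in a few
lines of shell. The generator function and file contents below are
hypothetical stand-ins, not the real Bazel machinery:

```shell
# Sketch of the staleness check/fix pattern; the generator and file
# contents are hypothetical stand-ins for real generated files.
set -eu

generate() { printf 'generated content v2\n'; }   # stand-in generator

golden=$(mktemp)
printf 'generated content v1\n' > "$golden"       # simulate a stale file

fresh=$(mktemp)
generate > "$fresh"

if cmp -s "$fresh" "$golden"; then
  echo "up to date"
else
  echo "stale"
  cp "$fresh" "$golden"                           # what --fix would do
fi

cmp -s "$fresh" "$golden" && echo "fixed"
```
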
# Forked PRs

Because we need secret access to run our tests, we use the `pull_request_target`
event for PRs coming from forked repositories. We do check out the code from the
PR's head, but the workflow files themselves are always fetched from the *base*
branch (that is, the branch we're merging into). Therefore, any changes to these
files won't be tested, so we explicitly ban PRs that touch these files.

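Under `pull_request_target`, checking out the PR's head has to be requested
explicitly, since the default checkout gives you the base branch. A minimal
sketch (the `ref` expression is standard GitHub Actions syntax; treat the
surrounding step as illustrative):

```yaml
# Illustrative: check out the PR head under pull_request_target.
# The workflow file itself still comes from the base branch.
steps:
  - uses: actions/checkout@v4
    with:
      ref: ${{ github.event.pull_request.head.sha }}
```
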
# Caches

We have a number of different caching strategies to help speed up tests. These
live either in GCP buckets or in our GitHub repository cache. The former has
a lot of resources available, and we don't have to worry as much about bloat.
On the other hand, the GitHub repository cache is limited to 10GB, and will
start pruning old caches when it exceeds that threshold. Therefore, we need
to be very careful about the size and quantity of our caches in order to
maximize the gains.

## Bazel remote cache

As described in https://bazel.build/remote/caching, remote caching allows us to
offload a lot of our build steps to a remote server that holds a cache of
previous builds. We use our GCP project for this storage, and configure
*every* Bazel call to use it. This provides substantial performance
improvements at minimal cost.

| 99 | + |
| 100 | +We do not allow forked PRs to upload updates to our Bazel caches, but they |
| 101 | +do use them. Every other event is given read/write access to the caches. |
| 102 | +Because Bazel behaves poorly under certain environment changes (such as |
| 103 | +toolchain, operating system), we try to use finely-grained caches. Each job |
| 104 | +should typically have its own cache to avoid cross-pollution. |
| 105 | + |
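Wiring Bazel to a GCS-backed remote cache looks roughly like the following
`.bazelrc` fragment. The flags are real Bazel flags, but the bucket name is a
placeholder, not our actual configuration:

```
# Hypothetical .bazelrc fragment; the bucket name is a placeholder.
build --remote_cache=https://storage.googleapis.com/example-bazel-cache
build --google_default_credentials
# Read-only access, e.g. for forked PRs:
build --remote_upload_local_results=false
```
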
## Bazel repository cache

When Bazel starts up, it downloads all the external dependencies for a given
build and stores them in the repository cache. This cache is *separate* from
the remote cache and only exists locally. Because we have so many Bazel
dependencies, it can be a source of frequent flakes due to network issues.

To avoid this, we keep a cached version of the repository cache in GitHub's
action cache. Our full set of repository dependencies ends up being ~300MB,
which is fairly expensive given our 10GB maximum. The most expensive ones seem
to come from Java, which has some very large downstream dependencies.

Given the cost, we take a more conservative approach for this cache. Only push
events will ever write to this cache, but all events can read from it.
Additionally, we only store three caches for any given commit, one per platform.
This means that multiple jobs try to update the same cache, leading to a
race. GitHub rejects all but one of these updates, so we designed the system so
that caches are only updated if they've actually changed. That way, over time
(and multiple pushes) the repository caches will incrementally grow to encompass
all of our dependencies. A scheduled job runs monthly to clear these caches
to prevent unbounded growth as our dependencies evolve.

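The "only update if actually changed" behavior can be approximated by keying
the cache on a digest of its contents, so unchanged contents map to an
existing key and no upload happens. A hypothetical sketch, with made-up paths
and file contents:

```shell
# Sketch: derive a cache key from the repository-cache contents so that
# identical contents produce identical keys (no re-upload). Paths are fake.
set -eu

cache_dir=$(mktemp -d)
printf 'dep-a v1\n' > "$cache_dir/a"
printf 'dep-b v1\n' > "$cache_dir/b"

key() {
  # Hash the sorted file contents for a stable digest of the directory.
  (cd "$1" && find . -type f | sort | xargs cat) | sha256sum | cut -c1-16
}

before=$(key "$cache_dir")
after=$(key "$cache_dir")           # nothing changed: same key, skip upload
[ "$before" = "$after" ] && echo "cache unchanged, skipping upload"

printf 'dep-c v1\n' > "$cache_dir/c"
changed=$(key "$cache_dir")         # new dependency: new key, upload
[ "$before" != "$changed" ] && echo "cache changed, uploading"
```
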
## ccache

In order to speed up non-Bazel builds to be on par with Bazel, we make use of
[ccache](https://ccache.dev/). This intercepts all calls to the compiler and
caches the result. Subsequent calls with a cache hit will very quickly
short-circuit and return the already-computed result. This has minimal effect
on any *single* job, since we typically only run a single build. However, by
caching the ccache results in GitHub's action cache we can substantially
decrease the build time of subsequent runs.

One useful feature of ccache is that you can set a maximum cache size, and it
will automatically prune older results to keep below that limit. On Linux and
macOS CMake builds, we generally get 30MB caches and set a 100MB cache limit.
On Windows, with debug symbol stripping, we get ~70MB and set a 200MB cache
limit.

Because CMake builds tend to be our slowest, bottlenecking the entire CI
process, we use a fairly expensive strategy with ccache. All events will cache
their ccache directory, keyed by the commit and the branch. This means that
each PR and each branch will write its own set of caches. When looking up
which cache to use initially, each job will first look for a recent cache in
its current branch. If it can't find one, it will accept a cache from the base
branch (for example, PRs will initially use the latest cache from their target
branch).

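This branch-then-base-branch lookup maps naturally onto the prefix fallback of
`actions/cache`'s real `restore-keys` input; the key names below are
illustrative, not our actual keys:

```yaml
# Illustrative keys: exact hit for this commit/branch first, then any
# recent cache from the same branch, then one from the base branch.
- uses: actions/cache@v4
  with:
    path: .ccache
    key: ccache-${{ github.ref_name }}-${{ github.sha }}
    restore-keys: |
      ccache-${{ github.ref_name }}-
      ccache-${{ github.base_ref }}-
```
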
While the ccache caches quickly overrun our GitHub action cache, they also
quickly become useless. Since GitHub prunes caches based on the time they were
last used, this just means that we'll see quicker turnover.

## Bazelisk

Bazelisk will automatically download a pinned version of Bazel on first use.
This can lead to flakes, so to avoid that we cache the result, keyed on the
Bazel version. Only push events will write to this cache, but it's unlikely
to change very often.

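Keying on the pinned version can be done by hashing `.bazelversion`. A
hypothetical sketch using the real `actions/cache` action and `hashFiles`
expression (the cache path is Bazelisk's default on Linux; treat the key name
as illustrative):

```yaml
# Illustrative: cache Bazelisk's download, keyed on the pinned version.
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazelisk
    key: bazelisk-${{ hashFiles('.bazelversion') }}
```
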
## Docker images

Instead of downloading a fresh Docker image for every test run, we can save one
as a tarball using `docker image save`, cache it, and later restore it using
`docker image load`. This can decrease download times and also reduce flakes.
Note that Docker's load can actually be significantly slower than a pull in
certain situations. Therefore, we should reserve this strategy for only the
Docker images that are causing noticeable flakes.

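The save/restore round trip might look like the following workflow fragment;
the image name and cache key are placeholders, and the cache wiring is a
sketch rather than our actual setup:

```yaml
# Illustrative steps; image name and cache key are placeholders.
- run: docker image save -o image.tar example/test-image:latest
- uses: actions/cache/save@v4
  with:
    path: image.tar
    key: docker-example-test-image
# On later runs, after restoring image.tar from the cache:
- run: docker image load -i image.tar
```
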
## Pip dependencies

The `actions/setup-python` action we use for Python supports automated caching
of pip dependencies. We enable this to avoid having to download these
dependencies on every run, which can lead to flakes.

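Enabling it is a one-line addition to the real `actions/setup-python` action
(`cache: pip` is its documented input; the Python version here is
illustrative):

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: '3.12'
    cache: pip          # caches downloaded pip dependencies between runs
```
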
# Custom actions

We've defined a number of custom actions to abstract out shared pieces of our
workflows.

- **Bazel**: use this for running all Bazel tests. It can take either a single
Bazel command or a more general bash command. In the latter case, it provides
environment variables for running Bazel with all our standardized settings.

- **Bazel-Docker**: nearly identical to the **Bazel** action, this additionally
runs everything in a specified Docker image.

- **Bash**: use this for running non-Bazel tests. It takes a bash command and
runs it verbatim. It also handles the regeneration of stale files (which does
use Bazel), which non-Bazel tests might depend on.

- **Docker**: nearly identical to the **Bash** action, this additionally runs
everything in a specified Docker image.

- **ccache**: this sets up a ccache environment and initializes some
environment variables for standardized usage of ccache.

- **Cross-compile protoc**: this abstracts out the compilation of protoc using
our cross-compilation infrastructure. It will set a `PROTOC` environment
variable that gets automatically picked up by a lot of our infrastructure.
This is most useful in conjunction with the **Bash** action for non-Bazel
tests.
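
A job using these actions might look like the following sketch; the action
path and input name are assumptions about the interface, not its actual
definition:

```yaml
# Illustrative only; the action path and input names are assumptions.
- name: Run Bazel tests
  uses: ./.github/actions/bazel
  with:
    bazel: test //src/...
```
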