Description
We have crippled our CI system's performance after introducing support for arm64-based images. A key reason is that emulating arm64 images on the amd64-based runners GitHub provides is far slower than building natively. On top of that, we now build base-notebook and minimal-notebook for arm64 in sequence alongside the other images.
I'm not fully sure how we should optimize this in the long run, but below is an overview of a highly optimized system, under the assumption that we will have high-performance self-hosted arm64-based GitHub Actions runners that can work in parallel with the amd64 runners. Several parts of it can be done separately.
1. Nightly builds

   We have nightly builds with `nightly-amd64` and `nightly-arm64` tags.

2. amd64 / arm64 in parallel

   All tests for amd64 and arm64 run in parallel, relying on `nightly-amd64` and `nightly-arm64` caches.

3. Images in parallel where possible

   All tests for individual images are run in a dedicated job that `needs` its base image job to complete. Some images can run in parallel (those listed on the same line below):

   - base
   - minimal
   - scipy | r
   - tensorflow | datascience | pyspark
   - all-spark

4. Avoid rebuilds when merging

   Tests finish by updating a GitHub container registry associated with a PR. By doing so, our publishing job on merge to master can opt to use the images as they were built during tests if they are considered fresh enough.

5. Parallel manifest creation

   Merge to default branch triggers manifest creation jobs on both amd64 and arm64. If we opt not to optimize using step 4, then this could also build fresh images using the nightly cache first.

6. Combine manifests into one before pushing to official registry

   Merge to default branch triggers a job that pulls both the amd64 image and the arm64 image and defines a combined docker manifest, which is then pushed to our official container registry. I think this could be done with something like `docker manifest create <name of combined image> <amd64 only image> <arm64 only image>`, but @manics knows more and I lack experience with this.
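
To make the image ordering and the manifest step above a bit more concrete, here is a rough Python sketch (Python 3.9+ for `graphlib`). It is only an illustration, not a proposal for the actual workflow code. The parent/child relationships in `BASE_IMAGE` are my assumption based on the ordering listed above, the per-architecture tag names (`amd64-latest`, `arm64-latest`) are hypothetical, and the function names are made up. `docker manifest create` and `docker manifest push` are existing Docker CLI subcommands, but the script only builds the command lines and never runs them.

```python
from graphlib import TopologicalSorter

# ASSUMPTION: which image each image is built FROM. This mirrors the
# ordering listed above and should be checked against the Dockerfiles.
BASE_IMAGE = {
    "base-notebook": None,
    "minimal-notebook": "base-notebook",
    "scipy-notebook": "minimal-notebook",
    "r-notebook": "minimal-notebook",
    "tensorflow-notebook": "scipy-notebook",
    "datascience-notebook": "scipy-notebook",
    "pyspark-notebook": "scipy-notebook",
    "all-spark-notebook": "pyspark-notebook",
}


def build_stages(base_image):
    """Group images into stages; images within one stage have no
    dependency on each other and could be built and tested in parallel."""
    graph = {
        image: ([parent] if parent else [])
        for image, parent in base_image.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        # Everything whose base image is already done is ready now.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


def manifest_commands(image, owner="jupyter", tag="latest"):
    """Return the docker commands (as argument lists) that would combine
    the per-architecture images into one multi-arch manifest.
    The '<arch>-<tag>' naming is hypothetical."""
    combined = f"{owner}/{image}:{tag}"
    per_arch = [f"{owner}/{image}:{arch}-{tag}" for arch in ("amd64", "arm64")]
    return [
        ["docker", "manifest", "create", combined, *per_arch],
        ["docker", "manifest", "push", combined],
    ]


if __name__ == "__main__":
    for number, stage in enumerate(build_stages(BASE_IMAGE), start=1):
        print(f"stage {number}: {' | '.join(stage)}")
    for command in manifest_commands("base-notebook"):
        print(" ".join(command))
```

In a real workflow the stages would more likely be expressed directly as job dependencies rather than computed, but the sketch shows that the ordering gives five sequential stages per architecture instead of eight sequential builds.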
Standalone performance issue
This standalone issue will go away once we use better strategies like those above, and I'd say it isn't critical to fix either. Currently, though, we build minimal-notebook again without using the cache during `push-multi`, assuming `push-multi` for `base-notebook` has already run. I think this is because we re-tag jupyter/base-notebook:latest.