Improve performance of CI system #1407

Closed
@consideRatio

We have crippled our CI system's performance after introducing support for arm64-based images. A key reason is that emulating arm64 on the amd64-based runners GitHub provides is far slower than building natively. On top of that, we now build base-notebook and minimal-notebook for arm64 in sequence alongside the other images.

I'm not fully sure how we should optimize this in the long run. Assuming we will have high-performance self-hosted arm64-based GitHub Actions runners that can work in parallel with the amd64 runners, below is an overview of a heavily optimized system, where several parts can be implemented separately.

  1. Nightly builds
    We have nightly builds with nightly-amd64 and nightly-arm64 tags.

  2. amd64 / arm64 in parallel
    All tests for amd64 and arm64 run in parallel, relying on nightly-amd64 and nightly-arm64 caches

  3. Images in parallel where possible
    Tests for each image run in a dedicated job that needs the job for its base image to complete.

    Images then build in stages, one stage per line below. Images separated by | within a stage can run in parallel:

    • base
    • minimal
    • scipy | r
    • tensorflow | datascience | pyspark
    • all-spark
  4. Avoid rebuilds when merging
    Tests finish by updating a GitHub container registry associated with the PR. Our publishing job on merge to master can then opt to reuse the images as they were built during tests, if they are considered fresh enough.

  5. Parallel manifest creation
    A merge to the default branch triggers manifest creation jobs on both amd64 and arm64. If we opt not to optimize using step 4, these jobs could also build fresh images, using the nightly cache first.

  6. Combine manifests into one before pushing to official registry
    A merge to the default branch triggers a job that pulls both the amd64 image and the arm64 image and defines a combined docker manifest, which is then pushed to our official container registry. I think this could be done with something like docker manifest create <name of combined image> <amd64 only image> <arm64 only image>, but @manics knows more and I lack experience with this.
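
To make steps 3 and 6 a bit more concrete, here is a rough Python sketch, not a proposal for the actual workflow code. It assumes each image has a single parent, and the PARENTS mapping is my reading of the stages listed in step 3 rather than something read from the repository. The helper names build_stages and manifest_commands, and the per-architecture tags in the usage example, are hypothetical. The second helper only builds the docker manifest create and docker manifest push command lines; it does not run them.

```python
from collections import defaultdict

# Assumed parent (base image) of each image. This mapping is chosen to match
# the stages listed in step 3; it is not read from the repository.
PARENTS = {
    "base": None,
    "minimal": "base",
    "scipy": "minimal",
    "r": "minimal",
    "tensorflow": "scipy",
    "datascience": "scipy",
    "pyspark": "scipy",
    "all-spark": "pyspark",
}


def build_stages(parents):
    """Group images into stages; images in the same stage can run in parallel."""
    depth = {}

    def depth_of(image):
        # An image's stage is one more than its parent's stage.
        if image not in depth:
            parent = parents[image]
            depth[image] = 0 if parent is None else depth_of(parent) + 1
        return depth[image]

    stages = defaultdict(list)
    for image in parents:
        stages[depth_of(image)].append(image)
    return [sorted(stages[level]) for level in sorted(stages)]


def manifest_commands(combined, amd64_image, arm64_image):
    """Return argv lists that create and then push a combined manifest."""
    return [
        ["docker", "manifest", "create", combined, amd64_image, arm64_image],
        ["docker", "manifest", "push", combined],
    ]


if __name__ == "__main__":
    for number, stage in enumerate(build_stages(PARENTS)):
        print(number, " | ".join(stage))
    # Hypothetical tag names, used only for illustration.
    for argv in manifest_commands(
        "jupyter/base-notebook:latest",
        "jupyter/base-notebook:latest-amd64",
        "jupyter/base-notebook:latest-arm64",
    ):
        print(" ".join(argv))
```

In a GitHub Actions workflow, each stage would roughly correspond to jobs that declare the previous stage's job in needs, with one such chain per architecture.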

Standalone performance issue

This standalone issue will go away if we adopt better strategies like the ones above, and I'd say it isn't critical to fix either. Currently, though, we rebuild minimal-notebook without using the cache during push-multi, even when push-multi for base-notebook has already run. I think this is because we re-tag jupyter/base-notebook:latest.

Labels: type:Arm (Issue specific to arm architecture), type:Enhancement (A proposed enhancement to the docker images)
