Skip to content

Improve efficiency of CI checks (so we can add MORE!) #13845

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

There is a tension between adding more tests and and PRs and code velocity (more tests --> longer CI)

DataFusion runs a many tests on every change to every PR. For example my most recent PR ran 24 tests (link) consuming over an hour of worker time.

This has several challenges

  • Increases CI cycle time and thus how fast contributors get feedback on their PRs
  • Limits new test we can add: As we contemplate adding even more testing (which we should) such as the sqllogictest suite adding them to CI would make the time even worse
  • wastes worker credits (though we don't run out of credits largely because the Apache Software Foundation has approximately unlimited time from github, but there are broader concerns, like the amount of CO2 generated by that waste 🙈 )

Another observation is that there are several tests also rarely fail in PRs, but offer important coverage such as the Windows and mac tests. We even disabled the Windows test due to its (lack of) speed.

Describe the solution you'd like

I would like to improve the efficiency of existing CI jobs as well as have a mechanism to run both new and existing tests that offer important coverage but take too long to run on each CI

Describe alternatives you've considered

Here are some options from my past lives:

Option 1: Change some tests to only run on merges to main

In this option, we could change some checks to only run on merges to main. For example, we could run the windows tests only on merges to main rather than also on PRs.

Instead of all jobs being triggered like

# trigger for all PRs and changes to main
on:
  push:
    branches:
      - main
  pull_request:

we would change some tests to run like

# trigger for only changes to main
on:
  push:
    branches:
      - main

pros:

  • simple to implement (change github branch matching rules)
    cons:
  • PRs could merge to main that broke the tests, so the tests would be brokenand someone would have to chase / fix the tests
  • Someone (maintainers?) would have to traige and fix these tests

We could probably add some sort of job that would automatically make revert PRs for any code that broke the main tests to help triage

Option 2: Implement more sophisticated merge flow

In this option, I would imagine a workflow like

  1. Open a PR
  2. Subset of tests pass
  3. PR is "marked for merge" somehow
  4. Additional longer CI checks run and if pass PR is merged, if fail PR is "unmarked for merge"

This might already be supported by github in Merge Queues

There are probably bots like https://github.com/rust-lang/homu that could automate something like this

pros:

  • more automation / less human intervention
    cons:
  • More complicated to setup / maintain

Option 3: Your idea here

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions