Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Refactor GitHub Actions based Pulsar CI #14819

Merged

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Mar 23, 2022

Fixes #14401

Motivation

Improve Pulsar CI:

  • Reduce GitHub Action Runner resource consumption of Pulsar PR builds

    • Currently, Pulsar GitHub Actions workflows are consuming the majority of the shared pool of resources allocated for github.com/apache projects
    • Running the GitHub Actions workflows for a single PR to Pulsar consumes about 18-20 hours of GitHub Actions Runner VM time. This is too much.
  • Reduce lead times for Pull Request feedback by speeding up builds

    • Speeds up Pulsar development
    • Improves developer productivity since waiting times are reduced
    • Since PR feedback is faster, developers can be comfortable submitting more granular pull requests.
    • When development cycle is faster, it is easier to keep the pull request queue shorter. This has several benefits since when PRs are handled quickly, there are fewer chances for pull requests to divert from the master branch. It also reduces merge conflicts and the time wasted in resolving merge conflicts.
  • Better usability and access to test reports

    • Less time is spent in looking for the reason why a build failed

Demonstration

Modifications

  • The design goal has been to keep the build content as the same as before the refactoring. The same tests are run, but in more effective ways. This refactoring doesn't make changes to the way how test retries are handled.

  • Combine most of the Pulsar CI workflows into a single workflow called "Pulsar CI"

    • The workflows that benefit of the aggregation have been chosen.
    • the modifications reuse binary artifacts in the workflow and this reduces the resource consumption.
      • Pulsar core modules jar files are built once and reused.
      • Pulsar docker images are built once and reused
      • GitHub Actions Artifacts is used to share the files within the workflow. The capacity of GitHub Actions Artifacts depend on the GitHub account or organization type. Paid plans have more capacity.
  • Integration tests are categorized into "integration tests" and "system tests"

    • A slimmer docker image apachepulsar/java-test-image:latest is used to run the integration tests that don't depend on Pulsar Python client, Tiered storage drivers, Pulsar SQL or Pulsar Connectors.
    • The previous apachepulsar/pulsar-test-latest-version:latest image is used to run the integration tests that are categorized as "system tests".
    • The benefit of this split is that the java-test-image builds in about 6 minutes and can start the downstream integration test jobs after this. This results in faster developer feedback.
  • For debugging builds, there's configuration for exposing ssh shell access to each Build VM to the user who triggered the build ("github actor"). The ssh access is authenticated with the SSH key that the user has registered in GitHub.

    • ssh access is only active in own forks. It is not enabled in apache/pulsar because of security concerns.
    • A developer can open a PR to their own fork (for example with a single command with GH cli gh pr create --repo=githubusername/pulsar --base master --head "$(git branch --show-current)" -f) to run the build with ssh access enabled.
    • ssh access is active for the duration of the build. If the build fails, the build waits 5 minutes for a developer to connect to investigate the problem. (this behavior is not enabled in apache/pulsar)

The SSH shell access feature will make it easier to debug CI issues which don't get resolved with the information in the GitHub Actions UI. This is an important capability to have available whenever there are problems. As described above, the configuration requires to run the build in a developer's personal fork of the pulsar repository to activate the feature.

  • Fix broken configuration in .github/actions/tune-runner-vm/action.yml which was broken with PR [CI] Fix Tune Runner VM memory.swappiness no such file or directory #13252.

    • The makes Linux kernel's vm swappiness setting effectively 1 for all cgroups.
    • Helps prevent swapping when the VM is running low on memory.
  • Improve test reporting by the use of https://github.com/dorny/test-reporter . The test reports get attached to the wrong workflow because of a GitHub Actions limitation. That reduces the usability since the test reports are harder to find. test-reporter renders the Junit XML files to the GitHub Actions UI.

  • Improve test reporting by adding warning annotations about the test statistics.

    • not really warnings, but GitHub Actions doesn't seem to allow info annotations from shell scripts.
  • Use GitHub Action built-in feature to cancel duplicate build jobs:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
  • a new push to a PR will trigger a new job and this feature will be used to cancel the previous build which is obsolete
  • this solution might be more effective than the current solution to cancel duplicate jobs

Additional Context

The work in this PR was mainly done last year while working on a proof-of-concept of the GitHub Actions refactoring.
There's a Google document [Discuss] PIP Changes to GitHub Actions based Pulsar CI which describes details about some technical solutions. There's also an email thread on the dev mailing list.

The showstopper a year ago was the lack of being able to re-run a single failed job in a larger workflow.
GitHub has since then delivered this feature and no showstoppers are present.

@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

Since the workflow changes, the required checks change too. The .asf.yaml file is used to update the required checks. The changes are included in this PR, but it might require merging that file separately so that this PR could be merged.

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @lhotari!

IIUC now the pulsarbot is not needed anymore and you have to enter the workflow run and re-run the single job, is it ?

build/pulsar_ci_tool.sh Outdated Show resolved Hide resolved
pulsar-io/docs/pom.xml Show resolved Hide resolved
@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

IIUC now the pulsarbot is not needed anymore and you have to enter the workflow run and re-run the single job, is it ?

@nicoloboschi It's needed. This doesn't merge all the workflows to one. It's the majority of the workflows that are merged.
The bot is needed so that non-committers can restart failed jobs. It might require a tweak to pulsarbot so that it uses the GitHub Actions API to start just the failed jobs in a workflow and not the complete workflow and all jobs in it.

@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

/pulsarbot rerun-failure-checks

@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

GitHub has been unavailable at times and GitHub Actions has been very slow. That has impacted the current run. https://www.githubstatus.com/

I'm trying to test "/pulsarbot rerun-failure-checks", but GitHub Actions isn't handling comment events at the moment, last issue_comment event is 2 hours ago: https://github.com/apache/pulsar/actions?query=event%3Aissue_comment

@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

/pulsarbot rerun-failure-checks

@lhotari
Copy link
Member Author

lhotari commented Mar 23, 2022

I have created a separate PR to make pulsarbot support the new GitHub Actions feature of rerunning failed jobs (instead of all jobs in a workflow): apache/pulsar-test-infra#27 /cc @nicoloboschi

@lhotari
Copy link
Member Author

lhotari commented Mar 24, 2022

/pulsarbot rerun-failure-checks

@lhotari lhotari closed this Mar 25, 2022
@lhotari lhotari reopened this Mar 25, 2022
@lhotari lhotari marked this pull request as draft March 25, 2022 14:54
@lhotari lhotari force-pushed the lh-github-actions-workflow-refactoring-2022 branch 2 times, most recently from f31fda3 to 0fdf6e6 Compare March 28, 2022 19:46
lhotari added a commit to lhotari/pulsar that referenced this pull request Mar 30, 2022
…sar CI workflow

- required checks must be changed before merging apache#14819
lhotari added a commit that referenced this pull request Mar 30, 2022
…sar CI workflow (#14939)

- required checks must be changed before merging #14819
@lhotari lhotari force-pushed the lh-github-actions-workflow-refactoring-2022 branch from 3a1ea2c to e09f155 Compare March 30, 2022 06:04
lhotari added 2 commits March 30, 2022 09:39
- "-Dmaven.test.failure.ignore=true -DtestRetryCount=0" was added to ignore test failures
  and disable test retries while developing the build
@lhotari
Copy link
Member Author

lhotari commented Mar 30, 2022

I made one more change: I increased the GitHub Actions Artifact retention period to 3 days for the intermediate build artifacts which are retained if a build job fails. The intermediate build artifacts are the Pulsar maven repository artifacts for the current build and the 2 different docker images used for integration and system tests. These consume about 2.5GB of disk space in total.

It's possible to rerun individual failed jobs when the build artifacts are available. If the artifacts have already been expired, the complete workflow can be rerun by closing and reopening the PR or by rebasing the PR.

The reason to minimize the retention period is cost. It costs $0.25 per GB for a total month of retention. The retention of 2.5GB for 3 days costs about 2.5 * $0.25 / 30 * 3= $0.0625 . That's 6 cents. I would assume that Apache has a certain amount of donated credits from GitHub to run GitHub Actions.

For personal accounts, there is 500MB free monthly credit. This means that you can keep 500MB stored for a month. The storage cost is calculated hourly and it means that you could store 500MB*30 for 1 day if you wish. After the free credit is consumed, GitHub Actions will be disabled until the next billing period starts or the user purchases more credits.

For GitHub Actions, the log file retention consumes the same storage quota. The usage of GitHub Actions Artifacts doesn't introduce anything new in that sense.

@lhotari
Copy link
Member Author

lhotari commented Mar 30, 2022

The reason to use 3 days for retention is convenience. If your build job fails on Friday, you can still retry it on Monday. Retries are currently relevant in Pulsar builds since there's a high number of flaky tests and retries are used so that flaky tests don't block progress. When a developer encounters a flaky test, it should always be checked whether it has already been reported as a GitHub issue. If not, the developer should report it.

lhotari added 3 commits March 30, 2022 11:13
- logs consume GitHub Actions storage quota
- logs will be available in uploaded surefire reports if the job fails
- java-test-image doesn't contain the Python client
@lhotari
Copy link
Member Author

lhotari commented Mar 30, 2022

There was a few last minute changes which required me to change the .asf.yaml once again. That's in #14944 .

I had forgotten -Dmaven.test.failure.ignore=true -DtestRetryCount=0 as part of MAVEN_OPTS which made all test jobs pass always. I used that during the development of the workflow to speed up testing since flaky tests would cause retries and delay testing. After fixing that issue, I noticed that the Pulsar Functions integration tests and Schema integration tests require the Pulsar python client which is not available in the "CI - Integration - *" tests in the new build since it uses a Java only docker image which is faster to build. The "CI - System - *" build jobs use the ordinary pulsar-test-latest-version-image which has been used before.
The reason to run some integration tests with the Java only docker image is the benefit of getting earlier feedback. Building the pulsar-test-latest-version-image takes over 25 minutes.

lhotari added 3 commits March 30, 2022 11:49
- it's not possible to merge the PR without required checks passing
This reverts commit 5f44f90.

Sql integration tests fail when logging to stdout is disabled.
Perhaps due to some race condition.
Re-enable logging.
- there might be a deadlock in an integration test too
@lhotari
Copy link
Member Author

lhotari commented Mar 30, 2022

/pulsarbot run-failure-checks

@lhotari
Copy link
Member Author

lhotari commented Mar 30, 2022

Merging the PR was blocked by #14951 . The Pulsar SQL tests stopped passing after a Java upgrade to 11.0.14.1 . To unblock CI, I have disabled Pulsar SQL integration tests in this PR. That's a temporary workaround until #14951 is resolved.

Pulsar SQL doesn't work with Java 11.0.14.1 version. Ubuntu got the 11.0.14.1 update today and that's why it happened to break just now.

@lhotari lhotari merged commit 8221770 into apache:master Mar 30, 2022
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
…sar CI workflow (apache#14939)

- required checks must be changed before merging apache#14819
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
* Combine multiple GitHub Actions workflows into a smaller amount of workflows
* disable Sql integration tests until apache#14951 has been resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ci doc-not-needed Your PR changes do not impact docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up CI workflows
7 participants