[CI] Refactor GitHub Actions based Pulsar CI #14819

lhotari · 2022-03-23T10:39:14Z

Fixes #14401

Motivation

Improve Pulsar CI:

Reduce GitHub Action Runner resource consumption of Pulsar PR builds
- Currently, Pulsar GitHub Actions workflows are consuming the majority of the shared pool of resources allocated for github.com/apache projects
- Running the GitHub Actions workflows for a single PR to Pulsar consumes about 18-20 hours of GitHub Actions Runner VM time. This is too much.
Reduce lead times for Pull Request feedback by speeding up builds
- Speeds up Pulsar development
- Improves developer productivity since waiting times are reduced
- Since PR feedback is faster, developers can be comfortable submitting more granular pull requests.
- When development cycle is faster, it is easier to keep the pull request queue shorter. This has several benefits since when PRs are handled quickly, there are fewer chances for pull requests to divert from the master branch. It also reduces merge conflicts and the time wasted in resolving merge conflicts.
Better usability and access to test reports
- Less time is spent in looking for the reason why a build failed

Demonstration

https://github.com/apache/pulsar/actions/runs/2055350530 (for the current run in this PR)

Modifications

The design goal has been to keep the build content as the same as before the refactoring. The same tests are run, but in more effective ways. This refactoring doesn't make changes to the way how test retries are handled.
Combine most of the Pulsar CI workflows into a single workflow called "Pulsar CI"
- The workflows that benefit of the aggregation have been chosen.
- the modifications reuse binary artifacts in the workflow and this reduces the resource consumption.
  - Pulsar core modules jar files are built once and reused.
  - Pulsar docker images are built once and reused
  - GitHub Actions Artifacts is used to share the files within the workflow. The capacity of GitHub Actions Artifacts depend on the GitHub account or organization type. Paid plans have more capacity.
Integration tests are categorized into "integration tests" and "system tests"
- A slimmer docker image apachepulsar/java-test-image:latest is used to run the integration tests that don't depend on Pulsar Python client, Tiered storage drivers, Pulsar SQL or Pulsar Connectors.
- The previous apachepulsar/pulsar-test-latest-version:latest image is used to run the integration tests that are categorized as "system tests".
- The benefit of this split is that the java-test-image builds in about 6 minutes and can start the downstream integration test jobs after this. This results in faster developer feedback.
For debugging builds, there's configuration for exposing ssh shell access to each Build VM to the user who triggered the build ("github actor"). The ssh access is authenticated with the SSH key that the user has registered in GitHub.
- ssh access is only active in own forks. It is not enabled in apache/pulsar because of security concerns.
- A developer can open a PR to their own fork (for example with a single command with GH cli gh pr create --repo=githubusername/pulsar --base master --head "$(git branch --show-current)" -f) to run the build with ssh access enabled.
- ssh access is active for the duration of the build. If the build fails, the build waits 5 minutes for a developer to connect to investigate the problem. (this behavior is not enabled in apache/pulsar)

The SSH shell access feature will make it easier to debug CI issues which don't get resolved with the information in the GitHub Actions UI. This is an important capability to have available whenever there are problems. As described above, the configuration requires to run the build in a developer's personal fork of the pulsar repository to activate the feature.

Fix broken configuration in .github/actions/tune-runner-vm/action.yml which was broken with PR [CI] Fix Tune Runner VM memory.swappiness no such file or directory #13252.
- The makes Linux kernel's vm swappiness setting effectively 1 for all cgroups.
- Helps prevent swapping when the VM is running low on memory.
Improve test reporting by the use of https://github.com/dorny/test-reporter . The test reports get attached to the wrong workflow because of a GitHub Actions limitation. That reduces the usability since the test reports are harder to find. test-reporter renders the Junit XML files to the GitHub Actions UI.
Improve test reporting by adding warning annotations about the test statistics.
- not really warnings, but GitHub Actions doesn't seem to allow info annotations from shell scripts.
Use GitHub Action built-in feature to cancel duplicate build jobs:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

a new push to a PR will trigger a new job and this feature will be used to cancel the previous build which is obsolete
this solution might be more effective than the current solution to cancel duplicate jobs

Additional Context

The work in this PR was mainly done last year while working on a proof-of-concept of the GitHub Actions refactoring.
There's a Google document [Discuss] PIP Changes to GitHub Actions based Pulsar CI which describes details about some technical solutions. There's also an email thread on the dev mailing list.

The showstopper a year ago was the lack of being able to re-run a single failed job in a larger workflow.
GitHub has since then delivered this feature and no showstoppers are present.

lhotari · 2022-03-23T10:52:14Z

Since the workflow changes, the required checks change too. The .asf.yaml file is used to update the required checks. The changes are included in this PR, but it might require merging that file separately so that this PR could be merged.

nicoloboschi

Awesome work @lhotari!

IIUC now the pulsarbot is not needed anymore and you have to enter the workflow run and re-run the single job, is it ?

build/pulsar_ci_tool.sh

pulsar-io/docs/pom.xml

lhotari · 2022-03-23T10:56:03Z

IIUC now the pulsarbot is not needed anymore and you have to enter the workflow run and re-run the single job, is it ?

@nicoloboschi It's needed. This doesn't merge all the workflows to one. It's the majority of the workflows that are merged.
The bot is needed so that non-committers can restart failed jobs. It might require a tweak to pulsarbot so that it uses the GitHub Actions API to start just the failed jobs in a workflow and not the complete workflow and all jobs in it.

lhotari · 2022-03-23T16:09:54Z

/pulsarbot rerun-failure-checks

lhotari · 2022-03-23T16:12:51Z

GitHub has been unavailable at times and GitHub Actions has been very slow. That has impacted the current run. https://www.githubstatus.com/

I'm trying to test "/pulsarbot rerun-failure-checks", but GitHub Actions isn't handling comment events at the moment, last issue_comment event is 2 hours ago: https://github.com/apache/pulsar/actions?query=event%3Aissue_comment

lhotari · 2022-03-23T16:49:24Z

/pulsarbot rerun-failure-checks

lhotari · 2022-03-23T17:28:26Z

I have created a separate PR to make pulsarbot support the new GitHub Actions feature of rerunning failed jobs (instead of all jobs in a workflow): apache/pulsar-test-infra#27 /cc @nicoloboschi

lhotari · 2022-03-24T17:28:30Z

/pulsarbot rerun-failure-checks

…sar CI workflow - required checks must be changed before merging apache#14819

…sar CI workflow (#14939) - required checks must be changed before merging #14819

…of workflows

- might exist when re-running

…jobs fail

- "-Dmaven.test.failure.ignore=true -DtestRetryCount=0" was added to ignore test failures and disable test retries while developing the build

lhotari · 2022-03-30T07:16:02Z

I made one more change: I increased the GitHub Actions Artifact retention period to 3 days for the intermediate build artifacts which are retained if a build job fails. The intermediate build artifacts are the Pulsar maven repository artifacts for the current build and the 2 different docker images used for integration and system tests. These consume about 2.5GB of disk space in total.

It's possible to rerun individual failed jobs when the build artifacts are available. If the artifacts have already been expired, the complete workflow can be rerun by closing and reopening the PR or by rebasing the PR.

The reason to minimize the retention period is cost. It costs $0.25 per GB for a total month of retention. The retention of 2.5GB for 3 days costs about 2.5 * $0.25 / 30 * 3= $0.0625 . That's 6 cents. I would assume that Apache has a certain amount of donated credits from GitHub to run GitHub Actions.

For personal accounts, there is 500MB free monthly credit. This means that you can keep 500MB stored for a month. The storage cost is calculated hourly and it means that you could store 500MB*30 for 1 day if you wish. After the free credit is consumed, GitHub Actions will be disabled until the next billing period starts or the user purchases more credits.

For GitHub Actions, the log file retention consumes the same storage quota. The usage of GitHub Actions Artifacts doesn't introduce anything new in that sense.

lhotari · 2022-03-30T07:21:28Z

The reason to use 3 days for retention is convenience. If your build job fails on Friday, you can still retry it on Monday. Retries are currently relevant in Pulsar builds since there's a high number of flaky tests and retries are used so that flaky tests don't block progress. When a developer encounters a flaky test, it should always be checked whether it has already been reported as a GitHub issue. If not, the developer should report it.

…le in java-test-image

- logs consume GitHub Actions storage quota - logs will be available in uploaded surefire reports if the job fails

- java-test-image doesn't contain the Python client

lhotari · 2022-03-30T08:35:12Z

There was a few last minute changes which required me to change the .asf.yaml once again. That's in #14944 .

I had forgotten -Dmaven.test.failure.ignore=true -DtestRetryCount=0 as part of MAVEN_OPTS which made all test jobs pass always. I used that during the development of the workflow to speed up testing since flaky tests would cause retries and delay testing. After fixing that issue, I noticed that the Pulsar Functions integration tests and Schema integration tests require the Pulsar python client which is not available in the "CI - Integration - *" tests in the new build since it uses a Java only docker image which is faster to build. The "CI - System - *" build jobs use the ordinary pulsar-test-latest-version-image which has been used before.
The reason to run some integration tests with the Java only docker image is the benefit of getting earlier feedback. Building the pulsar-test-latest-version-image takes over 25 minutes.

- it's not possible to merge the PR without required checks passing

This reverts commit 5f44f90. Sql integration tests fail when logging to stdout is disabled. Perhaps due to some race condition. Re-enable logging.

- there might be a deadlock in an integration test too

lhotari · 2022-03-30T13:00:15Z

/pulsarbot run-failure-checks

lhotari · 2022-03-30T14:19:58Z

Merging the PR was blocked by #14951 . The Pulsar SQL tests stopped passing after a Java upgrade to 11.0.14.1 . To unblock CI, I have disabled Pulsar SQL integration tests in this PR. That's a temporary workaround until #14951 is resolved.

Pulsar SQL doesn't work with Java 11.0.14.1 version. Ubuntu got the 11.0.14.1 update today and that's why it happened to break just now.

…sar CI workflow (apache#14939) - required checks must be changed before merging apache#14819

* Combine multiple GitHub Actions workflows into a smaller amount of workflows * disable Sql integration tests until apache#14951 has been resolved

.github/workflows/pulsar-ci.yaml

lhotari added doc-not-needed Your PR changes do not impact docs area/ci labels Mar 23, 2022

lhotari requested review from merlimat, sijie, tuteng, Technoboy-, aahmed-se, dlg99, eolivelli, codelipenghui, hezhangjian, zymap, dave2wave and michaeljmarshall March 23, 2022 10:39

lhotari self-assigned this Mar 23, 2022

lhotari mentioned this pull request Mar 23, 2022

Speed up CI workflows #14401

Closed

nicoloboschi approved these changes Mar 23, 2022

View reviewed changes

build/pulsar_ci_tool.sh Outdated Show resolved Hide resolved

pulsar-io/docs/pom.xml Show resolved Hide resolved

lhotari mentioned this pull request Mar 23, 2022

Use the new GitHub API to re-run failed jobs apache/pulsar-test-infra#27

Merged

lhotari closed this Mar 25, 2022

lhotari reopened this Mar 25, 2022

lhotari marked this pull request as draft March 25, 2022 14:54

lhotari force-pushed the lh-github-actions-workflow-refactoring-2022 branch 2 times, most recently from f31fda3 to 0fdf6e6 Compare March 28, 2022 19:46

lhotari added a commit to lhotari/pulsar that referenced this pull request Mar 30, 2022

[CI] Change the required checks to match the checks in refactored Pul…

82360a5

…sar CI workflow - required checks must be changed before merging apache#14819

lhotari mentioned this pull request Mar 30, 2022

[CI] Change required checks to match checks in refactored Pulsar CI workflow #14939

Merged

lhotari added a commit that referenced this pull request Mar 30, 2022

[CI] Change the required checks to match the checks in refactored Pul…

5aa3692

…sar CI workflow (#14939) - required checks must be changed before merging #14819

lhotari added 5 commits March 30, 2022 09:02

[CI] Combine multiple GitHub Actions workflows into a smaller amount …

7caacae

…of workflows

Delete artifacts after successful build

978a918

Delete possible previous artifact

9951dbd

- might exist when re-running

Don't upload surefire reports and container logs to artifacts unless …

6a8656d

…jobs fail

Cleanup log artifacts from possible previous failed build

e09f155

lhotari force-pushed the lh-github-actions-workflow-refactoring-2022 branch from 3a1ea2c to e09f155 Compare March 30, 2022 06:04

lhotari added 2 commits March 30, 2022 09:39

Remove build testing options

2b9972e

- "-Dmaven.test.failure.ignore=true -DtestRetryCount=0" was added to ignore test failures and disable test retries while developing the build

Increase build artifact retention to 3 days

a4997bf

lhotari added 3 commits March 30, 2022 11:13

Move Function tests to system tests since Python client isn't availab…

fda6aa2

…le in java-test-image

Disable test output to console in integration tests

5f44f90

- logs consume GitHub Actions storage quota - logs will be available in uploaded surefire reports if the job fails

Move Schema tests to system tests too since Python client is required

5a4acfa

- java-test-image doesn't contain the Python client

lhotari mentioned this pull request Mar 30, 2022

[CI] Change the required checks to match refactored Pulsar CI #14944

Merged

lhotari added 3 commits March 30, 2022 11:49

Add workaround for required checks

de24903

- it's not possible to merge the PR without required checks passing

Revert "Disable test output to console in integration tests"

f5d118f

This reverts commit 5f44f90. Sql integration tests fail when logging to stdout is disabled. Perhaps due to some race condition. Re-enable logging.

Print thread dumps when integration tests get cancelled

041112a

- there might be a deadlock in an integration test too

merlimat approved these changes Mar 30, 2022

View reviewed changes

disable Sql integration tests until apache#14951 has been resolved

a45a278

lhotari merged commit 8221770 into apache:master Mar 30, 2022

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022

[CI] Change the required checks to match the checks in refactored Pul…

eebcd5b

…sar CI workflow (apache#14939) - required checks must be changed before merging apache#14819

tisonkun reviewed Aug 1, 2022

View reviewed changes

.github/workflows/pulsar-ci.yaml Show resolved Hide resolved

YouJiacheng mentioned this pull request Feb 21, 2023

Attempt to reuse build in Action YouJiacheng/IGinX#7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Refactor GitHub Actions based Pulsar CI #14819

[CI] Refactor GitHub Actions based Pulsar CI #14819

lhotari commented Mar 23, 2022 •

edited

Loading

lhotari commented Mar 23, 2022

nicoloboschi left a comment

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 24, 2022

lhotari commented Mar 30, 2022 •

edited

Loading

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022 •

edited

Loading

[CI] Refactor GitHub Actions based Pulsar CI #14819

[CI] Refactor GitHub Actions based Pulsar CI #14819

Conversation

lhotari commented Mar 23, 2022 • edited Loading

Motivation

Demonstration

Modifications

Additional Context

lhotari commented Mar 23, 2022

nicoloboschi left a comment

Choose a reason for hiding this comment

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 23, 2022

lhotari commented Mar 24, 2022

lhotari commented Mar 30, 2022 • edited Loading

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022

lhotari commented Mar 30, 2022 • edited Loading

lhotari commented Mar 23, 2022 •

edited

Loading

lhotari commented Mar 30, 2022 •

edited

Loading

lhotari commented Mar 30, 2022 •

edited

Loading