[CI] Refactor GitHub Actions based Pulsar CI #14819
Since the workflow changes, the required checks change too. The |
Awesome work @lhotari!
IIUC, pulsarbot is no longer needed now, and you have to open the workflow run and re-run the single job. Is that right?
@nicoloboschi It's still needed. This PR doesn't merge all the workflows into one; only the majority of the workflows are merged.
/pulsarbot rerun-failure-checks
GitHub has been unavailable at times and GitHub Actions has been very slow, which has impacted the current run: https://www.githubstatus.com/ I'm trying to test "/pulsarbot rerun-failure-checks", but GitHub Actions isn't handling comment events at the moment; the last issue_comment event was 2 hours ago: https://github.com/apache/pulsar/actions?query=event%3Aissue_comment
/pulsarbot rerun-failure-checks
I have created a separate PR to make pulsarbot support the new GitHub Actions feature of rerunning failed jobs (instead of all jobs in a workflow): apache/pulsar-test-infra#27 /cc @nicoloboschi
/pulsarbot rerun-failure-checks
f31fda3 to 0fdf6e6
…sar CI workflow - required checks must be changed before merging apache#14819
- might exist when re-running
3a1ea2c to e09f155
- "-Dmaven.test.failure.ignore=true -DtestRetryCount=0" was added to ignore test failures and disable test retries while developing the build
I made one more change: I increased the GitHub Actions Artifact retention period to 3 days for the intermediate build artifacts which are retained if a build job fails. The intermediate build artifacts are the Pulsar maven repository artifacts for the current build and the 2 different docker images used for integration and system tests. These consume about 2.5GB of disk space in total.
It's possible to rerun individual failed jobs while the build artifacts are available. If the artifacts have already expired, the complete workflow can be rerun by closing and reopening the PR or by rebasing the PR.
The reason to minimize the retention period is cost. Storage costs $0.25 per GB for a full month of retention, so retaining 2.5GB for 3 days costs about 2.5 * $0.25 / 30 * 3 = $0.0625. That's 6 cents. I would assume that Apache has a certain amount of donated credits from GitHub to run GitHub Actions.
For personal accounts, there is a 500MB free monthly credit. This means that you can keep 500MB stored for a month. The storage cost is calculated hourly, which means that you could store 500MB*30 for 1 day if you wish. After the free credit is consumed, GitHub Actions will be disabled until the next billing period starts or the user purchases more credits.
For GitHub Actions, the log file retention consumes the same storage quota. The usage of GitHub Actions Artifacts doesn't introduce anything new in that sense.
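The storage arithmetic in this comment can be written out as a worked calculation. This only restates the numbers quoted above, assuming a 30-day billing month (as the division by 30 implies); the labels "cost" and "free quota" are introduced here for illustration.

```latex
\begin{align*}
\text{cost} &= 2.5\,\text{GB} \times \frac{\$0.25}{\text{GB}\cdot\text{month}}
               \times \frac{3\,\text{days}}{30\,\text{days/month}}
             = \$0.0625 \approx 6\text{ cents} \\
\text{free quota} &= 0.5\,\text{GB} \times 30\,\text{days}
             = 15\,\text{GB}\cdot\text{days}
\end{align*}
```

By the same accounting, the 500MB monthly credit is equivalent to 15GB stored for a single day, which is the 500MB*30 figure mentioned in the comment.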
The reason to use 3 days for retention is convenience. If your build job fails on Friday, you can still retry it on Monday. Retries are currently relevant in Pulsar builds since there's a high number of flaky tests and retries are used so that flaky tests don't block progress. When a developer encounters a flaky test, it should always be checked whether it has already been reported as a GitHub issue. If not, the developer should report it.
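For readers unfamiliar with how a retention period is set, here is a minimal sketch of a workflow step using the `retention-days` input of `actions/upload-artifact`. The step name, artifact name, path, and action version are hypothetical and are not taken from this PR; the actual workflow may upload and clean up artifacts differently.

```yaml
# Hypothetical step, for illustration only: names, path, and version pin are assumptions.
- name: Upload intermediate build artifacts
  uses: actions/upload-artifact@v4
  with:
    name: example-build-artifacts   # hypothetical artifact name
    path: example-artifacts/        # hypothetical directory
    retention-days: 3               # artifact expires 3 days after upload
```

A shorter value reduces storage cost, while a longer one leaves more time to rerun a failed job before the artifacts expire.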
…le in java-test-image
- logs consume GitHub Actions storage quota - logs will be available in uploaded surefire reports if the job fails
- java-test-image doesn't contain the Python client
There were a few last-minute changes which required me to change the [...] I had forgotten
- it's not possible to merge the PR without required checks passing
This reverts commit 5f44f90. Sql integration tests fail when logging to stdout is disabled. Perhaps due to some race condition. Re-enable logging.
- there might be a deadlock in an integration test too
/pulsarbot run-failure-checks
Merging the PR was blocked by #14951. The Pulsar SQL tests stopped passing after a Java upgrade to 11.0.14.1. To unblock CI, I have disabled the Pulsar SQL integration tests in this PR. That's a temporary workaround until #14951 is resolved. Pulsar SQL doesn't work with Java version 11.0.14.1. Ubuntu got the 11.0.14.1 update today, which is why it happened to break just now.
…sar CI workflow (apache#14939) - required checks must be changed before merging apache#14819
* Combine multiple GitHub Actions workflows into a smaller amount of workflows * disable Sql integration tests until apache#14951 has been resolved
Fixes #14401
Motivation
Improve Pulsar CI:
Reduce GitHub Action Runner resource consumption of Pulsar PR builds
Reduce lead times for Pull Request feedback by speeding up builds
Better usability and access to test reports
Demonstration
Modifications
The design goal has been to keep the build content the same as before the refactoring. The same tests are run, but in more efficient ways. This refactoring doesn't change the way test retries are handled.
Combine most of the Pulsar CI workflows into a single workflow called "Pulsar CI"
Integration tests are categorized into "integration tests" and "system tests"
apachepulsar/java-test-image:latest is used to run the integration tests that don't depend on the Pulsar Python client, Tiered storage drivers, Pulsar SQL or Pulsar Connectors.
apachepulsar/pulsar-test-latest-version:latest image is used to run the integration tests that are categorized as "system tests".
For debugging builds, there's configuration for exposing ssh shell access to each Build VM to the user who triggered the build ("github actor"). The ssh access is authenticated with the SSH key that the user has registered in GitHub.
[...] apache/pulsar because of security concerns.
[...] gh pr create --repo=githubusername/pulsar --base master --head "$(git branch --show-current)" -f ) to run the build with ssh access enabled.
[...] apache/pulsar )
The SSH shell access feature will make it easier to debug CI issues which don't get resolved with the information in the GitHub Actions UI. This is an important capability to have available whenever there are problems. As described above, the configuration requires running the build in a developer's personal fork of the pulsar repository to activate the feature.
Fix broken configuration in .github/actions/tune-runner-vm/action.yml which was broken by PR [CI] Fix Tune Runner VM memory.swappiness no such file or directory #13252. [...] 1 for all cgroups.
Improve test reporting by the use of https://github.com/dorny/test-reporter. test-reporter renders the JUnit XML files in the GitHub Actions UI. The test reports get attached to the wrong workflow because of a GitHub Actions limitation. That reduces the usability since the test reports are harder to find.
Improve test reporting by adding warning annotations about the test statistics.
Use the GitHub Actions built-in feature to cancel duplicate build jobs:
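The configuration that originally followed this item is not preserved in this copy of the description. As a minimal sketch, GitHub Actions offers a workflow-level `concurrency` setting for this purpose; the group expression below is an assumption chosen for illustration and is not necessarily the one used in this PR.

```yaml
# Illustrative only: the group key is an assumption, not copied from the PR.
concurrency:
  # Runs sharing the same group value are treated as duplicates of each other.
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel any in-progress run in the group when a newer run starts.
  cancel-in-progress: true
```

With a group keyed on the workflow and ref, pushing a new commit to a PR branch cancels the still-running build of the previous commit instead of letting both run to completion.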
Additional Context
The work in this PR was mainly done last year while working on a proof-of-concept of the GitHub Actions refactoring.
There's a Google document [Discuss] PIP Changes to GitHub Actions based Pulsar CI which describes details about some technical solutions. There's also an email thread on the dev mailing list.
The showstopper a year ago was the inability to re-run a single failed job in a larger workflow.
GitHub has since delivered this feature, and no showstoppers remain.