General Retrospective for September and October 2024 Releases #54

Open
adamfarley opened this issue Aug 1, 2024 · 22 comments

Comments

@adamfarley
Contributor

Summary

A retrospective for all efforts surrounding the titular releases.

All community members are welcome to contribute to the agenda via comments below.

This will be a virtual meeting after the release, with at least a week of notice in the #release Slack channel.

On the day of the meeting we'll review the agenda and add a list of actions at the end.

Invited: Everyone.

Time, Date, and URL

Time:
Date:
URL:

Details

Retrospective Owner Tasks (in order):

  • Post retro URL in #Release around the start of the new release.
  • Wait until most builds are released, with no signs of a respin.
  • Announce the retrospective's date + time on #Release a week in advance.
  • Host the retrospective:
    • Go through the agenda.
    • Create a list of actions.
  • Process each action:
    • Create a "WIP" issue including the source comment.
    • Add the issue to the current iteration.
    • Add an issue link to the action list.
  • Create a new retrospective issue for the next release.
  • Set a calendar reminder so you remember to do step 1 before the next release.
  • Close this issue.

TLDR

Add proposed agenda items as comments below.

@andrew-m-leonard
Contributor

The build repo release branches don't have mandatory PR review enabled, probably because the branch settings regex does not match them...?

@andrew-m-leonard
Contributor

The build repo code freeze check was not enabled for the release branch, but then I thought: do we really need it, especially if we get the mandatory review on the release branch fixed?

@andrew-m-leonard
Contributor

andrew-m-leonard commented Sep 10, 2024

Currently the dry-run tags are the tag previous to the suspected actual GA tag, since it's not easy to "reset" the auto-trigger. Maybe we ought to fix that...?

FYI, it's a bit naff, but to do a trigger "reset" (since I had to do one for a failed dry-run trigger), run the following as a Jenkins "Admin" in the script console:

println "rm /home/jenkins/workspace/build-scripts/utils/releaseTrigger_jdk23/workspace/tracking".execute().text

@andrew-m-leonard
Contributor

getTestDependency was failing on temurin-compliance due to missing authentication: adoptium/aqa-tests#5589
This was failing in the July release as well, but failure of this stage does not fail the job, which means we use the workspace cache, if we have one, and whatever may be in there!
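
A minimal sketch of one way to surface this, assuming a scripted Jenkins pipeline stage wraps the dependency download (the stage name and the get.sh call are illustrative, not the real aqa-tests code):

// Surface a dependency-fetch failure instead of silently falling back to the workspace cache.
stage('getTestDependency') {
    try {
        sh 'bash ./get.sh --dependencies'   // hypothetical fetch step
    } catch (err) {
        // Mark the build UNSTABLE so triage notices the stale-cache risk.
        unstable("Dependency download failed, falling back to workspace cache: ${err}")
    }
}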

@smlambert
Contributor

re: #54 (comment)

This was failing in the July release as well, but failure of this stage does not fail the job, which means we use the workspace cache, if we have one

I do not think there is anything in the dependencies list that gets used by the TC jobs (it could affect things if we are using a TC Grinder to verify AQAvit tests, though most dependencies do not change often, so cached versions are fine).

@andrew-m-leonard
Contributor

andrew-m-leonard commented Sep 12, 2024

TRSS needs new JDK versions added before release week; release-openjdk23-pipeline was missing.

SL/Sept12 - now added

@andrew-m-leonard
Contributor

We should be more accurate with our release process terminology:
"Publish updates to the containers to dockerhub"
should be:
"Publish docker images to dockerhub"

@sophia-guo
Contributor

When doing the triage, the TAP files of the Grinder should be attached to the triage issue, for example adoptium/aqa-tests#5598. That way the job https://ci.adoptium.net/view/Test_grinder/job/TAP_Collection can collect the TAP files of both the pipeline jobs and the Grinders.

@sophia-guo
Contributor

For TRSS, if the rerun job passes, the corresponding test job status should be set to pass, so there is no need to do the extra triage. For example, in https://trss.adoptium.net/resultSummary?parentId=66e2f744d24e1b006e88e097 the aarch64_mac extended.openjdk rerun passed, so extended.openjdk should be set to success.
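
The intended roll-up rule, sketched in Groovy purely for illustration (TRSS itself is a Node.js application, so this is not real TRSS code, and the status strings are assumptions):

// If the automatic rerun of the failed targets passed, treat the test job as passed.
def effectiveStatus(String testJobResult, String rerunResult) {
    (testJobResult == 'FAILURE' && rerunResult == 'SUCCESS') ? 'SUCCESS' : testJobResult
}
assert effectiveStatus('FAILURE', 'SUCCESS') == 'SUCCESS'   // rerun rescued the job, no triage needed
assert effectiveStatus('FAILURE', 'FAILURE') == 'FAILURE'   // still needs triage
assert effectiveStatus('SUCCESS', null) == 'SUCCESS'        // nothing to do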

@sophia-guo
Contributor

For AQA triage, use the auto-generated rerun links of the rerun test job, which already have either the failed test targets or the failed test cases pre-populated, e.g. https://ci.adoptium.net/job/Test_openjdk23_hs_extended.openjdk_x86-64_windows_rerun/19/

@smlambert
Contributor

For TRSS, if the rerun job passes, the corresponding test job status should be set to pass, so there is no need to do the extra triage. For example, in https://trss.adoptium.net/resultSummary?parentId=66e2f744d24e1b006e88e097 the aarch64_mac extended.openjdk rerun passed, so extended.openjdk should be set to success.

Quick check to make when triaging: look at the rerun .tap file on the Jenkins job; if it's green, there is nothing to do.

We should also have a different chiclet icon for this "state" where the rerun job passes. I suggest a yellow chiclet with a small green circle in the top right corner for that state, and so forth. Related issue: adoptium/aqa-test-tools#912

@sophia-guo
Contributor

Almost no test jobs were triggered by openjdk**-pipeline or evaluation-openjdk**-pipeline during the September release (i.e. EA builds triggered nightly or weekly), because we set a window of roughly 10 days before and 5 days after the release in which no nightly test jobs run: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/trigger_beta_build.groovy#L53-L79. That might be fine for the January, March, July and September releases, but may not be good for the October and April releases.

Due to the scheduling of releases in September and October, as well as in March and April, there is a potential overlap that could result in gaps in testing. Specifically, with releases in March and September followed closely by April and October, there may be minimal time available for comprehensive testing between those consecutive releases. As a result, critical tests may be rushed or omitted, impacting the stability of those releases. For example, the reproducible-build comparison tests on Linux were updated on Sep 6th, and after that the test was only run once, with jdk24, by Oct 2.
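
A simplified illustration of that quiet-window check (the real logic is in trigger_beta_build.groovy linked above; the method name and GA date here are examples only):

import java.time.LocalDate
import java.time.temporal.ChronoUnit

// True if 'today' falls inside the no-nightly-tests window around the GA date.
boolean inNoNightlyTestWindow(LocalDate today, LocalDate gaDate, int daysBefore = 10, int daysAfter = 5) {
    long offset = ChronoUnit.DAYS.between(gaDate, today)   // negative before GA, positive after
    return offset >= -daysBefore && offset <= daysAfter
}

def ga = LocalDate.of(2024, 10, 15)                            // example GA date only
assert inNoNightlyTestWindow(LocalDate.of(2024, 10, 10), ga)   // nightly tests suppressed
assert !inNoNightlyTestWindow(LocalDate.of(2024, 9, 20), ga)   // nightly tests still run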

@andrew-m-leonard
Contributor

Almost no test jobs were triggered by openjdk**-pipeline or evaluation-openjdk**-pipeline during the September release (i.e. EA builds triggered nightly or weekly), because we set a window of roughly 10 days before and 5 days after the release in which no nightly test jobs run: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/trigger_beta_build.groovy#L53-L79. That might be fine for the January, March, July and September releases, but may not be good for the October and April releases.

Due to the scheduling of releases in September and October, as well as in March and April, there is a potential overlap that could result in gaps in testing. Specifically, with releases in March and September followed closely by April and October, there may be minimal time available for comprehensive testing between those consecutive releases. As a result, critical tests may be rushed or omitted, impacting the stability of those releases. For example, the reproducible-build comparison tests on Linux were updated on Sep 6th, and after that the test was only run once, with jdk24, by Oct 2.

To add some extra info: for example, the jdk-21.0.5+7 and +8 EA builds both landed during the Sept release "disabled test" period; jdk-21.0.5+6 EA was the last build run with tests prior to the release, and jdk-21.0.5+9 the first after:
[screenshot of the jdk-21.0.5 EA pipeline runs]

@smlambert
Contributor

October release

  • dynamic agents for x64Linux were unexpectedly in play (because they have the ci.role.test label on them, which should not be the case, and also because they stay around for 1 hour once spun up); see the script-console sketch below.
  • a problematic ppc64le_linux machine needed to be taken offline (CURL_OPENSSL_3 not found); this would benefit from turning on our "automatically take problem machines offline" feature in the test pipelines, to avoid sending more jobs to a problem machine.
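
A script-console sketch for the first point (requires Jenkins admin; it assumes the dynamic agents show up in the configured node list while provisioned):

import jenkins.model.Jenkins

// List agents carrying the ci.role.test label so unexpected dynamic agents can be spotted quickly.
Jenkins.instance.nodes.findAll { node ->
    node.assignedLabels.any { it.name == 'ci.role.test' }
}.each { node ->
    println "${node.nodeName} -> ${node.labelString}"
}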

@andrew-m-leonard
Contributor

andrew-m-leonard commented Oct 21, 2024

October:
Care needs to be taken when publishing binaries to check whether a platform was rebuilt. For example, both jdk17 macAarch64 and jdk17 pLinux were rebuilt, but binaries were still present on the original pipeline, and Mac was initially published from the wrong one.

Can we remove the bad build artifacts when we rebuild?

@andrew-m-leonard
Contributor

October: we forgot to publish JDK11 aarch64 mac even though it had been finished for several days.

@andrew-m-leonard
Contributor

andrew-m-leonard commented Oct 23, 2024

The status-by-platform document #60 is not always being updated...
I think we need to automate this; it's too easy to forget it or to update it wrongly.
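
A rough sketch of the kind of automation that could help, pulling the latest result of each release pipeline from the Jenkins JSON API (the job names and folder path are illustrative; mapping the output onto the status-by-platform document would still be a separate step):

import groovy.json.JsonSlurper

def jenkinsUrl = 'https://ci.adoptium.net'
def pipelines  = ['release-openjdk21-pipeline', 'release-openjdk23-pipeline']   // illustrative list

pipelines.each { job ->
    // Query the last build of each release pipeline and print its result.
    def info = new JsonSlurper().parse(new URL("${jenkinsUrl}/job/build-scripts/job/${job}/lastBuild/api/json"))
    println "${job}: build #${info.number} -> ${info.result ?: 'IN PROGRESS'}"
}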

@andrew-m-leonard
Contributor

Mistakes were made in selecting publish job links, meaning a platform didn't get published when we said it was, due to clicking on WindowsX64 rather than Windowsx32...

@smlambert
Contributor

aarch64 windows was added as a platform for jdk21 and jdk23, but there were several changes required for it to be ready.

This could have happened well ahead of the release period (as per the plan discussed in a past PMC meeting), and it could also have been caught during a dry run, but no dry run was performed. (Were other checklist items not completed? It seemed the release champion was not always present and, in that event, missed the opportunity to communicate that to others and ensure tasks were delegated.)

@andrew-m-leonard
Contributor

andrew-m-leonard commented Oct 25, 2024

We need to invest resources in making the installers publishing a lot better and more automated.
In its current form it mentally scars you!!

@sophia-guo
Contributor

sophia-guo commented Oct 28, 2024

adoptium/aqa-tests#5692 (comment)

Some arm32 jdk8 tests used to work on non-container agents. It seems we don't have those agents any more: https://ci.adoptium.net/label/ci.role.test&&sw.os.linux&&hw.arch.aarch32/. If the tests can only pass on non-container agents we might need to do a vendor exclude, because our Eclipse machine farm has limitations: https://github.com/adoptium/aqa-tests/blob/master/openjdk/excludes/vendors/eclipse/ProblemList_openjdk8.txt
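
If we do go the vendor-exclude route, an entry would presumably follow the usual jtreg ProblemList layout used by that file (test path, linked issue, platforms); the test name and issue number below are placeholders only:

# Hypothetical exclude entry for a test that only fails on arm32 container agents
java/net/SomeAffectedTest.java https://github.com/adoptium/aqa-tests/issues/0000 linux-arm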

@andrew-m-leonard
Contributor

I think this release has demonstrated the necessity of a dry run, and the issue with the "installers" and the new Azure VMs possibly demonstrates the need for a dry-run installers upload as well?
