Add serverless emergency release quality gate pipeline #186833

Merged
merged 3 commits into from
Jun 28, 2024

pheyos
Member

@pheyos pheyos commented Jun 24, 2024

Summary

This PR adds separate quality gate pipelines for the emergency release process.
This gives us the opportunity to run a different set of checks during an emergency release compared to a regular release.

Details

  • Add new emergency quality gate pipeline definitions in .buildkite/pipelines/quality-gates/emergency. These are copies of the regular quality gate pipeline files with the following adjustments:
    • The entry point kibana-serverless-quality-gates-emergency.yml has an adjusted QG_PIPELINE_LOCATION and comment
    • The QA quality gates in pipeline.tests-qa.yaml are reduced to just the CP e2e tests
  • Add a new pipeline definition, .buildkite/pipeline-resource-definitions/kibana-serverless-quality-gates-emergency.yml, that triggers the emergency version of the quality gates.
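
As a rough illustration of the entry-point adjustment described above, here is a minimal sketch of what the emergency entry point might look like. Only the `QG_PIPELINE_LOCATION` variable and the `.buildkite/pipelines/quality-gates/emergency` directory come from this description. The label, the command placeholder, and the regular location mentioned in the comment are assumptions for illustration, not the contents of the actual file.

```yaml
# Hypothetical sketch of kibana-serverless-quality-gates-emergency.yml.
# Only QG_PIPELINE_LOCATION and the emergency directory are taken from the
# PR description; the label and command below are placeholders.
steps:
  - label: ":pipeline: Trigger emergency quality gates"
    env:
      # Assumption: the regular entry point points at the parent directory
      # (.buildkite/pipelines/quality-gates) instead.
      QG_PIPELINE_LOCATION: ".buildkite/pipelines/quality-gates/emergency"
    # Placeholder: whatever command loads the per-environment quality gate
    # files (for example pipeline.tests-qa.yaml) from QG_PIPELINE_LOCATION.
    command: "<command that runs the quality gates from QG_PIPELINE_LOCATION>"
```

Because the emergency files live in their own directory, the set of checks for an emergency release can diverge from the regular release without touching the regular pipeline files.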

Other changes

To keep the definitions around the serverless quality gates and the emergency release consistent, I've also taken the opportunity to move the definitions of the following pipelines from catalog-info.yaml to .buildkite/pipeline-resource-definitions:

  • buildkite-pipeline-kibana-emergency-release -> .buildkite/pipeline-resource-definitions/kibana-serverless-emergency-release.yml
  • kibana-tests-pipeline -> .buildkite/pipeline-resource-definitions/kibana-serverless-quality-gates.yml

@pheyos pheyos added the release_note:skip (Skip the PR/issue when compiling release notes), backport:skip (This commit does not require backporting), and v8.15.0 labels Jun 24, 2024
@pheyos pheyos self-assigned this Jun 24, 2024
@pheyos pheyos requested review from a team as code owners June 24, 2024 16:05
Member

@lukeelmers lukeelmers left a comment

This LGTM. Since the QA pipeline skips manual verification for an emergency release, the added time of running QA first before staging should be minimal, as the RM won't be a bottleneck. If CP e2e test times ever start to creep up, we can reconsider.

My one question is whether we definitely want to include a 24h bake period by default for an emergency release. If it is a true emergency, I would expect us to never wait a full 24h before releasing to noncanary. But I suppose the consistency with the normal release process makes this easier to understand, and gives the RM more latitude to determine how long we want changes to sit in canary. Overall this is a question I wanted to raise -- not a concern -- so I'm fine with proceeding as-is.

@pheyos
Member Author

pheyos commented Jun 27, 2024

Thanks for your feedback @lukeelmers!

My one question is whether we definitely want to include a 24h bake period by default for an emergency release. If it is a true emergency, I would expect us to never wait a full 24h before releasing to noncanary.

I was thinking about the same. And I agree we will probably never use the full 24 hours in an emergency release.

gives the RM more latitude to determine how long we want changes to sit in canary

This was my main reason for keeping a long bake time here. Elasticsearch is running with a 1 hour bake time in production-canary for emergency releases. This allows them to be more hands-off and just let the release roll out if automated checks are passing. However, if some investigation in canary takes more time (which, to be fair, hasn't happened so far), it would require stopping the release and kicking the last stage off again. So I thought we'd start a bit more defensively here, even though this will require one more manual RM step (cancelling the bake time early).
Ultimately, I think we'd like to get to a place where this whole process is mostly automated. In such a world, we definitely need a shorter bake time (like 1 hour). So it basically comes down to the question: do we want to start closer to our goal (1 hour bake time) and adjust up if it doesn't work out or do we start with 24 hours bake time and run with it for some time, knowing that we'll bring it down in the future? I'm fine either way as both approaches have working escape hatches if needed.

@lukeelmers
Member

do we want to start closer to our goal (1 hour bake time) and adjust up if it doesn't work out or do we start with 24 hours bake time and run with it for some time, knowing that we'll bring it down in the future?

I agree with your logic of being more defensive and starting with a longer bake time, with the understanding that RMs will need to be (as they already have been) very involved in any emergency release process. We can document short-circuiting the bake time as the expected course of action, with the exact bake period determined by the RM based on their judgment of the situation.

We can revisit later and shorten the bake period once we get more comfortable with the new process.
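
To make the outcome of this thread concrete, here is a minimal sketch of how a 24 hour bake step with an RM-controlled early exit could be expressed in a Buildkite pipeline. It is not taken from the PR's files: the label, the use of `sleep`, and the timeout value are assumptions, and whether a manually cancelled job is treated as a soft failure should be verified against Buildkite's behavior before relying on it.

```yaml
# Hypothetical bake-time step; not copied from the actual pipeline files.
steps:
  - label: ":hourglass: 24h bake time in production-canary (cancel to continue early)"
    # 24 hours = 86400 seconds. The RM shortens the bake by cancelling this job.
    command: "sleep 86400"
    # Slightly longer than the sleep so the timeout never fires first.
    timeout_in_minutes: 1460
    # Assumption: a manually stopped job counts as a soft failure, so the
    # build continues with the next stage instead of failing.
    soft_fail: true
```

Moving to the shorter, mostly automated process mentioned above would then be a matter of lowering the sleep duration (for example to 3600 seconds for a 1 hour bake) and the matching timeout.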

@pheyos
Member Author

pheyos commented Jun 28, 2024

@elasticmachine merge upstream

@pheyos
Member Author

pheyos commented Jun 28, 2024

@elasticmachine merge upstream

@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Metrics [docs]

✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @pheyos

@pheyos pheyos merged commit cbedb5f into elastic:main Jun 28, 2024
21 checks passed
@pheyos pheyos deleted the add_serverless_emergency_quality_gates branch June 28, 2024 11:04
jbudz added a commit that referenced this pull request Jun 28, 2024
@pheyos pheyos restored the add_serverless_emergency_quality_gates branch July 1, 2024 11:05
pheyos added a commit that referenced this pull request Jul 2, 2024
## Summary

This PR adds separate quality gate pipelines for the emergency release
process.

More details in the original PR #186833, which is split into the
creation of the new pipeline (this PR) and moving existing pipelines
from `catalog-info.yaml` to `.buildkite/pipeline-resource-definitions`
(#187253).
pheyos added a commit that referenced this pull request Jul 5, 2024
## Summary

This PR moves the definitions of the following pipelines from
`catalog-info.yaml` to `.buildkite/pipeline-resource-definitions`:
- `buildkite-pipeline-kibana-emergency-release` ->
`.buildkite/pipeline-resource-definitions/kibana-serverless-emergency-release.yml`
- `kibana-tests-pipeline` ->
`.buildkite/pipeline-resource-definitions/kibana-serverless-quality-gates.yml`

More details in the original PR #186833, which is split into the
creation of the new pipeline (#187251) and moving existing pipelines
from catalog-info.yaml to .buildkite/pipeline-resource-definitions (this
PR).