
create qualification tests for images in staging #144

Closed
listx opened this issue Oct 21, 2019 · 35 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@listx
Contributor

listx commented Oct 21, 2019

As noted in kubernetes/k8s.io#406 (review), the staging image up for promotion should link to evidence that it has been tested (perhaps some more thorough e2e tests exercising various flags, etc.). This could be a Prow postsubmit job. That job could run against the master branch and, for every update to it:

  1. push a new version to the staging repository
  2. run various e2e tests (the same e2e test suite we already have)

Then the logs for this job could be linked to from every promotion PR that promotes the newly-tested version to prod.
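To make the qualification idea a bit more concrete, here is a minimal sketch of the kind of smoke check such a postsubmit could run against the freshly pushed staging image. This is not the existing e2e suite: it assumes Docker is available in the job, that the image reference is passed in via a STAGING_IMAGE environment variable, and the flag names it greps for are placeholders rather than the promoter's real CLI.

```go
// qualify_staging_image_test.go
//
// Illustrative sketch only: assumes Docker is available in the postsubmit
// environment and that STAGING_IMAGE points at the freshly pushed staging
// image (e.g. something under gcr.io/k8s-staging-artifact-promoter). The
// flag names below are placeholders, not the promoter's real CLI.
package main

import (
	"os"
	"os/exec"
	"strings"
	"testing"
)

func TestStagingImageSmoke(t *testing.T) {
	image := os.Getenv("STAGING_IMAGE")
	if image == "" {
		t.Skip("STAGING_IMAGE not set; skipping qualification smoke test")
	}

	// Pull the image the postsubmit just pushed to staging.
	if out, err := exec.Command("docker", "pull", image).CombinedOutput(); err != nil {
		t.Fatalf("pulling %s: %v\n%s", image, err, out)
	}

	// Run the binary inside the image with -h and confirm that the flags
	// our Prow Job configs depend on are still advertised.
	out, err := exec.Command("docker", "run", "--rm", image, "-h").CombinedOutput()
	if err != nil && len(out) == 0 {
		// Many CLIs exit non-zero on -h; only fail if we got no output at all.
		t.Fatalf("running %s -h: %v", image, err)
	}
	for _, f := range []string{"-manifest", "-dry-run"} { // placeholder flag names
		if !strings.Contains(string(out), f) {
			t.Errorf("expected help output to mention %q; got:\n%s", f, out)
		}
	}
}
```

The real job would of course run the full e2e suite; the point is only that some check runs against the staging image before any promotion PR is opened.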

@rajibmitra

I would like to work on the Prow job if it's OK @listx @rikatz

@rajibmitra

/assign rajibmitra

@rikatz

rikatz commented Oct 29, 2019 via email

@listx
Contributor Author

listx commented Oct 29, 2019

Sounds good to me. Please write a design doc of some sort to sketch out your thoughts first before pushing up an implementation. You can then share it with https://groups.google.com/forum/#!newtopic/kubernetes-wg-k8s-infra

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2020
@rajibmitra

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2020
@rikatz

rikatz commented Jan 27, 2020

@rajibmitra do you need something (any help, insights) to write that document? :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@listx
Contributor Author

listx commented Jun 30, 2020

/reopen

The work that @yodahekinsew is doing touches on this area, as he is expanding the set of tests run against staging images. See https://docs.google.com/document/d/1D5FEU58j3_V8Fxl98I3vcqNCxzpD9cV9Gtpxf9wGO4I/edit#heading=h.4bti0lhcgu9m.

@k8s-ci-robot k8s-ci-robot reopened this Jun 30, 2020
@k8s-ci-robot
Contributor

@listx: Reopened this issue.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@listx
Contributor Author

listx commented Aug 6, 2021

/reopen
/remove-lifecycle rotten
/lifecycle active

Intro

OK, so we got bitten by the lack of QA tests for staging images in kubernetes/test-infra#23148 (reverted in kubernetes/test-infra#23151). There are several things going on, but let me capture what's happening today. You can think of this comment as a mini-postmortem doc.

Current state of affairs

Step 1: Code changes to CIP repo

We merge PRs into this repo to update the master branch. When that branch moves (a PR is merged to improve the CIP tool and other related things), post-cip-push-image-cip runs (config here) and a new image appears in gcr.io/k8s-staging-artifact-promoter/cip-auditor.

Step 2: Promotion from staging to prod

Later, when the powers that be deem that the image should be promoted to production, we create a PR to "promote the promoter", like this. When the PR is merged, the staging image is promoted to prod (via post-k8sio-image-promo).

Step 3: Update of Prow Job configuration

Finally, we can now also update existing Prow Jobs that rely on the promoter to use this new production version. A PR is created to update the Prow Jobs, bumping the image version used, like this.

The fire/problem

Now consider kubernetes/test-infra#23148. Alas, it needed to be reverted. The problem (captured in the logs) was that the CLI flag handling logic changed between the old version used prior to this PR (version v2.4.1) and the one proposed by this PR (version v20210805-v1.337.0-136-g8f6bfde). (Additionally, @justaugustus reported that we had not yet fixed the --dry-run behavior in this version, but I won't go into that here.) So in summary, even though the promoter image was well-formed in isolation, its use in the Prow Job configuration was incorrect. That is, kubernetes/test-infra#23148 should have included changes to the CLI arguments passed into the image.

See https://kubernetes.slack.com/archives/CJH2GBF7Y/p1628202615135000 for initial thoughts on this problem.

Proposed changes

So how do we prevent this problem in the future?

Apart from a more formalized process for designating new promoter images for production (involving more review from the Release Engineering group), there are additional changes we can make:

  1. We should create a Prow Job configuration test that enforces rules for how we invoke the promoter image (e.g., make sure that the arguments we're passing into the image are correct), so that when we attempt to bump the promoter version in a Prow Job, that change can be tested. We can use the pattern of creating a unit test in Prow (see this example). This would automatically act as a presubmit for Step 3 above (a rough sketch follows this list).
  2. We can additionally have a similar test (perhaps using https://github.com/GoogleContainerTools/container-structure-test as a framework) to check CLI arguments as part of Step 2. That is, we could have a new presubmit job that watches the promoter's own promotion manifest file and then runs a test that performs QA against the to-be-promoted-to-production image. The lack of such a test is the reason this GitHub issue was originally created. As per the OP, this test could instead be implemented as a postsubmit against the CIP codebase rather than against the promoter's manifest, although in that case we'd probably have to run a sleep loop waiting until a staging image is available for testing (that is, wait for post-cip-push-image-cip to finish before executing).
  3. We can also add unit tests directly in the CIP repo that check CLI argument/flag handling. These would force us to bump version numbers, etc. whenever the tests break (or simply when new features land). Such unit tests are somewhat removed from the original concern of this postmortem, but they would still surface CLI-argument issues worth knowing about (see the second sketch below, following the next paragraph).
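As a rough illustration of change 1 (referenced in the list above), the sketch below walks a job config directory and rejects unknown flags passed to promoter containers. It is not the actual test-infra config test; the config/jobs path, the artifact-promoter image substring, and the flag allowlist are all assumptions for the sake of the example.

```go
// promoter_args_test.go
//
// Sketch of proposed change 1: a config-level unit test that inspects Prow
// Job definitions and rejects unknown flags passed to the promoter image.
// The config directory, image substring, and allowed flag set are
// illustrative assumptions, not the real test-infra layout.
package main

import (
	"os"
	"path/filepath"
	"strings"
	"testing"

	"sigs.k8s.io/yaml"
)

// Minimal slice of the Prow job config schema: just enough to reach the
// container images and args of postsubmit jobs.
type jobConfig struct {
	Postsubmits map[string][]struct {
		Name string `json:"name"`
		Spec *struct {
			Containers []struct {
				Image string   `json:"image"`
				Args  []string `json:"args"`
			} `json:"containers"`
		} `json:"spec"`
	} `json:"postsubmits"`
}

// Placeholder allowlist; the real list would come from the promoter's CLI.
var allowedPromoterFlags = map[string]bool{
	"--manifest":          true,
	"--thin-manifest-dir": true,
	"--dry-run":           true,
}

func TestPromoterJobArgs(t *testing.T) {
	configDir := "config/jobs" // assumed location of job configs
	err := filepath.Walk(configDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var cfg jobConfig
		if err := yaml.Unmarshal(raw, &cfg); err != nil {
			return nil // not a job config file; ignore it in this sketch
		}
		for _, jobs := range cfg.Postsubmits {
			for _, job := range jobs {
				if job.Spec == nil {
					continue
				}
				for _, c := range job.Spec.Containers {
					if !strings.Contains(c.Image, "artifact-promoter") {
						continue
					}
					for _, arg := range c.Args {
						flag := strings.SplitN(arg, "=", 2)[0]
						if strings.HasPrefix(flag, "-") && !allowedPromoterFlags[flag] {
							t.Errorf("%s: job %q passes unknown promoter flag %q", path, job.Name, arg)
						}
					}
				}
			}
		}
		return nil
	})
	if err != nil {
		t.Fatalf("walking %s: %v", configDir, err)
	}
}
```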

I think we should do all 3 of the above, in the order they are listed. But how can we maximize code reuse across all 3 changes? It would be painful to have 3 separate sets of tests, requiring us to update each of these separately. I assume changes 2 and 3 can somehow share code, but I'm not sure about the first.
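On the code-reuse question: one option (a sketch under assumed names, not an existing package) is to put the canonical flag list in a tiny shared Go package and have both the CIP unit test (change 3) and the image/Prow-config checks consume it. The cliFlagSet helper below is hypothetical and stands in for however CIP actually registers its flags.

```go
// cliapi/cliapi.go
//
// Hypothetical shared package: a single source of truth for the promoter's
// supported CLI surface. Flag names here are placeholders.
package cliapi

// ExpectedFlags is consumed by the CIP unit tests, the image QA test, and
// potentially the Prow config check.
var ExpectedFlags = []string{"manifest", "thin-manifest-dir", "dry-run"}
```

```go
// cliapi/cliapi_test.go
package cliapi

import (
	"flag"
	"testing"
)

// cliFlagSet is a stand-in for the real flag registration in CIP; in the
// actual repo this would be the production code path that defines flags.
func cliFlagSet() *flag.FlagSet {
	fs := flag.NewFlagSet("cip", flag.ContinueOnError)
	fs.String("manifest", "", "path to promotion manifest")
	fs.String("thin-manifest-dir", "", "path to thin manifest directory")
	fs.Bool("dry-run", true, "do not push images")
	return fs
}

// TestExpectedFlagsAreDefined is the change-3 style unit test: it fails as
// soon as a flag in the shared list disappears from the binary's flag set.
func TestExpectedFlagsAreDefined(t *testing.T) {
	fs := cliFlagSet()
	for _, name := range ExpectedFlags {
		if fs.Lookup(name) == nil {
			t.Errorf("flag --%s is in the CLI API list but not defined by the binary", name)
		}
	}
}
```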

@listx listx reopened this Aug 6, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 6, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 6, 2021
@listx
Contributor Author

listx commented Aug 6, 2021

/assign @listx

@justaugustus
Member

So it's in my queue to help with:
/assign

FYI @kubernetes-sigs/release-engineering on #144 (comment).

@listx
Contributor Author

listx commented Aug 6, 2021

@justaugustus I think I can help with adding a unit test in Prow (proposed change 1). I'm going to be working more in Prow anyway so it's a good exercise.

@tylerferrara
Contributor

Two questions:

1. Will these proposed solutions get their own issues?
2. In Step 3, are we suggesting the addition of another verify script to enforce a standard API? If so, I like this idea. As you mentioned, this API definition should be referenced in Step 1. Therefore, changing the API would only require a single file change.

Thinking out loud: In addition, we may want to reference this API in the README for new users to rely on specific behavior. New releases could ensure a set of static features, while incrementally adding new flags. This could be thought of as a "core" and "extra" set of features.

@listx
Contributor Author

listx commented Aug 6, 2021

> Two questions:
> Will these proposed solutions get their own issues?

Ideally yes; I'll try to keep this thread synced with any new issues related to this area.

> In Step 3, are we suggesting the addition of another verify script to enforce a standard API? If so, I like this idea. As you mentioned, this API definition should be referenced in Step 1. Therefore, changing the API would only require a single file change.

Yeah we could use some sort of standardized API definition for CIP and related tools. Maybe that "source of truth" will be in the form of a YAML configuration defined for https://github.com/GoogleContainerTools/container-structure-test. (Or is just rolling our own "CLI API" YAML definition better? IDK)

> Thinking out loud: In addition, we may want to reference this API in the README for new users to rely on specific behavior. New releases could ensure a set of static features, while incrementally adding new flags. This could be thought of as a "core" and "extra" set of features.

Not sure what you mean by core vs extra features, but yeah if we just have some piece of YAML that defines what the API should be, we can run with it for READMEs, docs, changelogs, etc.
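For what a checked-in "CLI API" YAML might look like, here is one possible shape plus a loader that the tests, README generation, and changelog tooling could all share. The field names and example flags are made up for illustration, extending the hypothetical cliapi sketch earlier in this thread.

```go
// cliapi/spec.go
//
// One possible shape for the "CLI API" YAML floated above; everything here
// is illustrative, not an existing format.
package cliapi

import "sigs.k8s.io/yaml"

// CLISpec describes the promoter's supported command-line surface.
type CLISpec struct {
	Binary string     `json:"binary"`
	Flags  []FlagSpec `json:"flags"`
}

type FlagSpec struct {
	Name     string `json:"name"`
	Type     string `json:"type"`               // e.g. "string", "bool"
	Required bool   `json:"required,omitempty"` // roughly the "core" vs "extra" split
	Help     string `json:"help,omitempty"`
}

// exampleSpec shows what the checked-in YAML file might look like.
const exampleSpec = `
binary: cip
flags:
  - name: manifest
    type: string
    required: true
    help: path to promotion manifest
  - name: dry-run
    type: bool
    help: do not push images
`

// LoadSpec parses a CLI API spec; tests, doc generators, and the Prow
// config check could all consume the same parsed structure.
func LoadSpec(data []byte) (*CLISpec, error) {
	var s CLISpec
	if err := yaml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```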

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Dec 13, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.
