
create qualification tests for images in staging #144

Closed
listx opened this issue Oct 21, 2019 · 35 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@listx
Contributor

listx commented Oct 21, 2019

As noted in kubernetes/k8s.io#406 (review), the staging image up for promotion should link to evidence that it has been tested (perhaps some more thorough e2e tests exercising various flags, etc.). This could be a Prow postsubmit job. That job could run against the master branch and, for every update to it:

  1. push a new version to the staging repository
  2. run various e2e tests (the same e2e test suite we already have)

Then the logs for this job could be linked to from every promotion PR that promotes the newly-tested version to prod.
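To make the qualification idea a bit more concrete, here is a minimal sketch of the kind of smoke check such a postsubmit could run against the freshly pushed staging image. This is not the existing e2e suite: it assumes Docker is available in the job, that the image reference is passed in via a STAGING_IMAGE environment variable, and the flag names it greps for are placeholders rather than the promoter's real CLI.

```go
// qualify_staging_image_test.go
//
// Illustrative sketch only: assumes Docker is available in the postsubmit
// environment and that STAGING_IMAGE points at the freshly pushed staging
// image (e.g. something under gcr.io/k8s-staging-artifact-promoter). The
// flag names below are placeholders, not the promoter's real CLI.
package main

import (
	"os"
	"os/exec"
	"strings"
	"testing"
)

func TestStagingImageSmoke(t *testing.T) {
	image := os.Getenv("STAGING_IMAGE")
	if image == "" {
		t.Skip("STAGING_IMAGE not set; skipping qualification smoke test")
	}

	// Pull the image the postsubmit just pushed to staging.
	if out, err := exec.Command("docker", "pull", image).CombinedOutput(); err != nil {
		t.Fatalf("pulling %s: %v\n%s", image, err, out)
	}

	// Run the binary inside the image with -h and confirm that the flags
	// our Prow Job configs depend on are still advertised.
	out, err := exec.Command("docker", "run", "--rm", image, "-h").CombinedOutput()
	if err != nil && len(out) == 0 {
		// Many CLIs exit non-zero on -h; only fail if we got no output at all.
		t.Fatalf("running %s -h: %v", image, err)
	}
	for _, f := range []string{"-manifest", "-dry-run"} { // placeholder flag names
		if !strings.Contains(string(out), f) {
			t.Errorf("expected help output to mention %q; got:\n%s", f, out)
		}
	}
}
```

The real job would of course run the full e2e suite; the point is only that some check runs against the staging image before any promotion PR is opened.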

@rajibmitra

I would like to work on the Prow job if it's OK @listx @rikatz

@rajibmitra

/assign rajibmitra

@rikatz

rikatz commented Oct 29, 2019 via email

@listx
Contributor Author

listx commented Oct 29, 2019

Sounds good to me. Please write a design doc of some sort to sketch out your thoughts first before pushing up an implementation. You can then share it with https://groups.google.com/forum/#!newtopic/kubernetes-wg-k8s-infra

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2020
@rajibmitra

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2020
@rikatz

rikatz commented Jan 27, 2020

@rajibmitra do you need something (any help, insights) to write that document? :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@listx
Contributor Author

listx commented Jun 30, 2020

/reopen

The work that @yodahekinsew is doing touches on this area, as he is expanding the set of tests run against staging images. See https://docs.google.com/document/d/1D5FEU58j3_V8Fxl98I3vcqNCxzpD9cV9Gtpxf9wGO4I/edit#heading=h.4bti0lhcgu9m.

@k8s-ci-robot k8s-ci-robot reopened this Jun 30, 2020
@k8s-ci-robot
Contributor

@listx: Reopened this issue.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

@listx
Contributor Author

listx commented Aug 6, 2021

/reopen
/remove-lifecycle rotten
/lifecycle active

Intro

OK, so we got bitten by the lack of QA tests for staging images in kubernetes/test-infra#23148 (reverted in kubernetes/test-infra#23151). There are several things going on, but let me capture what's happening today. You can think of this comment as a mini-postmortem doc.

Current state of affairs

Step 1: Code changes to CIP repo

We merge PRs into this repo to update the master branch. When that branch moves (a PR is merged to improve the CIP tool and other related things), post-cip-push-image-cip runs (config here) and a new image appears in gcr.io/k8s-staging-artifact-promoter/cip-auditor.

Step 2: Promotion from staging to prod

Later, when the powers that be deem that the image should be promoted to production, we create a PR to "promote the promoter", like this. When the PR is merged, the staging image is promoted to prod (via post-k8sio-image-promo).

Step 3: Update of Prow Job configuration

Finally, we can now also update existing Prow Jobs that rely on the promoter to use this new production version. A PR is created to update the Prow Jobs, bumping the image version used, like this.

The fire/problem

Now consider kubernetes/test-infra#23148. Alas, it needed to be reverted. The problem (captured in the logs) was that the CLI flag handling logic changed between the old version used prior to this PR (version v2.4.1) and the one proposed by this PR (version v20210805-v1.337.0-136-g8f6bfde). (Additionally, @justaugustus reported that we had not yet fixed the --dry-run behavior in this version, but I won't go into that here.) So in summary, even though the promoter image was well-formed in isolation, its use in the Prow Job configuration was incorrect. That is, kubernetes/test-infra#23148 should have included changes to the CLI arguments passed into the image.

See https://kubernetes.slack.com/archives/CJH2GBF7Y/p1628202615135000 for initial thoughts on this problem.

Proposed changes

So how do we prevent this problem in the future?

Apart from a more formalized process for designating new promoter images for production (involving more review from the Release Engineering group), there are additional changes we can make:

  1. We should create a Prow Job configuration test that enforces rules for how we invoke the promoter image (e.g., make sure that the arguments we're passing into the image are correct), so that when we attempt to bump the promoter version in a Prow Job, that change can be tested. We can use the pattern of creating a unit test in Prow (see this example). This would automatically act as a presubmit for Step 3 above (a rough sketch follows this list).
  2. We can additionally have a similar test (perhaps using https://github.com/GoogleContainerTools/container-structure-test as a framework) to check CLI arguments as part of Step 2. That is, we could have a new presubmit job that watches the promoter's own promotion manifest file and then runs a test that performs QA against the to-be-promoted-to-production image. The lack of such a test is the reason this GitHub issue was originally created. As per the OP, this test could instead be implemented as a postsubmit against the CIP codebase rather than against the promoter's manifest, although in that case we'd probably have to run a sleep loop waiting until a staging image is available for testing (that is, wait for post-cip-push-image-cip to finish before executing).
  3. We can also add unit tests directly in the CIP repo that check CLI argument/flag handling. These would force us to bump version numbers, etc. whenever the tests break (or simply when new features land). Such unit tests are somewhat removed from the original concern of this postmortem, but they would still surface CLI-argument issues worth knowing about (see the second sketch below, following the next paragraph).
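As a rough illustration of change 1 (referenced in the list above), the sketch below walks a job config directory and rejects unknown flags passed to promoter containers. It is not the actual test-infra config test; the config/jobs path, the artifact-promoter image substring, and the flag allowlist are all assumptions for the sake of the example.

```go
// promoter_args_test.go
//
// Sketch of proposed change 1: a config-level unit test that inspects Prow
// Job definitions and rejects unknown flags passed to the promoter image.
// The config directory, image substring, and allowed flag set are
// illustrative assumptions, not the real test-infra layout.
package main

import (
	"os"
	"path/filepath"
	"strings"
	"testing"

	"sigs.k8s.io/yaml"
)

// Minimal slice of the Prow job config schema: just enough to reach the
// container images and args of postsubmit jobs.
type jobConfig struct {
	Postsubmits map[string][]struct {
		Name string `json:"name"`
		Spec *struct {
			Containers []struct {
				Image string   `json:"image"`
				Args  []string `json:"args"`
			} `json:"containers"`
		} `json:"spec"`
	} `json:"postsubmits"`
}

// Placeholder allowlist; the real list would come from the promoter's CLI.
var allowedPromoterFlags = map[string]bool{
	"--manifest":          true,
	"--thin-manifest-dir": true,
	"--dry-run":           true,
}

func TestPromoterJobArgs(t *testing.T) {
	configDir := "config/jobs" // assumed location of job configs
	err := filepath.Walk(configDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var cfg jobConfig
		if err := yaml.Unmarshal(raw, &cfg); err != nil {
			return nil // not a job config file; ignore it in this sketch
		}
		for _, jobs := range cfg.Postsubmits {
			for _, job := range jobs {
				if job.Spec == nil {
					continue
				}
				for _, c := range job.Spec.Containers {
					if !strings.Contains(c.Image, "artifact-promoter") {
						continue
					}
					for _, arg := range c.Args {
						flag := strings.SplitN(arg, "=", 2)[0]
						if strings.HasPrefix(flag, "-") && !allowedPromoterFlags[flag] {
							t.Errorf("%s: job %q passes unknown promoter flag %q", path, job.Name, arg)
						}
					}
				}
			}
		}
		return nil
	})
	if err != nil {
		t.Fatalf("walking %s: %v", configDir, err)
	}
}
```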

I think we should do all 3 of the above, in the order they are listed. But how can we maximize code reuse across all 3 changes? It would be painful to have 3 separate sets of tests, requiring us to update each of these separately. I assume changes 2 and 3 can somehow share code, but I'm not sure about the first.
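On the code-reuse question: one option (a sketch under assumed names, not an existing package) is to put the canonical flag list in a tiny shared Go package and have both the CIP unit test (change 3) and the image/Prow-config checks consume it. The cliFlagSet helper below is hypothetical and stands in for however CIP actually registers its flags.

```go
// cliapi/cliapi.go
//
// Hypothetical shared package: a single source of truth for the promoter's
// supported CLI surface. Flag names here are placeholders.
package cliapi

// ExpectedFlags is consumed by the CIP unit tests, the image QA test, and
// potentially the Prow config check.
var ExpectedFlags = []string{"manifest", "thin-manifest-dir", "dry-run"}
```

```go
// cliapi/cliapi_test.go
package cliapi

import (
	"flag"
	"testing"
)

// cliFlagSet is a stand-in for the real flag registration in CIP; in the
// actual repo this would be the production code path that defines flags.
func cliFlagSet() *flag.FlagSet {
	fs := flag.NewFlagSet("cip", flag.ContinueOnError)
	fs.String("manifest", "", "path to promotion manifest")
	fs.String("thin-manifest-dir", "", "path to thin manifest directory")
	fs.Bool("dry-run", true, "do not push images")
	return fs
}

// TestExpectedFlagsAreDefined is the change-3 style unit test: it fails as
// soon as a flag in the shared list disappears from the binary's flag set.
func TestExpectedFlagsAreDefined(t *testing.T) {
	fs := cliFlagSet()
	for _, name := range ExpectedFlags {
		if fs.Lookup(name) == nil {
			t.Errorf("flag --%s is in the CLI API list but not defined by the binary", name)
		}
	}
}
```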

@listx listx reopened this Aug 6, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 6, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 6, 2021
@listx
Contributor Author

listx commented Aug 6, 2021

/assign @listx

@justaugustus
Member

So it's in my queue to help with:
/assign

FYI @kubernetes-sigs/release-engineering on #144 (comment).

@listx
Contributor Author

listx commented Aug 6, 2021

@justaugustus I think I can help with adding a unit test in Prow (proposed change 1). I'm going to be working more in Prow anyway so it's a good exercise.

@tylerferrara
Contributor

Two questions:

1. Will these proposed solutions get their own issues?
2. In Step 3, are we suggesting the addition of another verify script to enforce a standard API? If so, I like this idea. As you mentioned, this API definition should be referenced in Step 1. Therefore, changing the API would only require a single file change.

Thinking out loud: In addition, we may want to reference this API in the README for new users to rely on specific behavior. New releases could ensure a set of static features, while incrementally adding new flags. This could be thought of as a "core" and "extra" set of features.

@listx
Contributor Author

listx commented Aug 6, 2021

> Two questions:
> Will these proposed solutions get their own issues?

Ideally yes; I'll try to keep this thread synced with any new issues related to this area.

> In Step 3, are we suggesting the addition of another verify script to enforce a standard API? If so, I like this idea. As you mentioned, this API definition should be referenced in Step 1. Therefore, changing the API would only require a single file change.

Yeah we could use some sort of standardized API definition for CIP and related tools. Maybe that "source of truth" will be in the form of a YAML configuration defined for https://github.com/GoogleContainerTools/container-structure-test. (Or is just rolling our own "CLI API" YAML definition better? IDK)

> Thinking out loud: In addition, we may want to reference this API in the README for new users to rely on specific behavior. New releases could ensure a set of static features, while incrementally adding new flags. This could be thought of as a "core" and "extra" set of features.

Not sure what you mean by core vs extra features, but yeah if we just have some piece of YAML that defines what the API should be, we can run with it for READMEs, docs, changelogs, etc.
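For what a checked-in "CLI API" YAML might look like, here is one possible shape plus a loader that the tests, README generation, and changelog tooling could all share. The field names and example flags are made up for illustration, extending the hypothetical cliapi sketch earlier in this thread.

```go
// cliapi/spec.go
//
// One possible shape for the "CLI API" YAML floated above; everything here
// is illustrative, not an existing format.
package cliapi

import "sigs.k8s.io/yaml"

// CLISpec describes the promoter's supported command-line surface.
type CLISpec struct {
	Binary string     `json:"binary"`
	Flags  []FlagSpec `json:"flags"`
}

type FlagSpec struct {
	Name     string `json:"name"`
	Type     string `json:"type"`               // e.g. "string", "bool"
	Required bool   `json:"required,omitempty"` // roughly the "core" vs "extra" split
	Help     string `json:"help,omitempty"`
}

// exampleSpec shows what the checked-in YAML file might look like.
const exampleSpec = `
binary: cip
flags:
  - name: manifest
    type: string
    required: true
    help: path to promotion manifest
  - name: dry-run
    type: bool
    help: do not push images
`

// LoadSpec parses a CLI API spec; tests, doc generators, and the Prow
// config check could all consume the same parsed structure.
func LoadSpec(data []byte) (*CLISpec, error) {
	var s CLISpec
	if err := yaml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```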

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Dec 13, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.
