Skip to content

Conversation

tmshort
Copy link
Contributor

@tmshort tmshort commented Jan 11, 2024

Fix #3151

The Reason check for CatalogSourcesUnhealthy does not take into account that CatalogSourcesAdded may overwrite the AllCatalogSourcesHealthy reason because the number of catalog sources changed. So, depending on when reconcilliation happens relative to the catalog sources being added, the initial Reason may be either Added or Healthy.

At the end of the test, also depending on when reconcilliation happens, the new Reason may be Added or Healthy.

  • Both could be the same (Added or Healthy)
  • Or, the original could be Added and the new could be Healthy

New can't be Added after the original is Healthy because no additional catalog sources are added once the test starts.

However, the Status and Message (which is the real indicator of health) must match. And because of the possible update, don't check the timestamps.

Also:

  • Clean up some looping log messages
  • Clean up some logging added when comments were converted

These comments/logs are at the beginning of the test, and are also part of the test sequence, so they are redundant (and possibly confusing)

Description of the change:

Motivation for the change:

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2024
// The Reason check:
// CatalogSources[Added|Deleted] could overwrite AllCatalogSourcesHealthy, but leave Status/Message preserved
// Added/Deleted could subesequently be overwritten by Healthy
// This all depends on when reconciliation occurs and if the number of catalog sources changed between reconcilliations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This all depends on when reconciliation occurs and if the number of catalog sources changed between reconcilliations
// This all depends on when reconciliation occurs and if the number of catalog sources changed between reconciliations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh... no spell checker in my code editor!

if lastCond.Status != cond.Status && lastCond.Reason != cond.Reason && lastCond.Message == cond.Message {
GinkgoT().Logf("waited %s subscription condition met %v\n", time.Since(lastTime), cond)
lastTime = time.Now()
lastCond = cond
}
return true
}

fmt.Printf("subscription condition not met: %v\n", cond)
if lastCond.Status != cond.Status && lastCond.Reason != cond.Reason && lastCond.Message == cond.Message {
GinkgoT().Logf("waited %s subscription condition not met: %v\n", time.Since(lastTime), cond)
lastTime = time.Now()
lastCond = cond
}
return false
Copy link
Contributor

@grokspawn grokspawn Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love if there were a way to write this for greater readability.
Maybe a comment something like

// if status/reason/message meet expectations, then subscription state is considered met/true IFF this is the result of a recent change of status/reason
// else, cache the current status/reason for next loop/comparison

I also have a concern that this makes this test edge-set instead of level-set. I'm not sure how purist we need to be for an e2e test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the general trend to reduce repetitive logging, and I admit, the comparisons are ugly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUT... that's not the reason for the change... The change reduces the logging when conditions don't change, the reason for getting out of the loop doesn't change, just the output.

require.FailNowf(GinkgoT(), "failed", "subscription contains unexpected installplan condition: %v", cond)
case operatorsv1alpha1.SubscriptionInstallPlanMissing:
require.Equal(GinkgoT(), operatorsv1alpha1.ReferencedInstallPlanNotFound, cond.Reason)
case operatorsv1alpha1.SubscriptionCatalogSourcesUnhealthy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this test is written makes it sensitive to any change in the cluster state or the reconciliation flow of the controller. Since we're trying to test a negative ("all of the non-InstallPlan conditions can change between the time we looked at them in the beginning of the test and now"), I don't know that this test is super powerful, and, since that kind of approach is very fragile, I wonder if a different approach exists (maybe just to delete the test? what is it testing?).

Consider the case where a CatalogSource happens to blip unhealthy for a moment during this check. Why should the test fail if the reconciler has correctly exposed that downtime in status?


By(`Ensure original non-InstallPlan status conditions remain after InstallPlan transitions`)
hashEqual := comparison.NewHashEqualitor()
By(`Ensure original, specific, non-InstallPlan status conditions remain after InstallPlan transitions`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By(`Ensure original, specific, non-InstallPlan status conditions remain after InstallPlan transitions`)
By(`Ensure InstallPlan-related status conditions match what we're expecting`)

Fix operator-framework#3151

Remove non-InstallPlan related checks for this test.

Also:
* Clean up some looping log messages
* Clean up some logging added when comments were converted

These comments/logs are at the beginning of the test, and are also part
of the test sequence, so they are redundant (and possibly confusing)

Signed-off-by: Todd Short <todd.short@me.com>
Copy link

openshift-ci bot commented Jan 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Jan 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 12, 2024
@tmshort tmshort added this pull request to the merge queue Jan 12, 2024
@stevekuznetsov
Copy link
Contributor

The test failed in that e2e run @tmshort

 • [FAILED] [359.116 seconds]
Install Plan [It] creation with permissions
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2709

  Captured StdOut/StdErr Output >>
  waiting 101.868485ms for catalog pod install-plan-e2e-l7ng6/nginx-catalogzwjbt to be available (for sync) - NO_CONNECTION
  waiting 1.40117196s for catalog pod install-plan-e2e-l7ng6/nginx-catalogzwjbt to be available (for sync) - IDLE
  waiting 199.948461ms for catalog pod install-plan-e2e-l7ng6/nginx-catalogzwjbt to be available (for sync) - TRANSIENT_FAILURE
  waiting 45.296298808s for catalog pod install-plan-e2e-l7ng6/nginx-catalogzwjbt to be available (for sync) - READY
  probing catalog nginx-catalogzwjbt pod with address nginx-catalogzwjbt.install-plan-e2e-l7ng6.svc:50051
  skipping health check
  Using the kubectl kubectl binary
  Using the artifacts/install-plan-e2e-l7ng6 output directory
  Storing the test artifact output in the artifacts/install-plan-e2e-l7ng6 directory
  Collecting get catalogsources -o yaml output...
  Collecting get subscriptions -o yaml output...
  Collecting get operatorgroups -o yaml output...
  Collecting get clusterserviceversions -o yaml output...
  Collecting get installplans -o yaml output...
  Collecting get pods -o wide output...
  Collecting get events --sort-by .lastTimestamp output...
  << Captured StdOut/StdErr Output

  Timeline >>
  created the install-plan-e2e-l7ng6 testing namespace
  created the install-plan-e2e-l7ng6/install-plan-e2e-l7ng6-operatorgroup operator group
  STEP: This It spec creates an InstallPlan with a CSV containing a set of permissions to be resolved. @ 01/12/24 17:49:32.616
  STEP: Create manifests @ 01/12/24 17:49:32.616
  STEP: Create new CRDs @ 01/12/24 17:49:32.616
  STEP: Defer CRD clean up @ 01/12/24 17:49:32.616
  STEP: Generate permissions @ 01/12/24 17:49:32.616
  STEP: Permissions must be different than ClusterPermissions defined below if OLM is going to lift role/rolebindings to cluster level. @ 01/12/24 17:49:32.616
  STEP: Generate permissions @ 01/12/24 17:49:32.616
  STEP: Create a new NamedInstallStrategy @ 01/12/24 17:49:32.616
  STEP: Create an nginx details deployment @ 01/12/24 17:49:32.616
  STEP: Create new CSVs @ 01/12/24 17:49:32.616
  STEP: set a simple default strategy if none given @ 01/12/24 17:49:32.616
  STEP: Populate owned and required @ 01/12/24 17:49:32.616
  STEP: Create CatalogSource @ 01/12/24 17:49:32.616
  Creating catalog source nginx-catalogzwjbt in namespace install-plan-e2e-l7ng6...
  Catalog source nginx-catalogzwjbt created
  STEP: Attempt to get CatalogSource @ 01/12/24 17:49:32.625
  STEP: Creating a Subscription @ 01/12/24 17:50:19.625
  STEP: Attempt to get Subscription @ 01/12/24 17:50:19.827
  17:50:22.8253: subscription install-plan-e2e-l7ng6/sub-nginx-jckqm state: UpgradePending (csv nginx9wkw7-stable): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"install-plan-e2e-l7ng6", Name:"install-9q6nf", UID:"0d618462-b41b-47c9-b196-55c400f76e94", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"6870", FieldPath:""}
  waiting 2.998272267s for subscription install-plan-e2e-l7ng6/sub-nginx-jckqm to have installplan ref: has ref &v1.ObjectReference{Kind:"InstallPlan", Namespace:"install-plan-e2e-l7ng6", Name:"install-9q6nf", UID:"0d618462-b41b-47c9-b196-55c400f76e94", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"6870", FieldPath:""}
  STEP: Attempt to get InstallPlan @ 01/12/24 17:50:22.825
  waiting 200.764917ms for installplan install-plan-e2e-l7ng6/install-9q6nf to be phases [Failed Complete], in phase Installing
  waiting 4.00105276s for installplan install-plan-e2e-l7ng6/install-9q6nf to be phases [Failed Complete], in phase Complete
  STEP: Expect correct RBAC resources to be resolved and created @ 01/12/24 17:50:27.027
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"Role", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"RoleBinding", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRole", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterServiceVersion", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: true kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.027
  Found unexpected step registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}, expected registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"Role", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"RoleBinding", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRole", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"insq7nc7.cluster.com", Kind:"CustomResourceDefinition", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: false kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.028
  Found unexpected step registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRole", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"Role", Namespace:""}: name has prefix: false kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx-satwc5x", Kind:"ServiceAccount", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"RoleBinding", Namespace:""}: name has prefix: false kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.028
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-nginx--8mRWRTLyg3nxeqD1FhDlo3kgPA9kzEwxxNItgB", Kind:"Role", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-nginx--8mRWRTLyg3nxeqD1FhDlo3kgPA9kzEwxxNItgB", Kind:"Role", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"RoleBinding", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-nginx--8mRWRTLyg3nxeqD1FhDlo3kgPA9kzEwxxNItgB", Kind:"Role", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRole", Namespace:""}: name has prefix: true kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.028
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-nginx--8mRWRTLyg3nxeqD1FhDlo3kgPA9kzEwxxNItgB", Kind:"RoleBinding", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRole", Namespace:""}: name has prefix: true kinds match false
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-nginx--8mRWRTLyg3nxeqD1FhDlo3kgPA9kzEwxxNItgB", Kind:"RoleBinding", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: true kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.131
  Found unexpected step registry.ResourceKey{Name:"nginx9wkw7-stable-VXLzI72iJaE5o2F2iQIMJnlP45TwaviggOXQY", Kind:"ClusterRole", Namespace:""}, expected registry.ResourceKey{Name:"nginx9wkw7-stable", Kind:"ClusterRoleBinding", Namespace:""}: name has prefix: true kinds match false
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.237
  STEP: This operator was installed into a global operator group, so the roles should have been lifted to clusterroles @ 01/12/24 17:50:27.237
  STEP: Should have removed every matching step @ 01/12/24 17:50:27.237
  STEP: Explicitly deleting subscription install-plan-e2e-l7ng6/sub-nginx-jckqm @ 01/12/24 17:50:27.237
  STEP: Looking for no error OR IsNotFound error @ 01/12/24 17:50:27.244
  STEP: Waiting for the Subscription to delete @ 01/12/24 17:50:27.244
  waiting for subscription install-plan-e2e-l7ng6/sub-nginx-jckqm to delete
  subscription install-plan-e2e-l7ng6/sub-nginx-jckqm deleted
  STEP: Explicitly deleting csv install-plan-e2e-l7ng6/nginx9wkw7-stable @ 01/12/24 17:50:27.425
  STEP: Looking for no error OR IsNotFound error @ 01/12/24 17:50:27.629
  STEP: Waiting for the CSV to delete @ 01/12/24 17:50:27.629
  waiting for csv install-plan-e2e-l7ng6/nginx9wkw7-stable to delete
  waited 796.858535ms for csv install-plan-e2e-l7ng6/nginx9wkw7-stable status: Pending (RequirementsNotMet): one or more requirements couldn't be found
  waited 801.340402ms for csv install-plan-e2e-l7ng6/nginx9wkw7-stable status: InstallReady (AllRequirementsMet): all requirements found, attempting install
  waited 198.545724ms for csv install-plan-e2e-l7ng6/nginx9wkw7-stable status: Installing (InstallSucceeded): waiting for install components to report healthy
  waited 201.613406ms for csv install-plan-e2e-l7ng6/nginx9wkw7-stable status: Installing (InstallWaiting): installing: waiting for deployment dep-tpvgt to become ready: deployment "dep-tpvgt" not available: Deployment does not have minimum availability.
  waited 1.199015945s for csv install-plan-e2e-l7ng6/nginx9wkw7-stable status: Succeeded (InstallSucceeded): install strategy completed with no errors
  Cleaning catalog source nginx-catalogzwjbt
  Deleting config map nginx-catalogzwjbt-configmap...
  Deleting catalog source nginx-catalogzwjbt...
  waiting for the catalog source nginx-catalogzwjbt-vssb6 pod to be deleted...
  Done cleaning catalog source nginx-catalogzwjbt
  [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2888 @ 01/12/24 17:55:29.236
  collecting the install-plan-e2e-l7ng6 namespace artifacts as the 'creation with permissions' test case failed
  collecting logs in the ./artifacts/ artifacts directory
  tearing down the install-plan-e2e-l7ng6 namespace
  resetting e2e kube client
  deleting install-plan-e2e-l7ng6/install-plan-e2e-l7ng6-operatorgroup
  deleting <global>/install-plan-e2e-l7ng6
  garbage collecting CRDs
  << Timeline

  [FAILED] 
  	Error Trace:	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2888
  	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/node.go:463
  	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:865
  	            				/opt/hostedtoolcache/go/1.20.12/x64/src/runtime/asm_amd64.s:1598
  	Error:      	Received unexpected error:
  	            	timed out waiting for the condition
  	Test:       	Install Plan creation with permissions
  
  In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2888 @ 01/12/24 17:55:29.236

  Full Stack Trace
    github.com/stretchr/testify/require.NoError({0x7f6cdcd40740, 0xc000ffe3f0}, {0x42dff40, 0xc00022e120}, {0x0, 0x0, 0x0})
    	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/stretchr/testify/require/require.go:1357 +0x96
    github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func16.13()
    	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2888 +0x230e

@tmshort
Copy link
Contributor Author

tmshort commented Jan 12, 2024

That should be including the (new) finalizer code in the merge. AFAICT, it's failing on the delete of the CSV (not on the removal of the CRBs/CRs/SA), as in the original failure.

The merge queue was also failing on unit test, so something was wonky.

Merged via the queue into operator-framework:master with commit 5299830 Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flake] End-to-end.[It] Subscription can reconcile InstallPlan status
3 participants