OCPBUGS-74343: Fix OTE panic and configure test timeout and parallelism for test suites #390

Open
sunzhaohua2 wants to merge 2 commits into openshift:main from sunzhaohua2:timeout

Conversation

sunzhaohua2 (Contributor) commented Feb 9, 2026

When running the e2e tests with OTE, tests failed with panic errors (see the failed job):

/go/src/github.com/openshift/cluster-control-plane-machine-set-operator/test/e2e/helpers/cases.go:216 +0x3e

fail [runtime/panic.go:262]: Test Panicked: runtime error: invalid memory address or nil pointer dereference
fail [runtime/panic.go:262]: Test Panicked

This happens because, while BuildExtensionTestSpecsFromOpenShiftGinkgoSuite() builds the test specs, helpers.Helper(framework.GlobalFramework) is evaluated, but framework.GlobalFramework is still nil at that point because BeforeAll has not run yet.
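A minimal Go sketch of this failure mode, using hypothetical stand-ins (`Framework`, `GlobalFramework`, `NewHelper`, `defineSpecs`) rather than the real framework and helpers packages: an argument written directly in a spec definition is evaluated as soon as the specs are built, so a package-level pointer that a setup hook fills in later is still nil, and the dereference panics.

```go
package main

import "fmt"

// Framework is a hypothetical stand-in for the e2e framework type.
type Framework struct{ Name string }

// GlobalFramework mirrors a package-level pointer that is only assigned
// later, e.g. by a BeforeAll-style hook (hypothetical, for illustration).
var GlobalFramework *Framework

// Helper mirrors a helper whose constructor dereferences the framework.
type Helper struct{ frameworkName string }

// NewHelper reads f.Name immediately, so it panics when f is nil.
func NewHelper(f *Framework) Helper {
	return Helper{frameworkName: f.Name}
}

// defineSpecs evaluates NewHelper(GlobalFramework) while specs are being
// built, before any setup hook has run, and converts the panic to an error.
func defineSpecs() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panicked while building specs: %v", r)
		}
	}()
	helper := NewHelper(GlobalFramework)
	_ = helper
	return nil
}

func main() {
	// Prints the recovered nil pointer dereference error.
	fmt.Println(defineSpecs())
}
```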

PR 384 moved the framework initialization before the test specs are built, but that broke the info and list commands (see the Slack discussion), because those commands run without a cluster connection.

This PR reads framework.GlobalFramework at run time instead of at spec definition time, so the tests no longer panic.
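A minimal sketch of the run-time approach, again with hypothetical stand-ins for the framework and helper: the spec definition only captures a closure, and `GlobalFramework` is read inside the test body, after the setup hook has assigned it.

```go
package main

import "fmt"

// Framework, GlobalFramework, Helper, and NewHelper are hypothetical
// stand-ins for the real e2e framework and helpers.
type Framework struct{ Name string }

var GlobalFramework *Framework

type Helper struct{ frameworkName string }

func NewHelper(f *Framework) Helper { return Helper{frameworkName: f.Name} }

// spec pairs a name with a body that runs later, like a Ginkgo It block.
type spec struct {
	name string
	body func() string
}

// defineSpecs only captures a closure; GlobalFramework is not read here.
func defineSpecs() []spec {
	return []spec{{
		name: "rolling update replaces the outdated machine",
		body: func() string {
			h := NewHelper(GlobalFramework) // read at run time, not definition time
			return h.frameworkName
		},
	}}
}

func main() {
	specs := defineSpecs()                    // safe: nothing is dereferenced yet
	GlobalFramework = &Framework{Name: "e2e"} // setup hook runs before any body
	for _, s := range specs {
		fmt.Println(s.name, "->", s.body())
	}
}
```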

The PR also sets the test timeout for the suites to 90 minutes; otherwise it defaults to 15 minutes, as in this snippet from openshift/origin:

	timeout := o.Timeout
	if timeout == 0 {
		timeout = suite.TestTimeout
	}
	if timeout == 0 {
		timeout = 15 * time.Minute
	}

Testing info/list without a cluster connection:

zhsun:cluster-control-plane-machine-set-operator/ (timeout✗) $ bin/cluster-control-plane-machine-set-operator-ext info                            
{
    "apiVersion": "v1.1",
    "source": {
        "commit": "",
        "build_date": "",
        "git_tree_state": ""
    },
    "component": {
        "product": "openshift",
        "type": "payload",
        "name": "cluster-control-plane-machine-set-operator"
    },
    "suites": [
        {
            "name": "cpmso/periodic",
            "description": "",
            "qualifiers": [
                "(source == \"openshift:payload:cluster-control-plane-machine-set-operator\") \u0026\u0026 (labels.exists(l, l == \"Periodic\"))"
            ],
            "parallelism": 1,
            "clusterStability": "Disruptive",
            "testTimeout": 5400000000000
        },
        {
            "name": "cpmso/presubmit",
            "description": "",
            "qualifiers": [
                "(source == \"openshift:payload:cluster-control-plane-machine-set-operator\") \u0026\u0026 (labels.exists(l, l == \"PreSubmit\"))"
            ],
            "parallelism": 1,
            "clusterStability": "Disruptive",
            "testTimeout": 5400000000000
        }
    ],
    "images": null
}

zhsun:cluster-control-plane-machine-set-operator/ (timeout✗) $ bin/cluster-control-plane-machine-set-operator-ext list tests --suite cpmso/presubmit | grep "name" | wc -l
      15

Testing run-test with a cluster connection:

zhsun:cluster-control-plane-machine-set-operator/ (timeout✗) $ ./bin/cluster-control-plane-machine-set-operator-ext run-test -n "ControlPlaneMachineSet Operator With an active ControlPlaneMachineSet and the provider spec of index 1 is not as expected should rolling update replace the outdated machine"
  Running Suite:  - /Users/zhsun/go/src/github.com/openshift/cluster-control-plane-machine-set-operator
  =====================================================================================================
  Random Seed: 1770690515 - will randomize all specs

  Will run 1 of 1 specs
  ------------------------------
  ControlPlaneMachineSet Operator With an active ControlPlaneMachineSet and the provider spec of index 1 is not as expected should rolling update replace the outdated machine [PreSubmit, Disruptive, Serial]
  /Users/zhsun/go/src/github.com/openshift/cluster-control-plane-machine-set-operator/test/e2e/helpers/cases.go:129
    STEP: Waiting for the cluster operators to stabilise (minimum availability time: 1m0s, timeout: 10m0s, polling interval: 10s) @ 02/10/26 10:28:36.809
    STEP: Checking the control plane machine set exists @ 02/10/26 10:28:37.05
    STEP: Checking the control plane machine set is active @ 02/10/26 10:28:37.272
    STEP: Updating the provider spec of the control plane machine at index 1 @ 02/10/26 10:28:37.876
    STEP: Waiting for the updated replicas to equal desired replicas @ 02/10/26 10:28:38.567
    STEP: Waiting for the index 1 to be replaced @ 02/10/26 10:28:38.567
    STEP: Checking the number of control plane machines never goes above 4 replicas @ 02/10/26 10:28:38.567
    STEP: Checking that other indexes (not 1) do not have 2 replicas @ 02/10/26 10:28:38.568
    STEP: Checking that index 1 has 2 replicas @ 02/10/26 10:28:39.005
    STEP: Correct index is being replaced @ 02/10/26 10:28:39.228
    STEP: Index 1 replacement created @ 02/10/26 10:28:39.228
    STEP: Checking the replacement machine for index 1 @ 02/10/26 10:28:39.228
    STEP: Checking the replacement machine name @ 02/10/26 10:28:39.453
    STEP: Replacement machine name is "quaytest-15651-bv24h-master-r7fvx-1" @ 02/10/26 10:28:39.676
    STEP: Replacement machine name is correct @ 02/10/26 10:28:39.676
    STEP: Waiting for the new machine become Running @ 02/10/26 10:28:39.676
    STEP: Checking that the old machine is not deleted until the new machine is Running @ 02/10/26 10:28:39.676
    STEP: Replacement machine is Running @ 02/10/26 10:31:17.75
    STEP: Checking that the old machine is marked for deletion @ 02/10/26 10:31:17.75
    STEP: Updated replicas is now equal to desired replicas @ 02/10/26 10:31:17.9
    STEP: Waiting for the replicas to equal desired replicas @ 02/10/26 10:31:17.9
    STEP: Checking that the old machine is removed @ 02/10/26 10:31:17.975
    STEP: Rollout of index 1 complete @ 02/10/26 10:43:55.657
    STEP: Replacement for index 1 is complete @ 02/10/26 10:43:55.657
    STEP: Replicas is now equal to desired replicas @ 02/10/26 10:43:55.671
    STEP: Control plane machine rollout completed successfully @ 02/10/26 10:43:55.671
    STEP: Waiting for the cluster to stabilise after the rollout @ 02/10/26 10:43:55.671
    STEP: Waiting for the cluster operators to stabilise (minimum availability time: 2m0s, timeout: 32m0s, polling interval: 30s) @ 02/10/26 10:43:55.671
    STEP: Cluster stabilised after the rollout @ 02/10/26 10:59:40.522
  • [1863.716 seconds]
  ------------------------------

  Ran 1 of 1 Specs in 1863.723 seconds
  SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped

@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 9, 2026
@openshift-ci openshift-ci bot requested review from chrischdi and damdo February 9, 2026 14:08
openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

miyadav commented Feb 9, 2026

/hold as we reverted -
#384

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2026
@openshift-ci openshift-ci bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 10, 2026
sunzhaohua2 (Contributor, Author)

/test pull-ci-openshift-origin-main-e2e-aws-ovn-microshift

@sunzhaohua2 sunzhaohua2 marked this pull request as draft February 10, 2026 04:23
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026
@sunzhaohua2 sunzhaohua2 marked this pull request as ready for review February 10, 2026 05:12
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026
@sunzhaohua2 sunzhaohua2 changed the title Configure test timeout and parallelism for CPMS test suites OCPBUGS-74343: Fix OTE panic and configure test timeout and parallelism for test suites Feb 11, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 11, 2026
openshift-ci-robot

@sunzhaohua2: This pull request references Jira Issue OCPBUGS-74343, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

sunzhaohua2 (Contributor, Author)

/retest

openshift-ci bot commented Feb 28, 2026

@sunzhaohua2: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-etcd-scaling | b6ad8ee | link | true | /test e2e-aws-ovn-etcd-scaling

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
