
TestGlobalCA/assert_operator_has_been_restarted_once_more flaky in resilience pipeline #6356

Open

Description

In TestGlobalCA, we expect the operator to restart when the CA entry is updated in the operator-config ConfigMap. Sometimes, the operator Pod is never restarted.

--- FAIL: TestGlobalCA/assert_operator_has_been_restarted_once_more

Received unexpected error: operator restart count was 0 but expected at least 2
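
For context, that restart counter is presumably derived from the Pod status: the kubelet tracks a per-container RestartCount on the Pod object, so the value only survives as long as the Pod itself does. A minimal sketch of such a counter (a hypothetical helper, not the actual e2e test code):

package e2e

import (
	corev1 "k8s.io/api/core/v1"
)

// operatorRestartCount sums the kubelet-maintained restart counters of all
// containers in the given operator Pod. Because the counter is stored in the
// Pod status, a Pod that is deleted and recreated starts again at zero.
func operatorRestartCount(pod corev1.Pod) int32 {
	var count int32
	for _, status := range pod.Status.ContainerStatuses {
		count += status.RestartCount
	}
	return count
}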

What's going on? Let's look at the full log from the eck-diagnostics output:

{"Time":"2023-01-14T13:33:39.264Z","Test":"TestGlobalCA/reset_operator_to_use_self-signed_certificates_per_resource"}
{"Time":"2023-01-14T13:33:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more"}
{"Time":"2023-01-14T13:33:53.181Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:33:58.160Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:34:18.168Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:42:53.179Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:42:58.166Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:43:13.169Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:51:53.177Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:51:58.161Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:52:13.214Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:54:53.171Z","log.logger":"chaos","message":"Change operator replicas","sts_name":"e2e-79zbd-operator","current_replicas":1,"new_replicas":3}
{"Time":"2023-01-14T13:54:58.170Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1"]}
{"Time":"2023-01-14T13:55:03.178Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:00:53.182Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T14:00:58.169Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:01:13.175Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-1"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:03:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more","Output":"Retries (30m0s timeout): .."}

The resilience pipeline is special in that it runs the full e2e test suite with a ChaosJob in the background, which randomly deletes Pods throughout the run.

I think the flakiness here comes from unlucky timing: the ChaosJob deletes the operator Pod just before, or shortly after, the expected restart. If the Pod is recreated before the restart, it no longer needs to restart; if it is recreated after, the restart counter is reset to zero. In both cases the assertion "operator has been restarted once more" fails.

Remediation:

  • disable this specific test for this specific pipeline
  • change how we assert that the operator Pod is new, whether it was restarted or recreated (see the sketch after the diff below)

The simplest option seems to be disabling the test. We have the pipeline name in the test context, so it should be as simple as:

diff --git a/test/e2e/global_ca_test.go b/test/e2e/global_ca_test.go
index cbf9373605..ea343736b4 100644
--- a/test/e2e/global_ca_test.go
+++ b/test/e2e/global_ca_test.go
@@ -32,6 +32,14 @@ import (
 )

 func TestGlobalCA(t *testing.T) {
+
+       // Skip if this is the resilience pipeline: the ChaosJob can prevent
+       // assert_operator_has_been_restarted_once_more from passing when it deletes
+       // an operator Pod right around the time it restarts.
+       if test.Ctx().Pipeline == "e2e/resilience" {
+               t.Skip()
+       }
+
        k := test.NewK8sClientOrFatal()
        name := "global-ca"
        es := elasticsearch.NewBuilder(name).
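
For reference, the second option could assert on the Pod's identity rather than on the restart counter. The sketch below uses hypothetical helpers (not existing test framework code): capture the Pod UID and the latest container start time before updating the ConfigMap, then accept either a container restart (later start time) or a Pod recreation (different UID) as evidence that the operator came back up:

package e2e

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// podIdentity captures what is needed to detect "the operator Pod is new".
type podIdentity struct {
	uid       types.UID
	startedAt time.Time
}

// identityOf records the Pod UID and the most recent container start time.
func identityOf(pod corev1.Pod) podIdentity {
	id := podIdentity{uid: pod.UID}
	for _, s := range pod.Status.ContainerStatuses {
		if s.State.Running != nil && s.State.Running.StartedAt.Time.After(id.startedAt) {
			id.startedAt = s.State.Running.StartedAt.Time
		}
	}
	return id
}

// isNewOperatorPod returns true if the Pod was recreated or its container was
// restarted since the reference identity was captured.
func isNewOperatorPod(before podIdentity, current corev1.Pod) bool {
	now := identityOf(current)
	return now.uid != before.uid || now.startedAt.After(before.startedAt)
}

This would make the assertion resilient to the ChaosJob deleting the Pod at an unlucky moment, at the cost of a slightly more involved check.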
Labels: >bug (Something isn't working), >test (Related to unit/integration/e2e tests)
