Description
Opened on Jan 25, 2023
In TestGlobalCA, we expect the operator to restart when the CA entry is updated in the operator-config ConfigMap. Sometimes, the operator Pod is never restarted.
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-resilience/680
- https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/59
--- FAIL: TestGlobalCA/assert_operator_has_been_restarted_once_more
Received unexpected error: operator restart count was 0 but expected at least 2
What's going on? Let's look at the full log from eck-diagnostics:
{"Time":"2023-01-14T13:33:39.264Z","Test":"TestGlobalCA/reset_operator_to_use_self-signed_certificates_per_resource"}
{"Time":"2023-01-14T13:33:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more"}
{"Time":"2023-01-14T13:33:53.181Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:33:58.160Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:34:18.168Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:42:53.179Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:42:58.166Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:43:13.169Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:51:53.177Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:51:58.161Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:52:13.214Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:54:53.171Z","log.logger":"chaos","message":"Change operator replicas","sts_name":"e2e-79zbd-operator","current_replicas":1,"new_replicas":3}
{"Time":"2023-01-14T13:54:58.170Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1"]}
{"Time":"2023-01-14T13:55:03.178Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:00:53.182Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T14:00:58.169Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:01:13.175Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-1"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:03:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more","Output":"Retries (30m0s timeout): .."}
The resilience pipeline is special in that it runs the full e2e test suite with a ChaosJob executing in the background, randomly deleting Pods throughout the run.
I think the flakiness here comes from unlucky timing: the ChaosJob deletes an operator Pod just before, or shortly after, it restarts. If the Pod is recreated before the restart, it no longer needs to restart; if it is recreated after, the restart counter is reset to zero. In both cases, the assertion "operator has been restarted once more" fails.
Remediation:
- disable this specific test when running in the resilience pipeline
- change how we assert that the operator Pod is new, whether restarted or recreated
The simplest option is to disable the test. We have the pipeline name in the test context, so it should be as simple as:
diff --git a/test/e2e/global_ca_test.go b/test/e2e/global_ca_test.go
index cbf9373605..ea343736b4 100644
--- a/test/e2e/global_ca_test.go
+++ b/test/e2e/global_ca_test.go
@@ -32,6 +32,14 @@ import (
)
func TestGlobalCA(t *testing.T) {
+
+ // Skip if running in the resilience pipeline: the ChaosJob can prevent
+ // assert_operator_has_been_restarted_once_more from passing when it deletes
+ // an operator Pod right as it restarts.
+ if test.Ctx().Pipeline == "e2e/resilience" {
+ t.Skip()
+ }
+
k := test.NewK8sClientOrFatal()
name := "global-ca"
es := elasticsearch.NewBuilder(name).