Description
Opened on Jan 25, 2023
In TestGlobalCA, we expect the operator to restart when the CA entry is updated in the operator-config ConfigMap. Sometimes, the operator Pod is never restarted.
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-resilience/680
- https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/59
--- FAIL: TestGlobalCA/assert_operator_has_been_restarted_once_more
Received unexpected error: operator restart count was 0 but expected at least 2
What's going on? Let's look at the full log from eck-diagnostics:
{"Time":"2023-01-14T13:33:39.264Z","Test":"TestGlobalCA/reset_operator_to_use_self-signed_certificates_per_resource"}
{"Time":"2023-01-14T13:33:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more"}
{"Time":"2023-01-14T13:33:53.181Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:33:58.160Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:34:18.168Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:42:53.179Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:42:58.166Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:43:13.169Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:51:53.177Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T13:51:58.161Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:52:13.214Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0"]}
{"Time":"2023-01-14T13:54:53.171Z","log.logger":"chaos","message":"Change operator replicas","sts_name":"e2e-79zbd-operator","current_replicas":1,"new_replicas":3}
{"Time":"2023-01-14T13:54:58.170Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1"]}
{"Time":"2023-01-14T13:55:03.178Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-0"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:00:53.182Z","log.logger":"chaos","message":"Deleting operator","pod_name":"e2e-79zbd-operator-0"}
{"Time":"2023-01-14T14:00:58.169Z","log.logger":"chaos","message":"Elected operator","elected":[],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:01:13.175Z","log.logger":"chaos","message":"Elected operator","elected":["e2e-79zbd-operator-1"],"all":["e2e-79zbd-operator-0","e2e-79zbd-operator-1","e2e-79zbd-operator-2"]}
{"Time":"2023-01-14T14:03:39.280Z","Test":"TestGlobalCA/assert_operator_has_been_restarted_once_more","Output":"Retries (30m0s timeout): .."}
The resilience pipeline is special in that it runs the full e2e test suite with a ChaosJob executing in the background, randomly deleting Pods throughout the run.
I think the flakiness here comes from unlucky timing: the ChaosJob deletes an operator Pod just before, or shortly after, it restarts. If the Pod is recreated before the restart, it no longer needs to restart; if it is recreated after, the restart counter is reset to zero. In both cases, the assertion "operator has been restarted once more" fails.
Remediation:
- disable this specific test when running in the resilience pipeline
- change how we assert that the operator Pod is new, whether restarted or recreated
The simplest option is to disable the test. We have the pipeline name in the test context, so it should be as simple as:
diff --git a/test/e2e/global_ca_test.go b/test/e2e/global_ca_test.go
index cbf9373605..ea343736b4 100644
--- a/test/e2e/global_ca_test.go
+++ b/test/e2e/global_ca_test.go
@@ -32,6 +32,14 @@ import (
)
func TestGlobalCA(t *testing.T) {
+
+ // Skip if running in the resilience pipeline: the ChaosJob can prevent
+ // assert_operator_has_been_restarted_once_more from passing when it deletes
+ // an operator Pod right as it restarts.
+ if test.Ctx().Pipeline == "e2e/resilience" {
+ t.Skip()
+ }
+
k := test.NewK8sClientOrFatal()
name := "global-ca"
es := elasticsearch.NewBuilder(name).