Improves leader election so that we don't lose events during leadership changes #153

ronaknnathani · 2024-01-17T21:10:53Z

Details

Currently, when a replica loses its leadership, a new leader isn't elected until leaseDuration seconds. Here, that is 15s. The max time till we get a new leader is leaseDuration (15s) + retryPeriod (2s) = 17s.

This commit updates the shutdown process such that if the leader replica is sent a shutdown signal, it sleeps for leaseDuration seconds. This allows the leader replica to continue to export events until a new leader is elected. And a new leader is elected only if lease hasn't been renewed and leaseDuration expires.

In addition to this, other smaller changes include:

Leader election now uses the leases object instead of configMaps and leases.
The clusterRole is also updated to allow writing to the leases object.
Informer start/stop is synchronized via sync.WaitGroup so that we cleanly shut down the process after informer has stopped.

For use cases where no event loss is tolerable, users should use maxEventAgeSeconds to > 1.

Issues addressed

Event Flow Interruption During Restart #24

#34 is abandoned and this change takes care of what it was trying to do. Once this PR is shipped, #34 can be closed. I already discussed it with @xmcqueen and he's okay with me taking this forward.

Testing done

Tests

go test -cover -mod=mod -v ./...
?   	github.com/resmoio/kubernetes-event-exporter	[no test files]
=== RUN   TestSimpleWriter
--- PASS: TestSimpleWriter (0.00s)
=== RUN   TestCorrectnessManyTimes
--- PASS: TestCorrectnessManyTimes (0.07s)
=== RUN   TestLargerThanBatchSize
--- PASS: TestLargerThanBatchSize (0.00s)
=== RUN   TestSimpleInterval
--- PASS: TestSimpleInterval (0.06s)
=== RUN   TestIntervalComplex
--- PASS: TestIntervalComplex (0.06s)
=== RUN   TestIntervalComplexAfterFlush
--- PASS: TestIntervalComplexAfterFlush (0.06s)
=== RUN   TestRetry
--- PASS: TestRetry (0.20s)
PASS
	github.com/resmoio/kubernetes-event-exporter/pkg/batch	coverage: 100.0% of statements
ok  	github.com/resmoio/kubernetes-event-exporter/pkg/batch	0.740s	coverage: 100.0% of statements
?   	github.com/resmoio/kubernetes-event-exporter/pkg/metrics	[no test files]
?   	github.com/resmoio/kubernetes-event-exporter/pkg/version	[no test files]
=== RUN   Test_ParseConfig
--- PASS: Test_ParseConfig (0.00s)
=== RUN   TestValidate_IsCheckingMaxEventAgeSeconds_WhenNotSet
{"level":"info","time":"2024-01-17T15:48:01-05:00","message":"setting config.maxEventAgeSeconds=5 (default)"}
{"level":"warn","time":"2024-01-17T15:48:01-05:00","message":"metrics name prefix is empty, setting config.metricsNamePrefix='event_exporter_' is recommended"}
--- PASS: TestValidate_IsCheckingMaxEventAgeSeconds_WhenNotSet (0.00s)
=== RUN   TestValidate_IsCheckingMaxEventAgeSeconds_WhenThrottledPeriodSet
--- PASS: TestValidate_IsCheckingMaxEventAgeSeconds_WhenThrottledPeriodSet (0.00s)
=== RUN   TestValidate_IsCheckingMaxEventAgeSeconds_WhenMaxEventAgeSecondsSet
--- PASS: TestValidate_IsCheckingMaxEventAgeSeconds_WhenMaxEventAgeSecondsSet (0.00s)
=== RUN   TestValidate_IsCheckingMaxEventAgeSeconds_WhenMaxEventAgeSecondsAndThrottledPeriodSet
--- PASS: TestValidate_IsCheckingMaxEventAgeSeconds_WhenMaxEventAgeSecondsAndThrottledPeriodSet (0.00s)
=== RUN   TestValidate_MetricsNamePrefix_WhenEmpty
--- PASS: TestValidate_MetricsNamePrefix_WhenEmpty (0.00s)
=== RUN   TestValidate_MetricsNamePrefix_WhenValid
--- PASS: TestValidate_MetricsNamePrefix_WhenValid (0.00s)
=== RUN   TestValidate_MetricsNamePrefix_WhenInvalid
--- PASS: TestValidate_MetricsNamePrefix_WhenInvalid (0.00s)
=== RUN   TestSetDefaults
--- PASS: TestSetDefaults (0.00s)
=== RUN   TestEngineNoRoutes
--- PASS: TestEngineNoRoutes (0.00s)
=== RUN   TestEngineSimple
--- PASS: TestEngineSimple (0.00s)
=== RUN   TestEngineDropSimple
--- PASS: TestEngineDropSimple (0.00s)
=== RUN   TestEmptyRoute
--- PASS: TestEmptyRoute (0.00s)
=== RUN   TestBasicRoute
--- PASS: TestBasicRoute (0.00s)
=== RUN   TestDropRule
--- PASS: TestDropRule (0.00s)
=== RUN   TestSingleLevelMultipleMatchRoute
--- PASS: TestSingleLevelMultipleMatchRoute (0.00s)
=== RUN   TestSubRoute
--- PASS: TestSubRoute (0.00s)
=== RUN   TestSubSubRoute
--- PASS: TestSubSubRoute (0.00s)
=== RUN   TestSubSubRouteWithDrop
--- PASS: TestSubSubRouteWithDrop (0.00s)
=== RUN   Test_GHIssue51
--- PASS: Test_GHIssue51 (0.00s)
=== RUN   TestEmptyRule
--- PASS: TestEmptyRule (0.00s)
=== RUN   TestBasicRule
--- PASS: TestBasicRule (0.00s)
=== RUN   TestBasicNoMatchRule
--- PASS: TestBasicNoMatchRule (0.00s)
=== RUN   TestBasicRegexRule
--- PASS: TestBasicRegexRule (0.00s)
=== RUN   TestLabelRegexRule
--- PASS: TestLabelRegexRule (0.00s)
=== RUN   TestOneLabelMatchesRule
--- PASS: TestOneLabelMatchesRule (0.00s)
=== RUN   TestOneLabelDoesNotMatchRule
--- PASS: TestOneLabelDoesNotMatchRule (0.00s)
=== RUN   TestTwoLabelMatchesRule
--- PASS: TestTwoLabelMatchesRule (0.00s)
=== RUN   TestTwoLabelRequiredRule
--- PASS: TestTwoLabelRequiredRule (0.00s)
=== RUN   TestTwoLabelRequiredOneMissingRule
--- PASS: TestTwoLabelRequiredOneMissingRule (0.00s)
=== RUN   TestOneAnnotationMatchesRule
--- PASS: TestOneAnnotationMatchesRule (0.00s)
=== RUN   TestOneAnnotationDoesNotMatchRule
--- PASS: TestOneAnnotationDoesNotMatchRule (0.00s)
=== RUN   TestTwoAnnotationsMatchesRule
--- PASS: TestTwoAnnotationsMatchesRule (0.00s)
=== RUN   TestTwoAnnotationsRequiredOneMissingRule
--- PASS: TestTwoAnnotationsRequiredOneMissingRule (0.00s)
=== RUN   TestComplexRuleNoMatch
--- PASS: TestComplexRuleNoMatch (0.00s)
=== RUN   TestComplexRuleMatches
--- PASS: TestComplexRuleMatches (0.00s)
=== RUN   TestComplexRuleAnnotationsNoMatch
--- PASS: TestComplexRuleAnnotationsNoMatch (0.00s)
=== RUN   TestComplexRuleMatchesRegexp
--- PASS: TestComplexRuleMatchesRegexp (0.00s)
=== RUN   TestComplexRuleNoMatchRegexp
--- PASS: TestComplexRuleNoMatchRegexp (0.00s)
=== RUN   TestMessageRegexp
--- PASS: TestMessageRegexp (0.00s)
=== RUN   TestCount
--- PASS: TestCount (0.00s)
PASS
	github.com/resmoio/kubernetes-event-exporter/pkg/exporter	coverage: 68.9% of statements
ok  	github.com/resmoio/kubernetes-event-exporter/pkg/exporter	1.635s	coverage: 68.9% of statements
=== RUN   TestEnhancedEvent_DeDot
=== RUN   TestEnhancedEvent_DeDot/nothing
=== RUN   TestEnhancedEvent_DeDot/dedot
--- PASS: TestEnhancedEvent_DeDot (0.00s)
    --- PASS: TestEnhancedEvent_DeDot/nothing (0.00s)
    --- PASS: TestEnhancedEvent_DeDot/dedot (0.00s)
=== RUN   TestEnhancedEvent_DeDot_MustNotAlternateOriginal
--- PASS: TestEnhancedEvent_DeDot_MustNotAlternateOriginal (0.00s)
=== RUN   TestEventWatcher_EventAge_whenEventCreatedBeforeStartup
--- PASS: TestEventWatcher_EventAge_whenEventCreatedBeforeStartup (0.00s)
=== RUN   TestEventWatcher_EventAge_whenEventCreatedAfterStartupAndBeforeMaxAge
--- PASS: TestEventWatcher_EventAge_whenEventCreatedAfterStartupAndBeforeMaxAge (0.00s)
=== RUN   TestEventWatcher_EventAge_whenEventCreatedAfterStartupAndAfterMaxAge
--- PASS: TestEventWatcher_EventAge_whenEventCreatedAfterStartupAndAfterMaxAge (0.00s)
=== RUN   TestOnEvent_WithObjectMetadata
--- PASS: TestOnEvent_WithObjectMetadata (0.00s)
=== RUN   TestOnEvent_DeletedObjects
--- PASS: TestOnEvent_DeletedObjects (0.00s)
PASS
	github.com/resmoio/kubernetes-event-exporter/pkg/kube	coverage: 29.7% of statements
ok  	github.com/resmoio/kubernetes-event-exporter/pkg/kube	0.614s	coverage: 29.7% of statements
=== RUN   Test_ParseConfigFromBytes_ExampleConfigIsCorrect
--- PASS: Test_ParseConfigFromBytes_ExampleConfigIsCorrect (0.00s)
=== RUN   Test_ParseConfigFromBytes_NoErrors
--- PASS: Test_ParseConfigFromBytes_NoErrors (0.00s)
=== RUN   Test_ParseConfigFromBytes_ErrorWhenCurlyBracesNotEscaped
--- PASS: Test_ParseConfigFromBytes_ErrorWhenCurlyBracesNotEscaped (0.00s)
=== RUN   Test_ParseConfigFromBytes_OkWhenCurlyBracesEscaped
--- PASS: Test_ParseConfigFromBytes_OkWhenCurlyBracesEscaped (0.00s)
=== RUN   Test_ParseConfigFromBytes_ErrorErrorNotWithCurlyBraces
--- PASS: Test_ParseConfigFromBytes_ErrorErrorNotWithCurlyBraces (0.00s)
PASS
	github.com/resmoio/kubernetes-event-exporter/pkg/setup	coverage: 100.0% of statements
ok  	github.com/resmoio/kubernetes-event-exporter/pkg/setup	2.660s	coverage: 100.0% of statements
=== RUN   TestOpsCenterSink_Send
=== RUN   TestOpsCenterSink_Send/Simple_Create
=== RUN   TestOpsCenterSink_Send/Invalid_Priority:_Want_err
--- PASS: TestOpsCenterSink_Send (0.00s)
    --- PASS: TestOpsCenterSink_Send/Simple_Create (0.00s)
    --- PASS: TestOpsCenterSink_Send/Invalid_Priority:_Want_err (0.00s)
=== RUN   TestTeams_Send
--- PASS: TestTeams_Send (0.00s)
=== RUN   TestTeams_Send_WhenTeamsReturnsRateLimited
--- PASS: TestTeams_Send_WhenTeamsReturnsRateLimited (0.00s)
=== RUN   TestLayoutConvert
--- PASS: TestLayoutConvert (0.00s)
PASS
	github.com/resmoio/kubernetes-event-exporter/pkg/sinks	coverage: 13.4% of statements
ok  	github.com/resmoio/kubernetes-event-exporter/pkg/sinks	3.337s	coverage: 13.4% of statements

When leader election is disabled

No sleeping during shutdown

➜  ~ kube logs event-exporter-7ddc6ff9b-rbs6w -f
{"level":"info","time":"2024-01-17T20:53:00Z","message":"Reading config file /data/config.yaml"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"Starting with config: {LogLevel:info LogFormat:json ThrottlePeriod:0 MaxEventAgeSeconds:0 ClusterName: Namespace: LeaderElection:{Enabled:false LeaderElectionID:} Route:{Drop:[] Match:[] Routes:[{Drop:[] Match:[{Labels:map[] Annotations:map[] Message: APIVersion: Kind: Namespace: Reason: Type: MinCount:0 Component: Host: Receiver:dump}] Routes:[]}]} Receivers:[{Name:dump InMemory:<nil> Webhook:<nil> File:<nil> Syslog:<nil> Stdout:0x40004023c0 Elasticsearch:<nil> Kinesis:<nil> Firehose:<nil> OpenSearch:<nil> Opsgenie:<nil> Loki:<nil> SQS:<nil> SNS:<nil> Slack:<nil> Kafka:<nil> Pubsub:<nil> Opscenter:<nil> Teams:<nil> BigQuery:<nil> EventBridge:<nil> Pipe:<nil>}] KubeQPS:5 KubeBurst:10 MetricsNamePrefix:event_exporter_ OmitLookup:false CacheSize:1024}"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"setting config.maxEventAgeSeconds=5 (default)"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"config.metricsNamePrefix='event_exporter_'"}
{"level":"info","name":"dump","type":"*sinks.Stdout","time":"2024-01-17T20:53:00Z","message":"Registering sink"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"levelinfomsgListening onaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"levelinfomsgTLS is disabled.http2falseaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:53:00Z","message":"leader election disabled"}
{"metadata":{"name":"event-exporter-7ddc6ff9b-rbs6w.17ab3dc027e84cbd","namespace":"monitoring","uid":"63cea172-2877-4834-91d0-08a25b49bc5b","resourceVersion":"4094494","creationTimestamp":"2024-01-17T20:52:59Z"},"reason":"Scheduled","message":"Successfully assigned monitoring/event-exporter-7ddc6ff9b-rbs6w to kind-control-plane","source":{},"firstTimestamp":null,"lastTimestamp":null,"type":"Normal","eventTime":"2024-01-17T20:52:59.402379Z","action":"Binding","reportingComponent":"default-scheduler","reportingInstance":"default-scheduler-kind-control-plane","clusterName":"","involvedObject":{"kind":"Pod","namespace":"monitoring","name":"event-exporter-7ddc6ff9b-rbs6w","uid":"6ab948ac-9da6-42bc-9fa7-acebf96233e7","apiVersion":"v1","resourceVersion":"4094488","labels":{"app":"event-exporter","pod-template-hash":"7ddc6ff9b","version":"v1"},"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"2112","prometheus.io/scrape":"true"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"event-exporter-7ddc6ff9b","uid":"ef0e28c8-f041-4757-9b60-d1ee0ed8844a","controller":true,"blockOwnerDeletion":true}],"deleted":false}}
...
{"level":"info","time":"2024-01-17T20:53:44Z","message":"Received signal to exit. Stopping."}
{"level":"info","time":"2024-01-17T20:53:44Z","message":"Closing sinks"}
...
{"level":"info","sink":"dump","time":"2024-01-17T20:53:44Z","message":"Closing the sink"}
{"level":"info","sink":"dump","time":"2024-01-17T20:53:44Z","message":"Closed"}
{"level":"info","time":"2024-01-17T20:53:44Z","message":"All sinks closed"}

When leader election is enabled

We have 2 replicas

➜  ~ kube get po
NAME                             READY   STATUS    RESTARTS   AGE
event-exporter-7ddc6ff9b-fhcvq   1/1     Running   0          101s <------- LEADER
event-exporter-7ddc6ff9b-hkzcw   1/1     Running   0          27s

Non-leader instance stops right awat and doesn't wait for leaseDuration before stopping

✗ kube logs event-exporter-7ddc6ff9b-hkzcw -f

{"level":"info","time":"2024-01-17T20:58:03Z","message":"Reading config file /data/config.yaml"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"Starting with config: {LogLevel:info LogFormat:json ThrottlePeriod:0 MaxEventAgeSeconds:0 ClusterName: Namespace: LeaderElection:{Enabled:true LeaderElectionID:} Route:{Drop:[] Match:[] Routes:[{Drop:[] Match:[{Labels:map[] Annotations:map[] Message: APIVersion: Kind: Namespace: Reason: Type: MinCount:0 Component: Host: Receiver:dump}] Routes:[]}]} Receivers:[{Name:dump InMemory:<nil> Webhook:<nil> File:<nil> Syslog:<nil> Stdout:0x4000417900 Elasticsearch:<nil> Kinesis:<nil> Firehose:<nil> OpenSearch:<nil> Opsgenie:<nil> Loki:<nil> SQS:<nil> SNS:<nil> Slack:<nil> Kafka:<nil> Pubsub:<nil> Opscenter:<nil> Teams:<nil> BigQuery:<nil> EventBridge:<nil> Pipe:<nil>}] KubeQPS:5 KubeBurst:10 MetricsNamePrefix:event_exporter_ OmitLookup:false CacheSize:1024}"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"setting config.maxEventAgeSeconds=5 (default)"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"config.metricsNamePrefix='event_exporter_'"}
{"level":"info","name":"dump","type":"*sinks.Stdout","time":"2024-01-17T20:58:03Z","message":"Registering sink"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"levelinfomsgListening onaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"levelinfomsgTLS is disabled.http2falseaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:58:03Z","message":"leader election enabled"}
I0117 20:58:03.963426       1 leaderelection.go:248] attempting to acquire leader lease monitoring/kubernetes-event-exporter...
{"level":"info","time":"2024-01-17T20:58:03Z","message":"new leader observed: event-exporter-7ddc6ff9b-fhcvq_b624ce71-c9a4-4e01-9795-3b833c475ae3"}
{"level":"info","time":"2024-01-17T20:58:49Z","message":"Context was cancelled, stopping leader election loop"}
{"level":"info","time":"2024-01-17T20:58:49Z","message":"Received signal to exit. Stopping."}
{"level":"info","time":"2024-01-17T20:58:49Z","message":"Closing sinks"}
{"level":"info","sink":"dump","time":"2024-01-17T20:58:49Z","message":"Closing the sink"}
{"level":"info","sink":"dump","time":"2024-01-17T20:58:49Z","message":"Closed"}
{"level":"info","time":"2024-01-17T20:58:49Z","message":"All sinks closed"}

When the non-leader instance is deleted, we get another replica

➜  ~ kube get po
NAME                             READY   STATUS    RESTARTS   AGE
event-exporter-7ddc6ff9b-6vvgj   1/1     Running   0          93s
event-exporter-7ddc6ff9b-fhcvq   1/1     Running   0          3m33s <----- STILL THE LEADER

Leader instance waits for leaseDuration before stopping

➜  ~ kube logs event-exporter-7ddc6ff9b-fhcvq -f
{"level":"info","time":"2024-01-17T20:56:50Z","message":"Reading config file /data/config.yaml"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"Starting with config: {LogLevel:info LogFormat:json ThrottlePeriod:0 MaxEventAgeSeconds:0 ClusterName: Namespace: LeaderElection:{Enabled:true LeaderElectionID:} Route:{Drop:[] Match:[] Routes:[{Drop:[] Match:[{Labels:map[] Annotations:map[] Message: APIVersion: Kind: Namespace: Reason: Type: MinCount:0 Component: Host: Receiver:dump}] Routes:[]}]} Receivers:[{Name:dump InMemory:<nil> Webhook:<nil> File:<nil> Syslog:<nil> Stdout:0x4000497ce0 Elasticsearch:<nil> Kinesis:<nil> Firehose:<nil> OpenSearch:<nil> Opsgenie:<nil> Loki:<nil> SQS:<nil> SNS:<nil> Slack:<nil> Kafka:<nil> Pubsub:<nil> Opscenter:<nil> Teams:<nil> BigQuery:<nil> EventBridge:<nil> Pipe:<nil>}] KubeQPS:5 KubeBurst:10 MetricsNamePrefix:event_exporter_ OmitLookup:false CacheSize:1024}"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"setting config.maxEventAgeSeconds=5 (default)"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"config.metricsNamePrefix='event_exporter_'"}
{"level":"info","name":"dump","type":"*sinks.Stdout","time":"2024-01-17T20:56:50Z","message":"Registering sink"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"levelinfomsgListening onaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"levelinfomsgTLS is disabled.http2falseaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:56:50Z","message":"leader election enabled"}
I0117 20:56:50.138717       1 leaderelection.go:248] attempting to acquire leader lease monitoring/kubernetes-event-exporter...
{"level":"info","time":"2024-01-17T20:56:50Z","message":"new leader observed: event-exporter-7ddc6ff9b-qdfqk_c4e05734-4456-4ab0-b310-3551e4557e9d"}
I0117 20:57:05.385050       1 leaderelection.go:258] successfully acquired lease monitoring/kubernetes-event-exporter
{"level":"info","time":"2024-01-17T20:57:05Z","message":"new leader observed: event-exporter-7ddc6ff9b-fhcvq_b624ce71-c9a4-4e01-9795-3b833c475ae3"}
{"level":"info","time":"2024-01-17T20:57:05Z","message":"leader election won"}
{"metadata":{"name":"event-exporter.17ab3e06eafa3243","namespace":"monitoring","uid":"16ee7e87-d582-4454-943c-e0657c90786f","resourceVersion":"4094980","creationTimestamp":"2024-01-17T20:58:03Z"},"reason":"ScalingReplicaSet","message":"Scaled up replica set event-exporter-7ddc6ff9b to 2 from 1","source":{"component":"deployment-controller"},"firstTimestamp":"2024-01-17T20:58:03Z","lastTimestamp":"2024-01-17T20:58:03Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"deployment-controller","reportingInstance":"","clusterName":"","involvedObject":{"kind":"Deployment","namespace":"monitoring","name":"event-exporter","uid":"6357931f-4225-47cd-904f-c22f4ce524bf","apiVersion":"apps/v1","resourceVersion":"4094977","annotations":{"deployment.kubernetes.io/revision":"1","kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"apps/v1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"name\":\"event-exporter\",\"namespace\":\"monitoring\"},\"spec\":{\"replicas\":1,\"selector\":{\"matchLabels\":{\"app\":\"event-exporter\",\"version\":\"v1\"}},\"strategy\":{\"rollingUpdate\":{\"maxSurge\":0,\"maxUnavailable\":1},\"type\":\"RollingUpdate\"},\"template\":{\"metadata\":{\"annotations\":{\"prometheus.io/path\":\"/metrics\",\"prometheus.io/port\":\"2112\",\"prometheus.io/scrape\":\"true\"},\"labels\":{\"app\":\"event-exporter\",\"version\":\"v1\"}},\"spec\":{\"containers\":[{\"args\":[\"-conf=/data/config.yaml\"],\"image\":\"kubernetes-event-exporter:local\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"event-exporter\",\"securityContext\":{\"allowPrivilegeEscalation\":false,\"capabilities\":{\"drop\":[\"ALL\"]}},\"volumeMounts\":[{\"mountPath\":\"/data\",\"name\":\"cfg\"}]}],\"securityContext\":{\"runAsNonRoot\":true,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}},\"serviceAccountName\":\"event-exporter\",\"volumes\":[{\"configMap\":{\"name\":\"event-exporter-cfg\"},\"name\":\"cfg\"}]}}}}\n"},"deleted":false}}
...
{"level":"info","time":"2024-01-17T21:00:37Z","message":"Context was cancelled, stopping leader election loop"}
{"level":"info","time":"2024-01-17T21:00:37Z","message":"waiting leaseDuration seconds before stopping: 15s"}
...
{"metadata":{"name":"event-exporter-7ddc6ff9b-s6zfg.17ab3e2b101bc05a","namespace":"monitoring","uid":"e054b91c-92ca-4331-8b61-27f26ea503c4","resourceVersion":"4095306","creationTimestamp":"2024-01-17T21:00:38Z"},"reason":"Started","message":"Started container event-exporter","source":{"component":"kubelet","host":"kind-control-plane"},"firstTimestamp":"2024-01-17T21:00:38Z","lastTimestamp":"2024-01-17T21:00:38Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"kubelet","reportingInstance":"kind-control-plane","clusterName":"","involvedObject":{"kind":"Pod","namespace":"monitoring","name":"event-exporter-7ddc6ff9b-s6zfg","uid":"fe1e9f75-3118-43cf-b48c-1ac347152a6b","apiVersion":"v1","resourceVersion":"4095296","fieldPath":"spec.containers{event-exporter}","labels":{"app":"event-exporter","pod-template-hash":"7ddc6ff9b","version":"v1"},"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"2112","prometheus.io/scrape":"true"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"event-exporter-7ddc6ff9b","uid":"dd51869d-b43a-43ad-b1ac-8821572496b6","controller":true,"blockOwnerDeletion":true}],"deleted":false}}
...
{"level":"info","time":"2024-01-17T21:00:52Z","message":"Received signal to exit. Stopping."}
{"level":"info","time":"2024-01-17T21:00:52Z","message":"Closing sinks"}
{"level":"info","sink":"dump","time":"2024-01-17T21:00:52Z","message":"Closing the sink"}
{"level":"info","sink":"dump","time":"2024-01-17T21:00:52Z","message":"Closed"}
{"level":"info","time":"2024-01-17T21:00:52Z","message":"All sinks closed"}

The other pod becomes the new leader
When the leader replica, event-exporter-7ddc6ff9b-fhcvq, is deleted, the other pod becomes the leader.

✗ kube logs event-exporter-7ddc6ff9b-6vvgj -f

{"level":"info","time":"2024-01-17T20:58:50Z","message":"Reading config file /data/config.yaml"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"Starting with config: {LogLevel:info LogFormat:json ThrottlePeriod:0 MaxEventAgeSeconds:0 ClusterName: Namespace: LeaderElection:{Enabled:true LeaderElectionID:} Route:{Drop:[] Match:[] Routes:[{Drop:[] Match:[{Labels:map[] Annotations:map[] Message: APIVersion: Kind: Namespace: Reason: Type: MinCount:0 Component: Host: Receiver:dump}] Routes:[]}]} Receivers:[{Name:dump InMemory:<nil> Webhook:<nil> File:<nil> Syslog:<nil> Stdout:0x4000368920 Elasticsearch:<nil> Kinesis:<nil> Firehose:<nil> OpenSearch:<nil> Opsgenie:<nil> Loki:<nil> SQS:<nil> SNS:<nil> Slack:<nil> Kafka:<nil> Pubsub:<nil> Opscenter:<nil> Teams:<nil> BigQuery:<nil> EventBridge:<nil> Pipe:<nil>}] KubeQPS:5 KubeBurst:10 MetricsNamePrefix:event_exporter_ OmitLookup:false CacheSize:1024}"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"setting config.maxEventAgeSeconds=5 (default)"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"config.metricsNamePrefix='event_exporter_'"}
{"level":"info","name":"dump","type":"*sinks.Stdout","time":"2024-01-17T20:58:50Z","message":"Registering sink"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"levelinfomsgListening onaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"levelinfomsgTLS is disabled.http2falseaddress[::]:2112"}
{"level":"info","time":"2024-01-17T20:58:50Z","message":"leader election enabled"}
I0117 20:58:50.372635       1 leaderelection.go:248] attempting to acquire leader lease monitoring/kubernetes-event-exporter...
{"level":"info","time":"2024-01-17T20:58:50Z","message":"new leader observed: event-exporter-7ddc6ff9b-fhcvq_b624ce71-c9a4-4e01-9795-3b833c475ae3"}
I0117 21:00:53.972282       1 leaderelection.go:258] successfully acquired lease monitoring/kubernetes-event-exporter
{"level":"info","time":"2024-01-17T21:00:53Z","message":"new leader observed: event-exporter-7ddc6ff9b-6vvgj_e3d6ae20-1980-4218-b797-537ac942e5ea"}
{"level":"info","time":"2024-01-17T21:00:53Z","message":"leader election won"}
{"level":"warn","event age":"16.978700554s","event namespace":"monitoring","event name":"event-exporter-7ddc6ff9b-fhcvq.17ab3e2ae97a4b57","time":"2024-01-17T21:00:53Z","message":"Event discarded as being older then maxEventAgeSeconds"}

Leadership transition
It can be seen in the logs that the old leader pod shuts down at 2024-01-17T21:00:52Z and the other pod becomes leader at 2024-01-17T21:00:53Z.

Currently, when a replica loses its leadership, a new leader isn't elected until leaseDuration seconds. Here, that is 15s. The max time till we get a new leader is leaseDuration (15s) + retryPeriod (2s) = 17s. This commit updates the shutdown process such that if the leader replica is sent a shutdown signal, it sleeps for leaseDuration seconds. This allows the leader replica to continue to export events until a new leader is elected. And a new leader is elected only if lease hasn't been renewed and leaseDuration expires. In addition to this, leader election now uses the leases object instead of configMaps and leases. The clusterRole is also updated to allow writing to the leases object. For use cases where no event loss is tolerable, users should use maxEventAgeSeconds to > 1.

ronaknnathani · 2024-01-17T21:13:59Z

@mustafaakin Can I get a review for this?

#34 is abandoned and this change takes care of what it was trying to do. Once this PR is shipped, #34 can be closed. I already discussed it with @xmcqueen and he's okay with me taking this forward.

Let me know if you have any questions.

Thanks!

ronaknnathani · 2024-01-23T02:47:06Z

@mustafaakin gentle ping. any thoughts on this? if you're busy, is there someone else on the team who can review this?

mustafaakin · 2024-02-23T05:25:51Z

Thanks a lot for a detailed PR. I wanted to switch to Lease at some point and there are issues where events are missed so I think it would resolve at least a few of them just due to lag in leader election.

This was referenced Jan 17, 2024

added strict caching and leader replacement that minimizes interruption to event stream #34

Open

Event Flow Interruption During Restart #24

Open

mustafaakin approved these changes Feb 23, 2024

View reviewed changes

mustafaakin merged commit d7ba03a into resmoio:master Feb 23, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improves leader election so that we don't lose events during leadership changes #153

Improves leader election so that we don't lose events during leadership changes #153

ronaknnathani commented Jan 17, 2024 •

edited

Loading

ronaknnathani commented Jan 17, 2024

ronaknnathani commented Jan 23, 2024

mustafaakin commented Feb 23, 2024

Improves leader election so that we don't lose events during leadership changes #153

Improves leader election so that we don't lose events during leadership changes #153

Conversation

ronaknnathani commented Jan 17, 2024 • edited Loading

Details

Issues addressed

Testing done

ronaknnathani commented Jan 17, 2024

ronaknnathani commented Jan 23, 2024

mustafaakin commented Feb 23, 2024

ronaknnathani commented Jan 17, 2024 •

edited

Loading