Improves leader election so that we don't lose events during leadership changes #153
Details
Currently, when a replica loses its leadership, a new leader isn't elected until leaseDuration has elapsed. Here, that is 15s. The maximum time until we get a new leader is leaseDuration (15s) + retryPeriod (2s) = 17s.
This commit updates the shutdown process so that if the leader replica receives a shutdown signal, it sleeps for leaseDuration before exiting. This lets the leader replica keep exporting events until a new leader is elected. A new leader is elected only once the lease has gone unrenewed and leaseDuration has expired.
In addition to this, other smaller changes include:
sync.WaitGroup is used so that we cleanly shut down the process after the informer has stopped.
For use cases where no event loss is tolerable, users should set maxEventAgeSeconds to > 1.
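As a minimal sketch of the sync.WaitGroup pattern mentioned above: runInformer below is a hypothetical stand-in for the real informer loop, and shutdownCleanly is an illustrative name, neither taken from this PR. The point is only that the caller blocks on Wait until the informer goroutine has returned.

```go
package main

import (
	"fmt"
	"sync"
)

// runInformer is a hypothetical stand-in for the informer loop. It blocks
// until stop is closed, runs onStop, and then marks itself done on wg.
func runInformer(stop <-chan struct{}, wg *sync.WaitGroup, onStop func()) {
	defer wg.Done()
	<-stop
	onStop()
}

// shutdownCleanly starts the stand-in informer, signals it to stop, and
// waits for it to finish. It reports whether the informer had stopped by
// the time Wait returned.
func shutdownCleanly() bool {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	stopped := false

	wg.Add(1)
	go runInformer(stop, &wg, func() { stopped = true })

	close(stop) // signal shutdown
	wg.Wait()   // block until the informer goroutine has returned
	return stopped
}

func main() {
	fmt.Println("informer stopped before exit:", shutdownCleanly())
}
```

Because wg.Done runs only after onStop, the write to stopped is visible to the caller once wg.Wait returns.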
Issues addressed
#34 is abandoned and this change takes care of what it was trying to do. Once this PR is shipped, #34 can be closed. I already discussed it with @xmcqueen and he's okay with me taking this forward.
Testing done
Tests
When leader election is disabled
No sleeping during shutdown
When leader election is enabled
We have 2 replicas
Non-leader instance stops right away and doesn't wait for leaseDuration before stopping
When the non-leader instance is deleted, we get another replica
Leader instance waits for leaseDuration before stopping
The other pod becomes the new leader
When the leader replica, event-exporter-7ddc6ff9b-fhcvq, is deleted, the other pod becomes the leader.
Leadership transition
It can be seen in the logs that the old leader pod shuts down at 2024-01-17T21:00:52Z and the other pod becomes leader at 2024-01-17T21:00:53Z.