-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix intermittent test: Leader should not be able to elect itself #286
Conversation
1ee0e9d
to
3686fd9
Compare
} | ||
|
||
// If this a resignation from _this_ client, ignore the change. | ||
if notification.LeaderID == e.clientID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a little noise in this PR because I extracted this function so I could test it more easily, but this is the relevant clause that was added here. I also added the early context check just below, but everything else just moves around and was preexisting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix!
This one's aimed at fixing a rare intermittent test observed here [1]: --- FAIL: TestElector_WithNotifier (0.00s) --- FAIL: TestElector_WithNotifier/LosesLeadership (0.21s) logger.go:225: time=2024-03-23T05:34:33.983Z level=INFO msg="Notifier: Listener connecting" elector_test.go:228: Starting test-client-id logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Elector: Listening for leadership changes" client_id=test-client-id topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.985Z level=INFO msg="Elector: Attempting to gain leadership" client_id=test-client-id logger.go:225: time=2024-03-23T05:34:34.055Z level=INFO msg="Elector: Gained leadership" client_id=test-client-id elector_test.go:232: Force resigning test-client-id logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Received notification from notifier" action=resigned client_id=test-client-id logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Current leader attempting reelect" client_id=test-client-id elector_test.go:243: Error Trace: /home/runner/work/river/river/internal/riverinternaltest/testfactory/test_factory.go:91 /home/runner/work/river/river/internal/leadership/elector_test.go:243 Error: Received unexpected error: ERROR: duplicate key value violates unique constraint "river_leader_pkey" (SQLSTATE 23505) Test: TestElector_WithNotifier/LosesLeadership This test case tries to confirm that an elector will lose leadership by force resigning it then inserting a leader record representing another client. In this case it failed because by the time we tried to insert our new leader record, the elector that we'd forcefully resigned had already re-elected itself, thereby causing the unique constraint violation. This happens because when a resignation notification is sent, it's actually received by all clients including the one that just resigned, and although this resignation usually gets ignored because we're busy shutting down, that's not necessarily the case and it's possible for a client to reelect itself. Here, correct the problem by ignoring resignation notifications from the same client ID. Presumably if a client just resigned it's for a reason, and we should immediately try to reelect. In case of a degenerate case where a client resigns, no other clients are present to be elected leader, and the client did not intend to shut down (should never, ever happen), the client can still reelect itself after it's poll timeout five seconds later, so even in the worst case scenario, we're still protected. Add a little more test coverage on leadership change function. [1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307
3686fd9
to
29de83a
Compare
ty. |
This one's aimed at fixing a rare intermittent test observed here [1]:
This test case tries to confirm that an elector will lose leadership by
force resigning it then inserting a leader record representing another
client. In this case it failed because by the time we tried to insert
our new leader record, the elector that we'd forcefully resigned had
already re-elected itself, thereby causing the unique constraint
violation.
This happens because when a resignation notification is sent, it's
actually received by all clients including the one that just resigned,
and although this resignation usually gets ignored because we're busy
shutting down, that's not necessarily the case and it's possible for a
client to reelect itself.
Here, correct the problem by ignoring resignation notifications from the
same client ID. Presumably if a client just resigned it's for a reason,
and we should immediately try to reelect. In case of a degenerate case
where a client resigns, no other clients are present to be elected
leader, and the client did not intend to shut down (should never, ever
happen), the client can still reelect itself after it's poll timeout
five seconds later, so even in the worst case scenario, we're still
protected.
Add a little more test coverage on leadership change function.
[1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307