Fix intermittent test: Leader should not be able to elect itself #286

brandur · 2024-03-23T06:22:46Z

This one's aimed at fixing a rare intermittent test observed here [1]:

--- FAIL: TestElector_WithNotifier (0.00s)
    --- FAIL: TestElector_WithNotifier/LosesLeadership (0.21s)
        logger.go:225: time=2024-03-23T05:34:33.983Z level=INFO msg="Notifier: Listener connecting"
        elector_test.go:228: Starting test-client-id
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Elector: Listening for leadership changes" client_id=test-client-id topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.985Z level=INFO msg="Elector: Attempting to gain leadership" client_id=test-client-id
        logger.go:225: time=2024-03-23T05:34:34.055Z level=INFO msg="Elector: Gained leadership" client_id=test-client-id
        elector_test.go:232: Force resigning test-client-id
        logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Received notification from notifier" action=resigned client_id=test-client-id
        logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Current leader attempting reelect" client_id=test-client-id
        elector_test.go:243:
                Error Trace:	/home/runner/work/river/river/internal/riverinternaltest/testfactory/test_factory.go:91
                                                        /home/runner/work/river/river/internal/leadership/elector_test.go:243
                Error:      	Received unexpected error:
                                ERROR: duplicate key value violates unique constraint "river_leader_pkey" (SQLSTATE 23505)
                Test:       	TestElector_WithNotifier/LosesLeadership

This test case tries to confirm that an elector will lose leadership by
force resigning it then inserting a leader record representing another
client. In this case it failed because by the time we tried to insert
our new leader record, the elector that we'd forcefully resigned had
already re-elected itself, thereby causing the unique constraint
violation.

This happens because when a resignation notification is sent, it's
actually received by all clients including the one that just resigned,
and although this resignation usually gets ignored because we're busy
shutting down, that's not necessarily the case and it's possible for a
client to reelect itself.

Here, correct the problem by ignoring resignation notifications from the
same client ID. Presumably if a client just resigned it's for a reason,
and we should immediately try to reelect. In case of a degenerate case
where a client resigns, no other clients are present to be elected
leader, and the client did not intend to shut down (should never, ever
happen), the client can still reelect itself after it's poll timeout
five seconds later, so even in the worst case scenario, we're still
protected.

Add a little more test coverage on leadership change function.

[1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307

brandur · 2024-03-23T06:29:07Z

internal/leadership/elector.go

+	}
+
+	// If this a resignation from _this_ client, ignore the change.
+	if notification.LeaderID == e.clientID {


There's a little noise in this PR because I extracted this function so I could test it more easily, but this is the relevant clause that was added here. I also added the early context check just below, but everything else just moves around and was preexisting.

bgentry

Nice fix!

This one's aimed at fixing a rare intermittent test observed here [1]: --- FAIL: TestElector_WithNotifier (0.00s) --- FAIL: TestElector_WithNotifier/LosesLeadership (0.21s) logger.go:225: time=2024-03-23T05:34:33.983Z level=INFO msg="Notifier: Listener connecting" elector_test.go:228: Starting test-client-id logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Elector: Listening for leadership changes" client_id=test-client-id topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership logger.go:225: time=2024-03-23T05:34:33.985Z level=INFO msg="Elector: Attempting to gain leadership" client_id=test-client-id logger.go:225: time=2024-03-23T05:34:34.055Z level=INFO msg="Elector: Gained leadership" client_id=test-client-id elector_test.go:232: Force resigning test-client-id logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Received notification from notifier" action=resigned client_id=test-client-id logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Current leader attempting reelect" client_id=test-client-id elector_test.go:243: Error Trace: /home/runner/work/river/river/internal/riverinternaltest/testfactory/test_factory.go:91 /home/runner/work/river/river/internal/leadership/elector_test.go:243 Error: Received unexpected error: ERROR: duplicate key value violates unique constraint "river_leader_pkey" (SQLSTATE 23505) Test: TestElector_WithNotifier/LosesLeadership This test case tries to confirm that an elector will lose leadership by force resigning it then inserting a leader record representing another client. In this case it failed because by the time we tried to insert our new leader record, the elector that we'd forcefully resigned had already re-elected itself, thereby causing the unique constraint violation. This happens because when a resignation notification is sent, it's actually received by all clients including the one that just resigned, and although this resignation usually gets ignored because we're busy shutting down, that's not necessarily the case and it's possible for a client to reelect itself. Here, correct the problem by ignoring resignation notifications from the same client ID. Presumably if a client just resigned it's for a reason, and we should immediately try to reelect. In case of a degenerate case where a client resigns, no other clients are present to be elected leader, and the client did not intend to shut down (should never, ever happen), the client can still reelect itself after it's poll timeout five seconds later, so even in the worst case scenario, we're still protected. Add a little more test coverage on leadership change function. [1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307

brandur · 2024-03-23T17:47:43Z

ty.

brandur force-pushed the brandur-leader-cannot-reelect-self branch from 1ee0e9d to 3686fd9 Compare March 23, 2024 06:28

brandur commented Mar 23, 2024

View reviewed changes

brandur requested a review from bgentry March 23, 2024 06:31

bgentry approved these changes Mar 23, 2024

View reviewed changes

brandur force-pushed the brandur-leader-cannot-reelect-self branch from 3686fd9 to 29de83a Compare March 23, 2024 17:44

brandur merged commit 3dae0f0 into master Mar 23, 2024
10 checks passed

brandur deleted the brandur-leader-cannot-reelect-self branch March 23, 2024 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix intermittent test: Leader should not be able to elect itself #286

Fix intermittent test: Leader should not be able to elect itself #286

brandur commented Mar 23, 2024

brandur Mar 23, 2024

bgentry left a comment

brandur commented Mar 23, 2024

Fix intermittent test: Leader should not be able to elect itself #286

Fix intermittent test: Leader should not be able to elect itself #286

Conversation

brandur commented Mar 23, 2024

brandur Mar 23, 2024

Choose a reason for hiding this comment

bgentry left a comment

Choose a reason for hiding this comment

brandur commented Mar 23, 2024