Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent test: Leader should not be able to elect itself #286

Merged
merged 1 commit into from
Mar 23, 2024

Conversation

brandur
Copy link
Contributor

@brandur brandur commented Mar 23, 2024

This one's aimed at fixing a rare intermittent test observed here [1]:

--- FAIL: TestElector_WithNotifier (0.00s)
    --- FAIL: TestElector_WithNotifier/LosesLeadership (0.21s)
        logger.go:225: time=2024-03-23T05:34:33.983Z level=INFO msg="Notifier: Listener connecting"
        elector_test.go:228: Starting test-client-id
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Elector: Listening for leadership changes" client_id=test-client-id topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
        logger.go:225: time=2024-03-23T05:34:33.985Z level=INFO msg="Elector: Attempting to gain leadership" client_id=test-client-id
        logger.go:225: time=2024-03-23T05:34:34.055Z level=INFO msg="Elector: Gained leadership" client_id=test-client-id
        elector_test.go:232: Force resigning test-client-id
        logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Received notification from notifier" action=resigned client_id=test-client-id
        logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Current leader attempting reelect" client_id=test-client-id
        elector_test.go:243:
                Error Trace:	/home/runner/work/river/river/internal/riverinternaltest/testfactory/test_factory.go:91
                                                        /home/runner/work/river/river/internal/leadership/elector_test.go:243
                Error:      	Received unexpected error:
                                ERROR: duplicate key value violates unique constraint "river_leader_pkey" (SQLSTATE 23505)
                Test:       	TestElector_WithNotifier/LosesLeadership

This test case tries to confirm that an elector will lose leadership by
force resigning it then inserting a leader record representing another
client. In this case it failed because by the time we tried to insert
our new leader record, the elector that we'd forcefully resigned had
already re-elected itself, thereby causing the unique constraint
violation.

This happens because when a resignation notification is sent, it's
actually received by all clients including the one that just resigned,
and although this resignation usually gets ignored because we're busy
shutting down, that's not necessarily the case and it's possible for a
client to reelect itself.

Here, correct the problem by ignoring resignation notifications from the
same client ID. Presumably if a client just resigned it's for a reason,
and we should immediately try to reelect. In case of a degenerate case
where a client resigns, no other clients are present to be elected
leader, and the client did not intend to shut down (should never, ever
happen), the client can still reelect itself after it's poll timeout
five seconds later, so even in the worst case scenario, we're still
protected.

Add a little more test coverage on leadership change function.

[1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307

@brandur brandur force-pushed the brandur-leader-cannot-reelect-self branch from 1ee0e9d to 3686fd9 Compare March 23, 2024 06:28
}

// If this a resignation from _this_ client, ignore the change.
if notification.LeaderID == e.clientID {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little noise in this PR because I extracted this function so I could test it more easily, but this is the relevant clause that was added here. I also added the early context check just below, but everything else just moves around and was preexisting.

@brandur brandur requested a review from bgentry March 23, 2024 06:31
Copy link
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!

This one's aimed at fixing a rare intermittent test observed here [1]:

    --- FAIL: TestElector_WithNotifier (0.00s)
        --- FAIL: TestElector_WithNotifier/LosesLeadership (0.21s)
            logger.go:225: time=2024-03-23T05:34:33.983Z level=INFO msg="Notifier: Listener connecting"
            elector_test.go:228: Starting test-client-id
            logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Elector: Listening for leadership changes" client_id=test-client-id topic=river_leadership
            logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
            logger.go:225: time=2024-03-23T05:34:33.984Z level=INFO msg="Notifier: Listening on topic" topic=river_leadership
            logger.go:225: time=2024-03-23T05:34:33.985Z level=INFO msg="Elector: Attempting to gain leadership" client_id=test-client-id
            logger.go:225: time=2024-03-23T05:34:34.055Z level=INFO msg="Elector: Gained leadership" client_id=test-client-id
            elector_test.go:232: Force resigning test-client-id
            logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Received notification from notifier" action=resigned client_id=test-client-id
            logger.go:225: time=2024-03-23T05:34:34.057Z level=INFO msg="Elector: Current leader attempting reelect" client_id=test-client-id
            elector_test.go:243:
                    Error Trace:	/home/runner/work/river/river/internal/riverinternaltest/testfactory/test_factory.go:91
                                                            /home/runner/work/river/river/internal/leadership/elector_test.go:243
                    Error:      	Received unexpected error:
                                    ERROR: duplicate key value violates unique constraint "river_leader_pkey" (SQLSTATE 23505)
                    Test:       	TestElector_WithNotifier/LosesLeadership

This test case tries to confirm that an elector will lose leadership by
force resigning it then inserting a leader record representing another
client. In this case it failed because by the time we tried to insert
our new leader record, the elector that we'd forcefully resigned had
already re-elected itself, thereby causing the unique constraint
violation.

This happens because when a resignation notification is sent, it's
actually received by all clients including the one that just resigned,
and although this resignation usually gets ignored because we're busy
shutting down, that's not necessarily the case and it's possible for a
client to reelect itself.

Here, correct the problem by ignoring resignation notifications from the
same client ID. Presumably if a client just resigned it's for a reason,
and we should immediately try to reelect. In case of a degenerate case
where a client resigns, no other clients are present to be elected
leader, and the client did not intend to shut down (should never, ever
happen), the client can still reelect itself after it's poll timeout
five seconds later, so even in the worst case scenario, we're still
protected.

Add a little more test coverage on leadership change function.

[1] https://github.com/riverqueue/river/actions/runs/8399744481/job/23006194307
@brandur brandur force-pushed the brandur-leader-cannot-reelect-self branch from 3686fd9 to 29de83a Compare March 23, 2024 17:44
@brandur
Copy link
Contributor Author

brandur commented Mar 23, 2024

ty.

@brandur brandur merged commit 3dae0f0 into master Mar 23, 2024
10 checks passed
@brandur brandur deleted the brandur-leader-cannot-reelect-self branch March 23, 2024 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants