Shutdown ClusterTopologyRefreshTask properly #2985

Open · thachlp wants to merge 9 commits into main from shutdonw-clustertopologyrefreshtask-properly

Conversation

Contributor

@thachlp thachlp commented Sep 12, 2024

Issue: #2904
Make sure that:

  • You have read the contribution guidelines.
  • You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
  • You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
  • You submit test cases (unit or integration tests) that back your changes.

@tishun tishun changed the title Shutdonw clustertopologyrefreshtask properly Shutdown ClusterTopologyRefreshTask properly Sep 13, 2024
Collaborator

@tishun tishun left a comment

Hey @thachlp,

Thanks for giving this fix a go. I think, however, you may be on the wrong path.

Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.

The suspendTopologyRefresh() is supposed to suspend any topology refresh tasks, but it seems there is some case (a race condition, perhaps?) where a task is still executed during shutdown.

@tishun tishun added the status: waiting-for-feedback We need additional information before we can continue label Oct 18, 2024
@thachlp
Contributor Author

thachlp commented Nov 4, 2024

> Hey @thachlp,
>
> Thanks for giving this fix a go. I think, however, you may be on the wrong path.
>
> Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.
>
> The suspendTopologyRefresh() is supposed to suspend any topology refresh tasks, but it seems there is some case (a race condition, perhaps?) where a task is still executed during shutdown.

From the Javadoc of suspendTopologyRefresh:

    /**
     * Suspend periodic topology refresh if it was activated previously. Suspending cancels the periodic schedule without
     * interrupting any running topology refresh. Suspension is in place until obtaining a new {@link #connect connection}.
     *
     * @since 6.3
     */
    public void suspendTopologyRefresh() {
        topologyRefreshScheduler.suspendTopologyRefresh();
    }

From my point of view, when we shut down RedisClusterClient, we should STOP running tasks and CANCEL scheduled tasks; that is why I wrote the code to STOP running tasks. The sketch below illustrates the difference.
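
To illustrate the distinction with plain java.util.concurrent (a standalone, hypothetical sketch, not lettuce code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SuspendVsStop {

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // a stand-in for the periodic topology refresh
        ScheduledFuture<?> periodic = scheduler.scheduleAtFixedRate(
                () -> System.out.println("refreshing topology..."), 0, 200, TimeUnit.MILLISECONDS);

        Thread.sleep(500);

        // "suspend": cancels the periodic schedule but does not interrupt a run
        // that is already in progress (what suspendTopologyRefresh describes)
        periodic.cancel(false);

        // "stop": additionally prevents queued tasks from starting and
        // interrupts running ones
        scheduler.shutdownNow();
    }
}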

Thanks @tishun for explaining this to me. Do you have any suggestions for the fix?

@tishun
Collaborator

tishun commented Nov 5, 2024

I will try to come back to you at the end of the week.

@mp911de
Collaborator

mp911de commented Nov 6, 2024

This PR introduces a check for a very specific scenario. The change doesn't necessarily lead to a proper cancellation, as the task itself is composed of a series of refresh steps that are coupled through completable futures. Specifically, RedisClusterClient.refreshPartitionsAsync(…) is being called, and it has no notion of being interrupted.

I think conceptually the easiest approach is to synchronize (and wait) until ClusterTopologyRefreshTask has finished before shutting down ClientResources. ClusterTopologyRefreshTask would require a CompletableFuture<Void> that is completed upon completion of the Supplier<CompletionStage<?>>.

It would also require a bit of housekeeping, e.g.

if (isEventLoopActive()) {
    // not atomic: the event loop may begin shutting down between this check
    // and the submit(...) call below
    clientResources.eventExecutorGroup().submit(clusterTopologyRefreshTask);
    return true;
}

isn't atomic, and EventExecutorGroup.submit(…) could return a failed future, which requires consideration as well.
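
Roughly, the completion tracking could look like this (a hypothetical sketch, not the actual lettuce implementation; field and method names are illustrative):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

// hypothetical sketch of a task that exposes a future for shutdown to await
class ClusterTopologyRefreshTask implements Runnable {

    private final Supplier<CompletionStage<?>> refreshSupplier;

    // completed once the current refresh cycle has finished, normally or exceptionally
    private volatile CompletableFuture<Void> completion = CompletableFuture.completedFuture(null);

    ClusterTopologyRefreshTask(Supplier<CompletionStage<?>> refreshSupplier) {
        this.refreshSupplier = refreshSupplier;
    }

    @Override
    public void run() {
        CompletableFuture<Void> current = new CompletableFuture<>();
        completion = current;
        // complete the tracking future regardless of how the refresh ends
        refreshSupplier.get().whenComplete((result, cause) -> current.complete(null));
    }

    // shutdown code can await this (with a timeout) before closing ClientResources
    CompletableFuture<Void> completion() {
        return completion;
    }
}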

@Kvicii

Kvicii commented Dec 19, 2024

@tishun @mp911de @thachlp
Is there any follow-up? I have the same problem; see #3089.

@tishun
Collaborator

tishun commented Dec 23, 2024

As @mp911de mentioned, we need to devise a better solution to this problem.
He has explained this in his comment above, and I also elaborated more in #2904.

@thachlp thachlp closed this Dec 31, 2024
@thachlp thachlp deleted the shutdonw-clustertopologyrefreshtask-properly branch December 31, 2024 04:14
@thachlp thachlp restored the shutdonw-clustertopologyrefreshtask-properly branch December 31, 2024 04:16
@thachlp thachlp reopened this Dec 31, 2024
@thachlp thachlp force-pushed the shutdonw-clustertopologyrefreshtask-properly branch from fbb3951 to d8507a5 on December 31, 2024 04:34
@thachlp
Contributor Author

thachlp commented Jan 2, 2025

> As @mp911de mentioned, we need to devise a better solution to this problem. He has explained this in his comment above, and I also elaborated more in #2904.

Thanks @mp911de @tishun for the reviews.
As I understand it, we should add a CompletableFuture to track the completion of ClusterTopologyRefreshTask so that cancellation can wait for it.
Please help review my implementation 🙇
Btw, is the failing test flaky? When I run it locally, it passes.

@thachlp thachlp requested review from tishun and ggivo January 2, 2025 10:24
@tishun
Collaborator

tishun commented Jan 6, 2025

Hey @thachlp,
apologies, but I think this would still not work. Let me elaborate.

The suspendTopologyRefresh is not the problem itself, because all it does is make sure that no new refresh is scheduled. However, any refresh that is already initiated is going to continue anyway.

Then when the event loop is shut down it would print this message.

What we need to do is (a rough sketch follows the list):

  • when we initiate a refresh, we need to indicate this with a lock
  • at the point where the event loop is being shut down, we need to block on that lock
  • if there is no holder of the lock (no refresh currently running), the event loop will close normally
  • if there is a holder of the lock (a refresh is currently happening), the shutdown would block and wait
  • when the refresh is complete, we release the lock (only after the job is done)
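
Something along these lines (a rough, hypothetical sketch; the names are illustrative, and a Semaphore stands in for the lock because the refresh completes on a different thread than the one that starts it):

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// hypothetical guard coordinating topology refresh and event loop shutdown
class TopologyRefreshGuard {

    private final Semaphore permit = new Semaphore(1);

    // called when a refresh is initiated; false means shutdown already holds the permit
    boolean tryBeginRefresh() {
        return permit.tryAcquire();
    }

    // called from the refresh future's whenComplete(...), only after the job is done
    void endRefresh() {
        permit.release();
    }

    // called before the event loop shuts down: returns immediately if no refresh is
    // running, otherwise blocks until the in-flight refresh releases the permit;
    // the permit is kept so that no new refresh can start during shutdown
    boolean awaitRefreshCompletion(long timeout, TimeUnit unit) throws InterruptedException {
        return permit.tryAcquire(timeout, unit);
    }
}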

Collaborator

@tishun tishun left a comment


I think we are getting close to what we want to have.
It would be very hard to test this, but if you can think of some unit test, that would also be handy. This is good to have, but not mandatory.

Thank you for spending time on this!

Contributor Author

@thachlp thachlp left a comment


@tishun thanks for reviewing it.
I updated it as you suggested; please take a look 🙇

@tishun tishun force-pushed the shutdonw-clustertopologyrefreshtask-properly branch from be55f44 to d71ae46 on May 8, 2025 12:54
@thachlp thachlp requested a review from tishun May 9, 2025 04:10
@tishun
Collaborator

tishun commented May 9, 2025

Okay, I kind of broke it completely.

I will come back to this next week.

@tishun tishun added status: waiting-for-triage and removed status: waiting-for-feedback We need additional information before we can continue labels May 28, 2025