Fix: Prevent routing reads to Redis slave nodes in loading state #3370

Merged: 2 commits merged into master on May 21, 2025

Conversation

ofekshenawa (Collaborator)

Problem

When a new slave node is added to a Redis cluster, it enters a loading state while it synchronizes data from its master. If the client has ReadOnly=true enabled, it may route read operations to this loading slave node, which can result in nil values or errors being returned to the application.
closes #3222
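
For reference, a minimal sketch of the affected configuration, assuming a standard go-redis cluster client; the addresses are placeholders. With ReadOnly enabled, read-only commands may be routed to replica nodes, including one that is still loading its dataset:

```go
package main

import "github.com/redis/go-redis/v9"

func main() {
	// ReadOnly enables routing read-only commands to slave nodes,
	// which is the configuration affected by this issue.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{":7000", ":7001", ":7002"},
		ReadOnly: true,
	})
	defer rdb.Close()
}
```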

Solution

This PR adds logic to detect when a slave node is in the loading state (by checking for LOADING errors) and temporarily excludes such nodes from the read distribution until they're ready to serve requests.
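
A minimal sketch of what the detection could look like; isLoadingError is a hypothetical helper, not necessarily the name used in the PR. go-redis surfaces RESP errors with the leading "-" stripped, so a loading node's reply starts with "LOADING":

```go
import "strings"

// isLoadingError reports whether err is a Redis -LOADING reply,
// which a node returns while it is still loading its dataset.
// Hypothetical helper; the PR's actual implementation may differ.
func isLoadingError(err error) bool {
	return err != nil && strings.HasPrefix(err.Error(), "LOADING")
}
```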

@ndyakov (Member) commented May 9, 2025

@ofekshenawa I am not sure (I was not able to find information) whether PING will return -LOADING errors. Let's think of a way to test this somehow; we will need to write tests for it anyway. Sure, we can mock the response and get the correct execution, but off the top of my head I am not sure how to actually make a Redis node spend long enough loading an RDB that we can test against it. Any ideas?

@ndyakov (Member) commented May 9, 2025

Update: PING at the moment should return LOADING, but it may change. Keep that in mind. Let's think of a way to test this.
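
One hedged way to act on the mocking idea above: a fake TCP server that answers every command with the RESP error a loading node sends. This is purely illustrative, not from the PR or its test suite:

```go
import "net"

// startLoadingStub starts a fake node that answers every command with
// the -LOADING error a real server sends while loading its dataset.
// Illustrative test helper only.
func startLoadingStub() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 4096)
				for {
					if _, err := c.Read(buf); err != nil {
						return
					}
					// The RESP error reply of a loading node.
					c.Write([]byte("-LOADING Redis is loading the dataset in memory\r\n"))
				}
			}(conn)
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}
```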

@ndyakov (Member) left a comment

This will work, but we should find a good way to test it. Let's add a note about it.

@ofekshenawa ofekshenawa marked this pull request as ready for review May 21, 2025 07:07
@ndyakov ndyakov merged commit d7ba255 into master May 21, 2025
25 of 26 checks passed
@LINKIWI (Contributor) commented May 28, 2025

Won't this perform a PING in the critical path of execution of any command against a slave node? The LOADING error status should be caught lazily and cached in-memory, bounded by a penalty timeout, mirroring the implementation of Failing: https://github.com/redis/go-redis/blob/v9.9.0/osscluster.go#L414
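
A minimal sketch of that suggestion, assuming a clusterNode-like struct; the field and method names here are hypothetical, modeled loosely on how Failing is tracked in osscluster.go (an atomically stored timestamp checked against a timeout):

```go
import (
	"sync/atomic"
	"time"
)

const loadingTimeout = 10 * time.Second // hypothetical penalty window

type node struct {
	loading atomic.Int64 // unix nanos of the last LOADING error, 0 if none
}

// MarkAsLoading records that the node just returned a LOADING error.
func (n *node) MarkAsLoading() {
	n.loading.Store(time.Now().UnixNano())
}

// Loading reports whether the node is still inside its penalty window
// and should therefore be skipped for read routing.
func (n *node) Loading() bool {
	ts := n.loading.Load()
	if ts == 0 {
		return false
	}
	if time.Since(time.Unix(0, ts)) > loadingTimeout {
		n.loading.Store(0) // penalty expired; node is eligible again
		return false
	}
	return true
}
```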

We are observing more than a 2x latency regression on v9.9.0, which includes this commit, when slave routing is enabled.

@ndyakov I think it would be a good idea to issue a patch release either with this commit reverted, or the behavior fixed forward as suggested above.

@ste93cry commented Jun 16, 2025

@ndyakov Just like in the comment above: in the project I'm working on we push thousands of commands through a pipeline, and we've noticed that latency has more than doubled. This change really needs to be rolled back and handled differently, as sending a PING with every command just isn't practical.

@ndyakov (Member) commented Jun 17, 2025

@LINKIWI, @ste93cry thank you for reporting this. The loaded state should/can be cached; you are both correct. @ste93cry for the time being, feel free to use a release from before this patch; I will work on adding a cache and will release a new patch soon.

Let's assume we add a Loaded state and perform the PING operation only while it is 0. Once the PING is successful we will set the state to 1 and stop pinging. One thing I am not certain about is when the Loaded state should be reset to 0. One good candidate would be when a node is marked as Failing, but I assume there may be cases we don't catch. I am open to suggestions. For now I will implement the basic caching mechanism to reduce the latency overhead and will look deeper into it when we have more time.
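
A naive sketch of the described flag, assuming an atomic loaded field on a node struct holding a go-redis client; all names are illustrative, not from the actual patch. The PING is issued only until the first success, after which it is skipped entirely:

```go
import (
	"context"
	"sync/atomic"

	"github.com/redis/go-redis/v9"
)

type node struct {
	Client *redis.Client
	loaded atomic.Uint32 // 0 = unknown or loading, 1 = confirmed loaded
}

// checkLoaded pings the node only while loaded is 0; once a PING
// succeeds the flag sticks at 1 and no further PINGs are issued.
func (n *node) checkLoaded(ctx context.Context) bool {
	if n.loaded.Load() == 1 {
		return true // already confirmed; skip the PING entirely
	}
	if err := n.Client.Ping(ctx).Err(); err != nil {
		return false // still loading, or otherwise unreachable
	}
	n.loaded.Store(1)
	return true
}
```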

@LINKIWI (Contributor) commented Jun 17, 2025

Thanks for following up. Regarding when the cached loaded state should return to 0: Failing currently handles this problem with a 15s open circuit timeout. I think the loading state can be tracked similarly.

In most cases/for most users, I suspect slaves should only be in the LOADING state for less than 30 seconds. So it could make sense to use a more aggressive penalty timeout, maybe on the order of ~5 seconds. The side effect here is that a newly issued command after 5s could once again hit LOADING, at which point the slave would once again be marked ineligible for the next 5s.

A separate follow-up here could be to transparently fail over command execution to another available slave after observing LOADING (and recording this internal state). This would hide the user-facing errors (when more than one replica is available), avoid compromising steady-state latency, and mostly avoid redundantly querying slaves during the loading phase.
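
A hedged sketch of that failover idea, building on the hypothetical node type and isLoadingError helper from the earlier sketches; this is not how the patch actually behaves:

```go
// readWithFailover runs a read on the first eligible replica, skipping
// nodes inside the loading penalty window and failing over when a
// LOADING error is observed (recording it along the way).
func readWithFailover(replicas []*node, do func(*node) error) error {
	var lastErr error
	for _, n := range replicas {
		if n.Loading() {
			continue // penalized; try the next replica
		}
		err := do(n)
		if err == nil || !isLoadingError(err) {
			return err // success, or a non-LOADING error to surface
		}
		n.MarkAsLoading() // record the state and fail over
		lastErr = err
	}
	return lastErr // every replica was loading or penalized
}
```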

@ndyakov (Member) commented Jun 17, 2025

@LINKIWI I think introducing a timeout will require more testing to calculate a good default value. Here is a naive approach without a timeout; let's continue the discussion there. I would prefer to have the naive approach merged and released, and then, once we have the time, we can find a better approach: #3410

Development

Successfully merging this pull request may close these issues:

Read nil from a new-joined slave node in Redis cluster.