Issue #1252: Random node + rediscovery on connection exception replaced with rediscovery at the end #1256
Conversation
//But now if maxRedirections = 1 or 2 we will do it too often.
This comment worries me a little. The redis cluster my company runs is being used as a cache -- all master nodes, in-memory only. When a node goes down, we just want those requests to fail as fast as possible until we bring it back up or rebalance, so we're currently setting maxRedirects to 1, and setting very low socket connect/read timeouts. With this change, any time a client has a socket timeout, we'll renew the slot cache, which seems suboptimal (particularly because in our case, nothing will have changed).
With this change, any time a client has a socket timeout, we'll renew the slot cache, which seems suboptimal (particularly because in our case, nothing will have changed).
Without this change it's worse! It will try a random node, get a MOVED, and do a rebalance anyway :)
So, even more useless work.
Read this comment together with the description: #1253 (comment)
So I'm just trying to make it less of a mess, without the random node.
But the problem is that we need cluster rediscovery. If we remove it from here, then when a node goes down JedisCluster will not discover the new master.
So ideally I'd like a scheme where connection exceptions are tracked in some structure/class and we rebalance only occasionally. But for now I just optimized "random node -> MOVED -> rebalancing -> ..." into "try all attempts -> rediscover cluster".
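A minimal, self-contained sketch of that "try all attempts -> rediscover cluster" flow. The class and method names are illustrative stand-ins, not the actual Jedis implementation:

```java
// Sketch of the flow this PR moves to: exhaust the attempts against the
// slot owner first, then rediscover the cluster once, at the end.
public class EndRediscoveryModel {
    static int rediscoveries;

    static void runCommand(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryNodeForSlot()) {
                return; // success, no rediscovery needed
            }
            // connection failed; retry without hopping to a random node
        }
        rediscoveries++; // all attempts failed: renew the slot cache once
        throw new IllegalStateException("no reachable nodes");
    }

    static boolean tryNodeForSlot() {
        return false; // simulate a node that is down
    }

    public static void main(String[] args) {
        try {
            runCommand(3);
        } catch (IllegalStateException e) {
            System.out.println("attempts exhausted, rediscoveries=" + rediscoveries);
        }
    }
}
```

With three failed attempts there is exactly one rediscovery, instead of one per hop.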
I think what's happening to us when we hit these timeouts right now is that we fail with a JedisConnectException when runWithRetries is called the first time, and then we call runWithRetries recursively with redirections = 0, which immediately throws a JedisClusterMaxRedirectionsException. We'll never get a JedisMovedDataException, so we'll never actually try to renew the slot cache. (I think; it's been a while since I stepped through this code.)
and then we call runWithRetries recursively with redirections = 0
I see another path. We will call recursively with redirections-1 and tryRandom=true. So the next call goes to a random node, we of course get a MOVED, and we rediscover on EACH timeout. See the stacktrace attached to my comment #1253 (comment).
at redis.clients.jedis.BinaryJedis.ping(BinaryJedis.java:106)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnection(JedisSlotBasedConnectionHandler.java:41)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:113)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:30)
at redis.clients.jedis.JedisCluster.setex(JedisCluster.java:292)
We get a random Jedis here because that's the purpose of the parameterless getConnection method. And after that we of course get a MOVED, followed by rediscovery.
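Both cases above can be seen in a simplified, self-contained model of the retry loop. The class below is a stand-in for JedisClusterCommand.runWithRetries, not the real Jedis code: with maxRedirects=1 the redirections check fires before the random-node attempt, so the slot cache is never renewed; with maxRedirects=2 the random node answers MOVED and renews on every timed-out request.

```java
// Stand-in model of the retry control flow discussed in this thread.
public class RetryModel {
    static int slotCacheRenewals;

    // Simulates a command against a slot whose owner is down: every
    // direct attempt fails with a connection error.
    static void runWithRetries(int redirections, boolean tryRandomNode) {
        if (redirections <= 0) {
            throw new IllegalStateException("JedisClusterMaxRedirectionsException");
        }
        if (tryRandomNode) {
            // The random node answers MOVED, which renews the slot cache.
            slotCacheRenewals++;
            runWithRetries(redirections - 1, false);
            return;
        }
        // Direct attempt times out -> old behavior retries on a random node.
        runWithRetries(redirections - 1, true);
    }

    static void demo(int maxRedirects) {
        slotCacheRenewals = 0;
        try {
            runWithRetries(maxRedirects, false);
        } catch (IllegalStateException e) {
            System.out.println("maxRedirects=" + maxRedirects
                    + " -> slot cache renewals=" + slotCacheRenewals);
        }
    }

    public static void main(String[] args) {
        demo(1); // fails immediately, never renews the slot cache
        demo(2); // the random-node hop gets MOVED and renews every time
    }
}
```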
And it's better than not doing rediscovery at all and just failing. In your scheme we would never discover that a slave became master after a master failure.
We use Redis Cluster with replication factor 3 because this cache is critical. So it's important for us that Jedis rediscovers the cluster in these cases.
:( Ah, I see. Yeah, we were experiencing that too initially, which is ultimately why we set maxRedirects=1 (so we wouldn't get bounced around to the other nodes, which definitely don't have the data).
Yep, you won't experience it with maxRedirects=1, but for us that's a bad solution, and what goes on with values > 1 looks strange.
Do you really want to get rid of cluster rediscovery on timeouts if you have maxRedirects=1?
I could implement a solution that tracks successes/failures and rediscovers no more often than once per some interval, like 5s, and only if all requests failed. But it would be more complicated and the idea needs review first.
Or we could make an exception for maxRedirects=1 via an additional flag to this method, because right now with maxRedirects=1 it doesn't initiate cluster rediscovery... but it looks ugly. And with maxRedirects=1 you would never rediscover at all in this scheme.
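The rate-limited idea could look roughly like this. RediscoveryThrottle, shouldRenew, and the 5-second interval are hypothetical names and values for illustration, not anything that exists in Jedis:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: allow a slot cache renewal at most once per interval,
// so a burst of connection exceptions triggers only one rediscovery.
public class RediscoveryThrottle {
    private final long minIntervalMillis;
    private final AtomicLong lastRenewal = new AtomicLong(0);

    public RediscoveryThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns true if the caller should renew the slot cache now.
    // compareAndSet ensures only one thread wins per interval.
    public boolean shouldRenew(long nowMillis) {
        long last = lastRenewal.get();
        return nowMillis - last >= minIntervalMillis
                && lastRenewal.compareAndSet(last, nowMillis);
    }

    public static void main(String[] args) {
        RediscoveryThrottle t = new RediscoveryThrottle(5000);
        System.out.println(t.shouldRenew(10_000)); // true: first failure
        System.out.println(t.shouldRenew(12_000)); // false: within 5s window
        System.out.println(t.shouldRenew(15_000)); // true: interval elapsed
    }
}
```

A real version would also need the "only if all requests failed" condition mentioned above, i.e. tracking successful responses as well.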
Do you really want to get rid of cluster rediscovering on timeouts if you have maxRedirects=1?
I'm kind of torn. I don't know what the right thing is here. :(
I like the change to avoid using tryRandomNode -- for my use case (and for yours too, it sounds like), it only adds unnecessary hops within the cluster that aren't likely to result in success.
The renewSlotCache change adds unnecessary work for the Jedis clients in my use case, because we don't have any slaves to fail over to. BUT it's entirely possible my use case is the weird one, and your proposed solution makes things better for the majority of users; in that case you should ignore me, and I'll try to figure out a way to make this code pluggable (maybe a RetryHandler, or something?) that could allow more flexibility in how clients fail. In any event, I'm just an enthusiast, and have no say in whether or not this gets merged anyway. :)
I'm totally fine with implementing a more complex solution in another PR right after this one, one that avoids rediscovering more often than once per interval. That would be both correct and fine for your case, and it would not rediscover when nodes are merely unstable but still available. It just needs approval of the idea by @marcosnils and @HeartSaVioR in that case.
@Spikhalskiy what if we change this condition to redirections == 1? In this case we'll only perform discovery on the last redirection. It would also help with @roblg's case: if he configures maxRedirections == 1, he won't get a discovery on every request until he brings his instance back up.
@marcosnils If we change it to 1, nothing actually changes. It still rediscovers only after the last request, and it won't change anything for @roblg's case; the condition will still be true after the first and only timeout. We could set it to 2. In that case it would rediscover BEFORE the last attempt and solve @roblg's case, but that's a bad fix: in a setup like @roblg's but with slaves, a failed master could not be handled correctly, because no rediscovery would be performed.
My proposal: limit the frequency of rediscoveries triggered by JedisConnectionException, and collect info about good responses to avoid rediscovery when nodes are merely unstable (not on MOVED).
Now this name is absolutely misleading. It's not the number of redirections, it's the total number of attempts. So maxRedirections = 1 means there are no redirections, only one attempt. I would suggest renaming it to maxAttempts.
+1 to rename to maxAttempts.
Now this name is absolutely misleading. It's not the number of redirections, it's the total number of attempts. So maxRedirections = 1 means there are no redirections, only one attempt. I would suggest renaming it to maxAttempts.
This is with the modifications in this PR, correct? Because as JedisCluster is today, if you set maxRedirections to 1, you can get redirected only once.
Because as JedisCluster is today, if you set maxRedirections to 1, you can get redirected only once.
You will be 'directed' only once, but 'redirected' zero times. So you make one 'attempt'. When I set maxRedirections to 1, I could assume the cluster client will make one request and follow one MOVED if required; when I set it to 0, I could assume it will make only one main attempt. But it doesn't work this way (JedisClusterCommand#runWithRetries checks redirections <= 0 to throw the exception). In addition, the name maxRedirections doesn't convey that this count is also used as the number of attempts on any failure; a user could assume it has only MOVED-related semantics. Those are my thoughts about this name.
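A tiny self-contained illustration of this naming problem; the variable names mirror the discussion, not the actual Jedis fields. With maxRedirections = 1 the client makes exactly one attempt and follows zero redirects:

```java
// With the check `redirections <= 0`, the counter is really a budget of
// attempts: it is decremented on ANY failure, redirect or timeout alike.
public class AttemptCounter {
    public static void main(String[] args) {
        int maxRedirections = 1;
        int redirections = maxRedirections;
        int attempts = 0;
        while (redirections > 0) { // redirections <= 0 would throw
            attempts++;            // one request is made...
            redirections--;        // ...and the budget drops on any failure
        }
        System.out.println("attempts=" + attempts
                + ", redirects followed=" + (attempts - 1));
    }
}
```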
@Spikhalskiy I see your point. Another important thing is that the redirections variable is decremented on both redirections and timeouts, so as you said it doesn't reflect the intent. I believe renaming it to maxAttempts and adding proper documentation would be enough.
Can you please rebase and apply the renaming? I'll push it so we can merge this today and release 2.9.
@HeartSaVioR @marcosnils any updates here?
@HeartSaVioR @xetorthio LGTM! Please review this whenever you can, as it's the last PR before releasing 2.9.
1. New special exception for "No reachable nodes"
2. Fixed a situation where a simple connection exception with tryRandomNode=true was treated as just "no reachable nodes"
3. Don't try a random node, with an immediate cluster renewal after it, when we get a ConnectionTimeout from the proper slot's Jedis
4. Fix overwriting the real root cause JedisConnectionException with JedisClusterMaxRedirectionsException
@marcosnils rebased; the renaming (maxRedirections -> maxAttempts) is done as a separate second commit in this PR.
@Spikhalskiy thanks a lot.
I'll go ahead and merge this, as it went through a lot of discussion and the changes seem consistent.
…ed with rediscovery at the end (#1256)
* Issue #1252:
1. New special exception for "No reachable nodes"
2. Fixed a situation where a simple connection exception with tryRandomNode=true was treated as just "no reachable nodes"
3. Don't try a random node, with an immediate cluster renewal after it, when we get a ConnectionTimeout from the proper slot's Jedis
4. Fix overwriting the real root cause JedisConnectionException with JedisClusterMaxRedirectionsException
Conflicts:
src/main/java/redis/clients/jedis/BinaryJedisCluster.java
src/main/java/redis/clients/jedis/JedisCluster.java
Downmerged to 2.8 and 2.9 respectively. Thanks for the contribution.
This PR addresses the problem with the flow "just one timeout -> random node for nothing -> rediscovery" on each timed-out request, for Issue #1252.
It mostly addresses my comment on another PR: #1253 (comment)
It should be merged after the more critical PR #1253.