Conversation

@Spikhalskiy
Contributor

Addresses the problematic flow "just one timeout -> random node for nothing -> rediscovery" on each timed-out request, for Issue #1252:

  1. Don't try a random node (with an immediate cluster renewal after it) when we get a ConnectionTimeout from the proper-slot Jedis. Keep trying the proper slot until the last attempt; if the last attempt fails, just run cluster rediscovery, without trying random nodes, which is useless.
  2. New dedicated exception for "No reachable nodes".
  3. Fixed the situation where a plain connection exception with tryRandomNode=true was treated as just "no reachable nodes".
  4. Fixed the real root cause JedisConnectionException being overwritten by JedisClusterMaxRedirectionsException.

This PR mostly addresses my comment on another PR: #1253 (comment).
It should be merged after the more critical PR #1253.

Contributor

//But now if maxRedirections = 1 or 2 we will do it too often.

This comment worries me a little. The redis cluster my company runs is being used as a cache -- all master nodes, in-memory only. When a node goes down, we just want those requests to fail as fast as possible until we bring it back up or rebalance, so we're currently setting maxRedirects to 1, and setting very low socket connect/read timeouts. With this change, any time a client has a socket timeout, we'll renew the slot cache, which seems suboptimal (particularly because in our case, nothing will have changed).

Contributor Author

With this change, any time a client has a socket timeout, we'll renew the slot cache, which seems suboptimal (particularly because in our case, nothing will have changed).

Without this change it's worse! It will try a random node, get a MOVED, and trigger a rebalance :)
So, even more useless work.
Read the comment with the description in #1253 (comment).
So, I'm just trying to make it less of a mess, without the random node.

But the problem is that we need cluster rediscovery. If we remove it from here and a node goes down, JedisCluster will never discover the new master.
So ideally I picture a scheme where connection exceptions are tracked in some structure/class and we rebalance only occasionally. But for now I just optimized "random node -> MOVED -> rebalancing -> ..." into "try all attempts -> rediscover cluster".
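
The "try all attempts -> rediscover cluster" flow can be sketched roughly like this (an illustrative simplification, not the actual JedisClusterCommand code; all names here are stand-ins):

```java
// Illustrative sketch of the flow described above -- not the real Jedis
// source. Before the PR: a connection timeout fell through to a random node,
// which answered MOVED and triggered a slot-cache renewal on every timeout.
// After: keep retrying the proper slot owner; renew the slot cache only once,
// when the last attempt has failed.
public class RetryFlowSketch {
    interface Connection { String call(String key); }
    interface Handler {
        Connection getConnectionFromSlot(int slot);
        void renewSlotCache();
    }

    static String runWithRetries(Handler handler, int slot, String key, int attempts) {
        if (attempts <= 0) {
            // last attempt already failed: rediscover the cluster once, then give up
            handler.renewSlotCache();
            throw new RuntimeException("no reachable node for slot " + slot);
        }
        try {
            return handler.getConnectionFromSlot(slot).call(key);
        } catch (RuntimeException connectionProblem) {
            // retry the SAME slot owner instead of jumping to a random node
            return runWithRetries(handler, slot, key, attempts - 1);
        }
    }
}
```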

Contributor

I think what's happening to us when we hit these timeouts right now is that we fail with a JedisConnectionException when runWithRetries is called the first time, and then we call runWithRetries recursively with redirections = 0, which immediately throws a JedisClusterMaxRedirectionsException. We'll never get a JedisMovedDataException, so we'll never actually try to renew the slot cache. (I think; it's been a while since I stepped through this code.)

Contributor Author

and then we call runWithRetries recursively with redirections = 0

I see it differently. We will call recursively with redirections-1 and tryRandom=true, so the next call goes to a random node, we of course get a MOVED, and we rediscover on EACH timeout. See the stacktrace attached to my comment #1253 (comment).

at redis.clients.jedis.BinaryJedis.ping(BinaryJedis.java:106)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnection(JedisSlotBasedConnectionHandler.java:41)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:113)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:30)
at redis.clients.jedis.JedisCluster.setex(JedisCluster.java:292)

We get this random Jedis here because that's the purpose of the parameterless getConnection method. And after that we of course get a MOVED, with rediscovery following it.

Contributor Author

And it's better than doing no rediscovery at all and just failing. In your scheme we would never discover that a slave became master after a master failure.

We use Redis Cluster with replication factor 3, because this cache is critical. So it's important for us that Jedis rediscovers the cluster in these cases.

Contributor

:( Ah, I see. Yeah, we were experiencing that too initially, which is ultimately why we set maxRedirects=1 (so we wouldn't get bounced around to the other nodes, which definitely don't have the data).

Contributor Author

Yep, you will not experience it with maxRedirects=1, but for us that's a bad solution, and what happens with values > 1 looks strange :)

Do you really want to get rid of cluster rediscovery on timeouts when you have maxRedirects=1?
I could implement a solution that tracks successes/failures and rediscovers no more often than once per some interval, say 5s, and only if all requests failed. But it would be more complicated and the idea needs review first.

Or... we could special-case maxRedirects=1 with an additional flag to this method, since right now with maxRedirects=1 it doesn't initiate cluster rediscovery... but that looks ugly. And with maxRedirects=1 in this scheme you would never rediscover at all.
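
A rough sketch of what such a throttle could look like (hypothetical code, not part of this PR; the class name, the 5s window, and the CAS approach are all my own invention):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throttle for slot-cache renewals, as proposed above: at most
// one rediscovery per time window, no matter how many connection failures
// arrive. Names and the window length are illustrative, not Jedis API.
public class RediscoveryThrottle {
    private final long windowMillis;
    private final AtomicLong lastRenewal = new AtomicLong(0);

    public RediscoveryThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the caller won the right to renew the slot cache now. */
    public boolean tryAcquire(long nowMillis) {
        long last = lastRenewal.get();
        // compareAndSet makes sure only one of many concurrent failing
        // callers actually triggers the (expensive) rediscovery
        return nowMillis - last >= windowMillis
            && lastRenewal.compareAndSet(last, nowMillis);
    }
}
```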

Contributor

Do you really want to get rid of cluster rediscovering on timeouts if you have maxRedirects=1?

I'm kind of torn. I don't know what the right thing is here. :(

I like the change to avoid using tryRandomNode -- for my use case (and for yours too, it sounds like), it only adds unnecessary hops within the cluster that aren't likely to result in success.

The renewSlotCache change adds unnecessary work for the Jedis clients for my use case, because we don't have any slaves to fail over, BUT it's totally possible my use case is weird, and your proposed solution here makes things better for the majority of users, in which case you should ignore me, and I'll try to figure out a way to make this code pluggable (maybe a RetryHandler, or something?) that could allow more flexibility in how clients fail. In any event, I'm just an enthusiast, and have no say in whether or not this gets merged anyway. :)
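
For what it's worth, the pluggable idea could look something like this (purely hypothetical; none of these names exist in Jedis):

```java
// Purely hypothetical sketch of the pluggable failure-handling idea floated
// above -- not Jedis API. A policy decides, per connection failure, whether
// to retry, rediscover the cluster, or give up, so fail-fast cache users
// and failover users can configure different behavior.
public interface RetryPolicy {
    enum Action { RETRY, RENEW_SLOTS, FAIL }

    Action onConnectionFailure(int attemptsLeft);

    /** Fail-fast variant for cache-only clusters: never rediscover. */
    static RetryPolicy failFast() {
        return attemptsLeft -> attemptsLeft > 0 ? Action.RETRY : Action.FAIL;
    }

    /** Failover-friendly variant: rediscover once the attempt budget is spent. */
    static RetryPolicy rediscoverOnExhaustion() {
        return attemptsLeft -> attemptsLeft > 0 ? Action.RETRY : Action.RENEW_SLOTS;
    }
}
```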

Contributor Author

I'm totally fine with implementing a more complex solution in one more PR right after this one, which would avoid rediscovering more often than once per interval. It would be both correct and fine for your case, and it would not rediscover when nodes are merely unstable but still available. I just need @marcosnils and @HeartSaVioR to approve the idea in that case.

@Spikhalskiy Spikhalskiy changed the title Issue #1252: Random nodes on connection exception replaced with rediscovery Issue #1252: Random node + rediscovery on connection exception replaced with rediscovery at the end Apr 13, 2016
Contributor

@Spikhalskiy what if we change this condition to redirections == 1? In this case we'll only perform discovery on the last redirection. It will also help with @roblg's case: if he configures maxRedirections == 1, he won't get a discovery on every request until he brings his instance back up.

Contributor Author

@marcosnils If we change it to 1, nothing actually changes. It still rediscovers only after the last request, and it won't change anything for @roblg's case; the condition will still be true after the first and only timeout. We could set it to 2; in that case it would rediscover BEFORE the last attempt and solve @roblg's case, but that's a bad fix. If we had @roblg's case but with slaves, a failed master could not be detected correctly, because no rediscovery would be performed.

My proposal: limit the frequency of rediscoveries triggered by JedisConnectionException, and collect info about good responses to avoid rediscovering when nodes are just unstable (not on MOVED).

Contributor Author
@Spikhalskiy Spikhalskiy Apr 15, 2016

Now this name is absolutely misleading. It's not the number of redirections, it's the total number of attempts. So maxRedirections = 1 means there are no redirections, only one attempt. I would suggest renaming it to maxAttempts.

Contributor

+1 to rename to maxAttempts.

Contributor
@marcosnils marcosnils Jul 18, 2016

Now this name is absolutely misleading. It's not the number of redirections, it's the total number of attempts. So maxRedirections = 1 means there are no redirections, only one attempt. I would suggest renaming it to maxAttempts.

This is with the modifications in this PR, correct? Because as JedisCluster is today, if you set maxRedirections to 1, you can get redirected only once.

Contributor Author
@Spikhalskiy Spikhalskiy Jul 18, 2016

Because as JedisCluster is today, if you set maxRedirections to 1, you can get redirected only once.

You will be 'directed' only once, but 'redirected' zero times. So you make one 'attempt'. When I set maxRedirections to 1, I could assume that the cluster client will make one request and follow one MOVED if required; when I set it to 0, I could assume that it will make only the one main attempt. But it doesn't work this way (JedisClusterCommand#runWithRetries checks redirections <= 0 to throw the exception). In addition, maxRedirections doesn't convey that this count is also used as the number of attempts on any failure; a user could assume it has only MOVED-related semantics. Those are my thoughts about this name.
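
A tiny model of the counting semantics described above (illustrative only; the real logic lives in JedisClusterCommand#runWithRetries):

```java
import java.util.List;

// Tiny model of the semantics discussed above: the budget Jedis calls
// maxRedirections is decremented on MOVED redirects AND on connection
// failures alike, so it is really a total attempt budget, and a value of 1
// buys exactly one attempt with zero redirections. Illustrative only.
public class AttemptModel {
    /** events: "MOVED" or "TIMEOUT" per attempt; returns how many attempts
     *  are made before the budget check (budget <= 0) throws. */
    static int attemptsMade(List<String> events, int budget) {
        int attempts = 0;
        for (String event : events) {
            if (budget <= 0) break;  // JedisClusterMaxRedirectionsException here
            attempts++;              // one attempt happened
            budget--;                // same decrement for both outcomes
        }
        return attempts;
    }
}
```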

Contributor
@marcosnils marcosnils Jul 18, 2016

@Spikhalskiy I see your point. Another important thing is that the redirections counter is decremented on both redirections and timeouts, so as you said it's not reflecting the intent. I believe renaming it to maxAttempts and adding proper documentation would be enough.

Can you please rebase and apply the renaming? I'll push it so we can merge this today and release 2.9.

@Spikhalskiy
Contributor Author

@HeartSaVioR @marcosnils any updates here?

@marcosnils
Contributor

@HeartSaVioR @xetorthio LGTM! Please review this whenever you can, as it's the last PR before releasing 2.9.

1. New special exception for “No reachable nodes”
2. Fixed situation when simple connection exception with tryRandomNode=true proceed as just no reachable nodes
3. Don’t try random node with immediate initiate cluster renewal after that if we got ConnectionTimeout from proper slot jedis
4. Fix rewriting real root cause JedisConnectionException with JedisClusterMaxRedirectionsException
@Spikhalskiy
Contributor Author

@marcosnils rebased, renaming (maxRedirections -> maxAttempts) done as a separate second commit in this PR.

@marcosnils
Contributor

@Spikhalskiy thanks a lot.

@marcosnils marcosnils modified the milestones: 2.8.2, 2.9.0 Jul 19, 2016
@marcosnils
Contributor

I'll go ahead and merge this as it went through a lot of discussion and changes seem to be consistent.

@marcosnils marcosnils merged commit 252dd87 into redis:master Jul 19, 2016
marcosnils pushed a commit that referenced this pull request Jul 19, 2016
…ed with rediscovery at the end (#1256)

* Issue #1252:
1. New special exception for “No reachable nodes”
2. Fixed situation when simple connection exception with tryRandomNode=true proceed as just no reachable nodes
3. Don’t try random node with immediate initiate cluster renewal after that if we got ConnectionTimeout from proper slot jedis
4. Fix rewriting real root cause JedisConnectionException with JedisClusterMaxRedirectionsException

Conflicts:
	src/main/java/redis/clients/jedis/BinaryJedisCluster.java
	src/main/java/redis/clients/jedis/JedisCluster.java
marcosnils pushed a commit that referenced this pull request Jul 19, 2016
…ed with rediscovery at the end (#1256)

* Issue #1252:
1. New special exception for “No reachable nodes”
2. Fixed situation when simple connection exception with tryRandomNode=true proceed as just no reachable nodes
3. Don’t try random node with immediate initiate cluster renewal after that if we got ConnectionTimeout from proper slot jedis
4. Fix rewriting real root cause JedisConnectionException with JedisClusterMaxRedirectionsException

Conflicts:
	src/main/java/redis/clients/jedis/BinaryJedisCluster.java
	src/main/java/redis/clients/jedis/JedisCluster.java
@marcosnils
Contributor

Downmerged to 2.8 and 2.9 respectively. Thx for the contribution
