-
There is also a benefit to doing this that is unrelated to the AWS design: whenever the same hostname is reused to replace a node, we must recycle the connections. So by applying this workaround we would also solve that use case. WDYT? /cc @vmihailenco
-
Thanks for the detailed explanation 👍
That sounds like a reasonable workaround, assuming that ElastiCache uses the hostname in MOVED X to Y errors. But it looks like a good idea to always issue a
Could you send a PR?
-
We have faced some weird issues with AWS ElastiCache. From our understanding, using AWS ElastiCache with Redis Cluster enabled should let us cope with planned events without any disruption, i.e. keep doing writes at any moment during the planned event.
A planned event is a change to your cluster that has been scheduled beforehand, for example changing the instance size of your nodes.
Theoretically, this should have no impact. Far from it: we have identified that during this kind of event the system is impacted, for example when a new master with a different size is added to the cluster. Why?
Every time AWS adds a new node that will eventually be promoted to master, the node first replicates all of the data. Once it has caught up, it is promoted to master while the former master is demoted to a read replica; this former master, now a temporary read replica, is retired from the cluster after a while. During the time it is still alive, it answers any write attempt with the corresponding "MOVED X to Y" redirect.
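For context, the redirect the demoted node keeps returning is the standard Redis Cluster error of the form MOVED <slot> <host:port>. Here is a minimal Go sketch of pulling the target address out of such an error; the parseMovedTarget helper and the hostname are made up for illustration, not go-redis internals:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMovedTarget extracts the redirect target from a Redis Cluster
// "MOVED <slot> <host:port>" error message. Illustrative helper only;
// go-redis has its own internal handling of these errors.
func parseMovedTarget(errMsg string) (addr string, ok bool) {
	parts := strings.Fields(errMsg) // ["MOVED", "<slot>", "<host:port>"]
	if len(parts) != 3 || parts[0] != "MOVED" {
		return "", false
	}
	return parts[2], true
}

func main() {
	// The reply the demoted former master keeps giving to write attempts
	// (hostname made up for the example).
	addr, ok := parseMovedTarget("MOVED 3999 redis-0001-001.example.cache.amazonaws.com:6379")
	fmt.Println(addr, ok)
}
```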
And the problem?
We have identified that AWS uses the same hostname to identify the new master, even though it now points to a new IP address. The problem is that the former master is still presented as healthy, since you can still make requests over the already-opened connections. Hence the issue: the redirects [1] loop back to the same node, because the connection pool of the former master is still in use.
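To make the failure mode concrete, here is a small Go sketch of the mismatch (the hostname is made up and nothing here is go-redis code): a fresh DNS lookup already returns the new master's IP, but connections that are already open, such as the ones sitting in the node's pool, keep talking to the old one.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Illustrative endpoint; ElastiCache keeps this name stable while the
	// underlying IP changes when the replacement master takes over.
	const host = "redis-0001-001.example.cache.amazonaws.com"

	// A connection opened before the switch stays pinned to the old IP
	// for as long as it is kept in the pool.
	conn, err := net.DialTimeout("tcp", host+":6379", 2*time.Second)
	if err == nil {
		fmt.Println("pooled connection still talks to:", conn.RemoteAddr())
		defer conn.Close()
	}

	// A fresh DNS lookup already returns the new master's IP, but nothing
	// forces the pooled connections above to be re-dialed, so the MOVED
	// redirect keeps landing on the node we are already connected to.
	ips, err := net.LookupHost(host)
	fmt.Println("hostname currently resolves to:", ips, err)
}
```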
And the solution?
The best solution would be a change in the design of AWS ElastiCache, but even if that happens it will take some time.
I'm proposing a workaround similar to what was done here [2], where we preventively close the connections when a certain error happens: in this case, when we receive a "MOVED X to Y" error where the reported Y matches the addr of the node that the connection belongs to. This should "address" the problem, imperfectly but in most scenarios and without any side effects: organic traffic should recycle all connections quickly, and the retries and backoff applied to the requests that failed should keep overall operations unaffected. A rough sketch of the idea follows.
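The sketch below assumes the cluster client can tell which node a connection belongs to and can drop that node's pool; the type and method names (clusterNode, recyclePool, onMovedError) are hypothetical stand-ins, not the go-redis API, and the real change would live around the redirect handling in [1]:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode stands in for a node entry in the cluster client;
// addr is the "host:port" it was dialed with.
type clusterNode struct {
	addr string
}

// recyclePool stands in for closing the node's connection pool so that the
// next dial re-resolves the hostname and reaches the new master's IP.
func (n *clusterNode) recyclePool() {
	fmt.Printf("recycling connections of %s\n", n.addr)
}

// onMovedError applies the workaround: if a MOVED redirect points back at
// the node that produced it, the open connections are stale and must go.
func onMovedError(n *clusterNode, errMsg string) {
	parts := strings.Fields(errMsg) // "MOVED <slot> <host:port>"
	if len(parts) != 3 || parts[0] != "MOVED" {
		return
	}
	if parts[2] == n.addr {
		// Same host:port as the node we just asked: the name now points at
		// the new master, so only the already-open sockets are the problem.
		n.recyclePool()
	}
}

func main() {
	n := &clusterNode{addr: "redis-0001-001.example.cache.amazonaws.com:6379"}
	onMovedError(n, "MOVED 3999 redis-0001-001.example.cache.amazonaws.com:6379")
}
```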
There were other issues like this one [3] and this one [4] that most likely were suffering from the same problem.
WDYT?
[1] https://github.com/go-redis/redis/blob/master/cluster.go#L778
[2] #790
[3] #1633
[4] #917