-
There is also a benefit to doing this that is unrelated to the AWS design: whenever the same hostname is reused to replace a node, we must recycle the connections. So by applying this workaround we would also solve that use case. WDYT? /cc @vmihailenco
-
Thanks for the detailed explanation 👍
That sounds like a reasonable workaround, assuming that ElastiCache uses the hostname in MOVED X to Y errors. But it looks like a good idea to always issue a
Could you send a PR?
-
We have faced some weird issues with AWS ElastiCache. From our understanding, using AWS ElastiCache with Redis Cluster enabled should let us cope with planned events without any disruption, i.e. keep doing writes at any moment during the planned event.
A planned event is a change to your cluster that has been scheduled beforehand, for example changing the instance size of your nodes.
Theoretically, this should have no impact. Far from it: we have identified that during this kind of event the system is impacted, for example when a new master with a different size is added to the cluster. Why?
Every time AWS adds a new node that will eventually be promoted to master, the node first replicates all of the data. Once it has caught up, it is promoted to master while the former master is demoted to a read replica; this former master, now a temporary read replica, is retired from the cluster after a while. During the time it is still alive, it answers any write attempt with the corresponding "MOVED X to Y" redirect.
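For context, the redirect the demoted node keeps returning is the standard Redis Cluster error of the form MOVED <slot> <host:port>. Here is a minimal Go sketch of pulling the target address out of such an error; the parseMovedTarget helper and the hostname are made up for illustration, not go-redis internals:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMovedTarget extracts the redirect target from a Redis Cluster
// "MOVED <slot> <host:port>" error message. Illustrative helper only;
// go-redis has its own internal handling of these errors.
func parseMovedTarget(errMsg string) (addr string, ok bool) {
	parts := strings.Fields(errMsg) // ["MOVED", "<slot>", "<host:port>"]
	if len(parts) != 3 || parts[0] != "MOVED" {
		return "", false
	}
	return parts[2], true
}

func main() {
	// The reply the demoted former master keeps giving to write attempts
	// (hostname made up for the example).
	addr, ok := parseMovedTarget("MOVED 3999 redis-0001-001.example.cache.amazonaws.com:6379")
	fmt.Println(addr, ok)
}
```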
And the problem?
We have identified that AWS uses the same hostname to identify the new master, even though it now points to a new IP address. The problem is that the former master is still presented as healthy, since you can still make requests over the already-opened connections. Hence the issue: the redirects [1] loop back to the same node, because the connection pool of the former master is still in use.
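To make the failure mode concrete, here is a small Go sketch of the mismatch (the hostname is made up and nothing here is go-redis code): a fresh DNS lookup already returns the new master's IP, but connections that are already open, such as the ones sitting in the node's pool, keep talking to the old one.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Illustrative endpoint; ElastiCache keeps this name stable while the
	// underlying IP changes when the replacement master takes over.
	const host = "redis-0001-001.example.cache.amazonaws.com"

	// A connection opened before the switch stays pinned to the old IP
	// for as long as it is kept in the pool.
	conn, err := net.DialTimeout("tcp", host+":6379", 2*time.Second)
	if err == nil {
		fmt.Println("pooled connection still talks to:", conn.RemoteAddr())
		defer conn.Close()
	}

	// A fresh DNS lookup already returns the new master's IP, but nothing
	// forces the pooled connections above to be re-dialed, so the MOVED
	// redirect keeps landing on the node we are already connected to.
	ips, err := net.LookupHost(host)
	fmt.Println("hostname currently resolves to:", ips, err)
}
```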
And the solution?
The best solution would be a change in the design of AWS ElastiCache, but even if that happens it will take some time.
I'm proposing a workaround similar to what was done here [2], where we preventively close the connections when a certain error happens: in this case, when we receive a "MOVED X to Y" error where the reported Y matches the addr of the node that the connection belongs to. This should "address" the problem, imperfectly but in most scenarios and without any side effects: organic traffic should recycle all connections quickly, and the retries and backoff applied to the requests that failed should keep overall operations unaffected. A rough sketch of the idea follows.
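The sketch below assumes the cluster client can tell which node a connection belongs to and can drop that node's pool; the type and method names (clusterNode, recyclePool, onMovedError) are hypothetical stand-ins, not the go-redis API, and the real change would live around the redirect handling in [1]:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNode stands in for a node entry in the cluster client;
// addr is the "host:port" it was dialed with.
type clusterNode struct {
	addr string
}

// recyclePool stands in for closing the node's connection pool so that the
// next dial re-resolves the hostname and reaches the new master's IP.
func (n *clusterNode) recyclePool() {
	fmt.Printf("recycling connections of %s\n", n.addr)
}

// onMovedError applies the workaround: if a MOVED redirect points back at
// the node that produced it, the open connections are stale and must go.
func onMovedError(n *clusterNode, errMsg string) {
	parts := strings.Fields(errMsg) // "MOVED <slot> <host:port>"
	if len(parts) != 3 || parts[0] != "MOVED" {
		return
	}
	if parts[2] == n.addr {
		// Same host:port as the node we just asked: the name now points at
		// the new master, so only the already-open sockets are the problem.
		n.recyclePool()
	}
}

func main() {
	n := &clusterNode{addr: "redis-0001-001.example.cache.amazonaws.com:6379"}
	onMovedError(n, "MOVED 3999 redis-0001-001.example.cache.amazonaws.com:6379")
}
```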
There were other issues like this one [3] and this one [4] that most likely were suffering from the same problem.
WDYT?
[1] https://github.com/go-redis/redis/blob/master/cluster.go#L778
[2] #790
[3] #1633
[4] #917