Description
(See previous issue for additional context and discussion)
Expected Behavior
A `ClusterClient` with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.
Example `ClusterClient` config:

```go
redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
})
```
Current Behavior
(The Current Behavior section in the original issue is still accurate to this issue)
Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:
- Amount of load on the system (# Redis commands)
- Events happening on the ECRedis cluster (resharding, cycling out nodes, failovers, etc.)
- Any other issues with the ECRedis cluster not normally visible in the AWS Console (we consulted with AWS support for this one)
- Service restarts for our Go service that communicates with ECRedis
When this issue happens, running `CLIENT LIST` on the affected Redis node shows `age=0` or `age=1` for all connections every time, which reinforces that connections are being dropped and reestablished constantly for some reason. New connections plummet on the other shards in the Redis cluster and are all concentrated on the affected one.
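As an aside, the same `CLIENT LIST` check can be scripted against every shard from go-redis itself; here is a minimal diagnostic sketch, assuming a `*redis.ClusterClient` named `rdb` and a `context.Context` named `ctx` (illustrative scaffolding, not part of our service):

```go
// Dump CLIENT LIST from every master node so the age= fields can be
// compared across shards while the issue is happening.
err := rdb.ForEachMaster(ctx, func(ctx context.Context, shard *redis.Client) error {
	list, err := shard.ClientList(ctx).Result()
	if err != nil {
		return err
	}
	log.Printf("CLIENT LIST for %s:\n%s", shard.Options().Addr, list)
	return nil
})
if err != nil {
	log.Printf("failed to inspect cluster connections: %v", err)
}
```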
Per the discussion on the previous issue, we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.
Possible Solution
The Redis `ClusterClient` should react more gracefully to whatever condition it is that quickly devolves into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.
After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:
- Increasing all timeouts in `context.Context`s passed to Redis commands to 1 second or greater
- Reducing the number of goroutines exercising our Redis claim logic
- Lowering the connection pool size
- Moving all operations out of `MULTI`/`EXEC` pipelines
- Completely disabling idle connection reaping
- Fuzzing the interval at which each individual cluster node's `Client` cleans idle connections
- Changing the instance type of the ElastiCache Redis nodes
- Rebalancing/increasing the shard count for the ElastiCache Redis cluster
We are seeing the issue a lot less often after:
- Reducing the number of goroutines exercising our Redis claim logic
- Lowering the pool size
- Fuzzing the interval at which each individual cluster node's `Client` cleans idle connections

but we are not confident that these changes constitute a permanent fix; the improvement is more likely a consequence of there simply being fewer potential resources with which to overload an individual ECRedis cluster node.
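For concreteness, here is a rough sketch of what the timeout and pool-size mitigations look like in client code (the 1-second deadline, `PoolSize` value, and `redisAddr` are illustrative, not our exact production settings):

```go
rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
	PoolSize:  10, // lowered connection pool size, applied per cluster node
})

// Every command is issued with a context deadline of at least 1 second.
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
if _, err := rdb.SRandMemberN(ctx, "pending", 10).Result(); err != nil {
	log.Printf("SRANDMEMBER failed: %v", err)
}
```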
Details About Fuzzing Idle Connection Cleanup
We opted to set `IdleTimeout` and `IdleCheckFrequency` on the `redis.ClusterOptions` passed to our client to `-1` and added a custom `NewClient` implementation:
```go
// A new redis.Client is created for each Redis cluster node to which a redis.ClusterClient
// sends commands. Make each redis.Client clean idle connections at a different cadence.
redisOpts.NewClient = func(clientOptions *redis.Options) *redis.Client {
	rand.Seed(time.Now().UnixNano())
	idleTimeoutFuzz := time.Duration(rand.Int63n(int64(idleTimeoutFuzzRange)))
	// It's safe to modify this *redis.Options in this way - a new data
	// structure is created and passed to this function for each client created
	clientOptions.IdleTimeout = baseIdleTimeout + idleTimeoutFuzz
	return redis.NewClient(clientOptions)
}
```
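For completeness, the surrounding wiring looks roughly like the sketch below, with the `NewClient` assignment above slotted in between (the structure is a sketch; values other than the two `-1`s are placeholders):

```go
redisOpts := &redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
	// Disable idle-connection handling at the cluster level; each per-node
	// Client gets its own fuzzed IdleTimeout via the NewClient hook above.
	IdleTimeout:        -1,
	IdleCheckFrequency: -1,
}
// redisOpts.NewClient = ... (the fuzzing hook shown above)
client := redis.NewClusterClient(redisOpts)
```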
Resetting the System to a Good State
When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes `SRANDMEMBER pending 10` -> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement over our solution in the previous issue (completely stopping all services and then turning them back on after some period of time), but it's not a tenable way to handle this issue if we are to continue onboarding more work to the system in production.
Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our `NewClusterClient` call at the beginning of this issue, provides a fairly complete summary of how we're using both `go-redis` and ECRedis. We've not been able to consistently trigger this issue, since it often happens when we're not load testing. I'd be interested to know whether the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing it; it sounds like it's been difficult for them as well.
Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:
- `SRANDMEMBER pending 10` - grab up to 10 random items from the pool of available work
- `ZADD in_progress <current_timestamp> <grabbed_item>` for each of the items we got in the previous step
- Any work items we weren't able to `ZADD` have been claimed by some other instance of the service, so skip them
- Once we're done with a work item, `SREM pending <grabbed_item>`
- Periodically `ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago>` so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window
Work item producers simply `SADD pending <work_item>`, but the load they produce is minuscule compared to the consumers running the algorithm outlined above.
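To make the consumer loop concrete, here is a minimal sketch of one iteration in go-redis v8. It assumes the claim step uses `ZADD NX`, so an item already present in `in_progress` counts as claimed by another instance; the `claimOnce`/`reapClaims`/`handle` names are hypothetical, and our production code differs in details such as error handling and metrics:

```go
import (
	"context"
	"strconv"
	"time"

	"github.com/go-redis/redis/v8"
)

// claimOnce runs a single iteration of the consumer algorithm described above.
// handle is whatever non-Redis work we do with a claimed item.
func claimOnce(ctx context.Context, rdb *redis.ClusterClient, handle func(string) error) error {
	// Grab up to 10 random items from the pool of available work.
	items, err := rdb.SRandMemberN(ctx, "pending", 10).Result()
	if err != nil {
		return err
	}

	now := float64(time.Now().Unix())
	for _, item := range items {
		// Try to claim the item. With NX, ZADD returns 0 if the item is
		// already in in_progress, i.e. claimed by another instance.
		added, err := rdb.ZAddNX(ctx, "in_progress", &redis.Z{Score: now, Member: item}).Result()
		if err != nil {
			return err
		}
		if added == 0 {
			continue // claimed elsewhere, skip it
		}

		if err := handle(item); err != nil {
			// Leave it in pending; it becomes claimable again once the
			// in_progress entry ages out and is reaped below.
			continue
		}

		// Done with the work item, so remove it from the pending pool.
		if err := rdb.SRem(ctx, "pending", item).Err(); err != nil {
			return err
		}
	}
	return nil
}

// reapClaims runs periodically so that claimed items aren't claimed forever.
func reapClaims(ctx context.Context, rdb *redis.ClusterClient) error {
	cutoff := time.Now().Add(-5 * time.Second).Unix()
	return rdb.ZRemRangeByScore(ctx, "in_progress", "-inf", strconv.FormatInt(cutoff, 10)).Err()
}
```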
Currently we run this algorithm on 22 EC2 instances, each running one copy of the service. We've configured our `ClusterClient` to use a `PoolSize` of `10` (reduced from the default of `20` on these instances) for each child `Client` it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between invocations of the algorithm.
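Roughly, the per-instance worker layout looks like the sketch below, reusing the hypothetical `claimOnce` from the earlier snippet (the constants mirror the numbers above; `sync`, `log`, and the earlier imports are assumed):

```go
const (
	numWorkers   = 25
	sleepBetween = 10 * time.Millisecond
)

// runWorkers spins up the consumer goroutines on one service instance.
func runWorkers(ctx context.Context, rdb *redis.ClusterClient, handle func(string) error) {
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				if err := claimOnce(ctx, rdb, handle); err != nil {
					log.Printf("claim iteration failed: %v", err)
				}
				time.Sleep(sleepBetween)
			}
		}()
	}
	wg.Wait()
}
```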
At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.
Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using `github.com/go-redis/redis/v8 v8.0.0`, `github.com/go-redis/redis/v8 v8.11.2`, and `github.com/go-redis/redis/v8 v8.11.4` - all 3 have run into this issue. We are now using a fork of `v8.11.4` (see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):
- Add logs in spots where bad connections are removed
- Add hooks in the `ClusterClient.process` method for logging/metrics around various anomalies while processing commands (`MOVED`/`ASK` responses, errors getting nodes, node failures). We do not see any of these while the issue is happening.
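(For anyone wanting similar visibility without carrying a fork: the stock v8 `redis.Hook` interface only observes commands before/after processing, not the internal `MOVED`/`ASK` and node-selection paths our fork instruments, but a minimal error-logging hook along these lines can still help. `logHook` is a hypothetical name.)

```go
type logHook struct{}

func (logHook) BeforeProcess(ctx context.Context, cmd redis.Cmder) (context.Context, error) {
	return ctx, nil
}

func (logHook) AfterProcess(ctx context.Context, cmd redis.Cmder) error {
	if err := cmd.Err(); err != nil && err != redis.Nil {
		log.Printf("redis command %s failed: %v", cmd.Name(), err)
	}
	return nil
}

func (logHook) BeforeProcessPipeline(ctx context.Context, cmds []redis.Cmder) (context.Context, error) {
	return ctx, nil
}

func (logHook) AfterProcessPipeline(ctx context.Context, cmds []redis.Cmder) error {
	for _, cmd := range cmds {
		if err := cmd.Err(); err != nil && err != redis.Nil {
			log.Printf("redis pipeline command %s failed: %v", cmd.Name(), err)
		}
	}
	return nil
}

// Registered once when the client is created:
// client.AddHook(logHook{})
```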
We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of `cache.r6g.large` instances.
Detailed Description
N/A
Possible Implementation
N/A