Description
(See previous issue for additional context and discussion)
Expected Behavior
A `ClusterClient` with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.
Example `ClusterClient` config:

```go
redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
})
```
Current Behavior
(The Current Behavior section in the original issue is still accurate to this issue)
Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:
- Amount of load on the system (# Redis commands)
- Events happening on the ECRedis cluster (resharding, cycling out nodes, failovers, etc.)
- Any other issues with the ECRedis cluster not normally visible in the AWS Console (we consulted with AWS support for this one)
- Service restarts for our Go service that communicates with ECRedis
When this issue happens, running `CLIENT LIST` on the affected Redis node shows `age=0` or `age=1` for all connections every time, which reinforces that connections are being dropped and reestablished constantly for some reason. New connections plummet on the other shards in the Redis cluster and are all concentrated on the affected one.
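As an aside, the same `CLIENT LIST` check can be scripted against every shard from go-redis itself; here is a minimal diagnostic sketch, assuming a `*redis.ClusterClient` named `rdb` and a `context.Context` named `ctx` (illustrative scaffolding, not part of our service):

```go
// Dump CLIENT LIST from every master node so the age= fields can be
// compared across shards while the issue is happening.
err := rdb.ForEachMaster(ctx, func(ctx context.Context, shard *redis.Client) error {
	list, err := shard.ClientList(ctx).Result()
	if err != nil {
		return err
	}
	log.Printf("CLIENT LIST for %s:\n%s", shard.Options().Addr, list)
	return nil
})
if err != nil {
	log.Printf("failed to inspect cluster connections: %v", err)
}
```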
Per the discussion on the previous issue, we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.
Possible Solution
The Redis `ClusterClient` should react more gracefully to whatever condition it is that quickly devolves into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.
After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:
- Increasing all timeouts in `context.Context`s passed to Redis commands to 1 second or greater
- Reducing the number of goroutines exercising our Redis claim logic
- Lowering the connection pool size
- Moving all operations out of `MULTI`/`EXEC` pipelines
- Completely disabling idle connection reaping
- Fuzzing the interval at which each individual cluster node's `Client` cleans idle connections
- Changing the instance type of the ElastiCache Redis nodes
- Rebalancing/increasing the shard count for the ElastiCache Redis cluster
We are seeing the issue a lot less often after:
- Reducing the number of goroutines exercising our Redis claim logic
- Lowering the pool size
- Fuzzing the interval at which each individual cluster node's `Client` cleans idle connections

but we are not confident that these changes constitute a permanent fix; the improvement is more likely a consequence of there simply being fewer potential resources with which to overload an individual ECRedis cluster node.
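For concreteness, here is a rough sketch of what the timeout and pool-size mitigations look like in client code (the 1-second deadline, `PoolSize` value, and `redisAddr` are illustrative, not our exact production settings):

```go
rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
	PoolSize:  10, // lowered connection pool size, applied per cluster node
})

// Every command is issued with a context deadline of at least 1 second.
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
if _, err := rdb.SRandMemberN(ctx, "pending", 10).Result(); err != nil {
	log.Printf("SRANDMEMBER failed: %v", err)
}
```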
Details About Fuzzing Idle Connection Cleanup
We opted to set `IdleTimeout` and `IdleCheckFrequency` on the `redis.ClusterOptions` passed to our client to `-1` and added a custom `NewClient` implementation:
```go
// A new redis.Client is created for each Redis cluster node to which a redis.ClusterClient
// sends commands. Make each redis.Client clean idle connections at a different cadence.
redisOpts.NewClient = func(clientOptions *redis.Options) *redis.Client {
	rand.Seed(time.Now().UnixNano())
	idleTimeoutFuzz := time.Duration(rand.Int63n(int64(idleTimeoutFuzzRange)))
	// It's safe to modify this *redis.Options in this way - a new data
	// structure is created and passed to this function for each client created
	clientOptions.IdleTimeout = baseIdleTimeout + idleTimeoutFuzz
	return redis.NewClient(clientOptions)
}
```
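For completeness, the surrounding wiring looks roughly like the sketch below, with the `NewClient` assignment above slotted in between (the structure is a sketch; values other than the two `-1`s are placeholders):

```go
redisOpts := &redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},
	// Disable idle-connection handling at the cluster level; each per-node
	// Client gets its own fuzzed IdleTimeout via the NewClient hook above.
	IdleTimeout:        -1,
	IdleCheckFrequency: -1,
}
// redisOpts.NewClient = ... (the fuzzing hook shown above)
client := redis.NewClusterClient(redisOpts)
```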
Resetting the System to a Good State
When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes `SRANDMEMBER pending 10` -> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement over our solution in the previous issue (completely stopping all services and then turning them back on after some period of time), but it's not a tenable way to handle this issue if we are to continue onboarding more work to the system in production.
Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our `NewClusterClient` call at the beginning of this issue, provides a fairly complete summary of how we're using both `go-redis` and ECRedis. We've not been able to consistently trigger this issue, since it often happens when we're not load testing. I'd be interested to know whether the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing it; it sounds like it's been difficult for them as well.
Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:
- `SRANDMEMBER pending 10` - grab up to 10 random items from the pool of available work
- `ZADD in_progress <current_timestamp> <grabbed_item>` for each of the items we got in the previous step
- Any work items we weren't able to `ZADD` have been claimed by some other instance of the service, so skip them
- Once we're done with a work item, `SREM pending <grabbed_item>`
- Periodically `ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago>` so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window
Work item producers simply `SADD pending <work_item>`, but the load they produce is minuscule compared to the consumers running the algorithm outlined above.
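To make the consumer loop concrete, here is a minimal sketch of one iteration in go-redis v8. It assumes the claim step uses `ZADD NX`, so an item already present in `in_progress` counts as claimed by another instance; the `claimOnce`/`reapClaims`/`handle` names are hypothetical, and our production code differs in details such as error handling and metrics:

```go
import (
	"context"
	"strconv"
	"time"

	"github.com/go-redis/redis/v8"
)

// claimOnce runs a single iteration of the consumer algorithm described above.
// handle is whatever non-Redis work we do with a claimed item.
func claimOnce(ctx context.Context, rdb *redis.ClusterClient, handle func(string) error) error {
	// Grab up to 10 random items from the pool of available work.
	items, err := rdb.SRandMemberN(ctx, "pending", 10).Result()
	if err != nil {
		return err
	}

	now := float64(time.Now().Unix())
	for _, item := range items {
		// Try to claim the item. With NX, ZADD returns 0 if the item is
		// already in in_progress, i.e. claimed by another instance.
		added, err := rdb.ZAddNX(ctx, "in_progress", &redis.Z{Score: now, Member: item}).Result()
		if err != nil {
			return err
		}
		if added == 0 {
			continue // claimed elsewhere, skip it
		}

		if err := handle(item); err != nil {
			// Leave it in pending; it becomes claimable again once the
			// in_progress entry ages out and is reaped below.
			continue
		}

		// Done with the work item, so remove it from the pending pool.
		if err := rdb.SRem(ctx, "pending", item).Err(); err != nil {
			return err
		}
	}
	return nil
}

// reapClaims runs periodically so that claimed items aren't claimed forever.
func reapClaims(ctx context.Context, rdb *redis.ClusterClient) error {
	cutoff := time.Now().Add(-5 * time.Second).Unix()
	return rdb.ZRemRangeByScore(ctx, "in_progress", "-inf", strconv.FormatInt(cutoff, 10)).Err()
}
```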
Currently we run this algorithm on 22 EC2 instances, each running one copy of the service. We've configured our `ClusterClient` to use a `PoolSize` of `10` (reduced from the default of `20` on these instances) for each child `Client` it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between invocations of the algorithm.
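Roughly, the per-instance worker layout looks like the sketch below, reusing the hypothetical `claimOnce` from the earlier snippet (the constants mirror the numbers above; `sync`, `log`, and the earlier imports are assumed):

```go
const (
	numWorkers   = 25
	sleepBetween = 10 * time.Millisecond
)

// runWorkers spins up the consumer goroutines on one service instance.
func runWorkers(ctx context.Context, rdb *redis.ClusterClient, handle func(string) error) {
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				if err := claimOnce(ctx, rdb, handle); err != nil {
					log.Printf("claim iteration failed: %v", err)
				}
				time.Sleep(sleepBetween)
			}
		}()
	}
	wg.Wait()
}
```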
At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.
Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using `github.com/go-redis/redis/v8 v8.0.0`, `github.com/go-redis/redis/v8 v8.11.2`, and `github.com/go-redis/redis/v8 v8.11.4` - all 3 have run into this issue. We are now using a fork of `v8.11.4` (see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):
- Add logs in spots where bad connections are removed
- Add hooks in the `ClusterClient.process` method for logging/metrics around various anomalies while processing commands (`MOVED`/`ASK` responses, errors getting nodes, node failures). We do not see any of these while the issue is happening.
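(For anyone wanting similar visibility without carrying a fork: the stock v8 `redis.Hook` interface only observes commands before/after processing, not the internal `MOVED`/`ASK` and node-selection paths our fork instruments, but a minimal error-logging hook along these lines can still help. `logHook` is a hypothetical name.)

```go
type logHook struct{}

func (logHook) BeforeProcess(ctx context.Context, cmd redis.Cmder) (context.Context, error) {
	return ctx, nil
}

func (logHook) AfterProcess(ctx context.Context, cmd redis.Cmder) error {
	if err := cmd.Err(); err != nil && err != redis.Nil {
		log.Printf("redis command %s failed: %v", cmd.Name(), err)
	}
	return nil
}

func (logHook) BeforeProcessPipeline(ctx context.Context, cmds []redis.Cmder) (context.Context, error) {
	return ctx, nil
}

func (logHook) AfterProcessPipeline(ctx context.Context, cmds []redis.Cmder) error {
	for _, cmd := range cmds {
		if err := cmd.Err(); err != nil && err != redis.Nil {
			log.Printf("redis pipeline command %s failed: %v", cmd.Name(), err)
		}
	}
	return nil
}

// Registered once when the client is created:
// client.AddHook(logHook{})
```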
We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of `cache.r6g.large` instances.
Detailed Description
N/A
Possible Implementation
N/A