goroutine leak when backends are down #2147

Closed
@jsha

Description

Expected Behavior

I have an HTTP server where every request triggers a query to a Redis cluster. When the backends are down, I expect each Get call to return with an error.

Current Behavior

Calls to Get block seemingly indefinitely, eventually causing an out-of-memory condition and a crash.

Possible Solution

The blocking appears to be happening because of this call to cmdsInfoCache.Get:

https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1606

That method calls internal.Once.Do: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/command.go#L3431

internal.Once.Do locks a mutex. All other requests that come in while f is running will block on that Mutex:
https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/internal/once.go#L50

When the backends are down, f returns an error. That unlocks the Mutex, allowing the next query to try to run f, which also fails. So all the queries stack up waiting on a single Mutex, which they take in turn.
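
For concreteness, the logic of internal.Once.Do boils down to roughly the following (a paraphrase for illustration, not the exact library source):

type Once struct {
	m    sync.Mutex
	done uint32
}

func (t *Once) Do(f func() error) error {
	if atomic.LoadUint32(&t.done) == 1 {
		return nil // fast path: already initialized
	}
	t.m.Lock() // every concurrent caller queues here, with no way out
	defer t.m.Unlock()
	if t.done == 0 {
		if err := f(); err != nil {
			return err // done stays 0, so the next waiter retries f
		}
		atomic.StoreUint32(&t.done, 1)
	}
	return nil
}

With a DialTimeout of one second, each attempt at f takes about a second, while new callers arrive faster than that, so the queue behind the Mutex only ever grows.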

Some options to fix:

  • internal.Once could respect a Context, so the waiters would at least time out rather than blocking indefinitely (see the sketch after this list).
  • cmdsInfoCache could be treated as best-effort: If the cache is not yet available, queries would go ahead and execute without it. This would result in read-only commands only being sent to masters, at least until the cache became available.
  • The func that fills the cache could distinguish between temporary and permanent errors. On permanent errors it would log the error and return nil, which would indicate to the internal.Once that initialization was complete and should not be tried again.
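
To illustrate the first option, here is a minimal sketch of a context-aware Once (a hypothetical API, assuming the usual context and sync/atomic imports; nothing like this exists in the library today):

// CtxOnce is a hypothetical context-aware variant of internal.Once.
// A buffered channel of capacity 1 acts as a cancellable mutex.
type CtxOnce struct {
	sem  chan struct{}
	done uint32
}

func NewCtxOnce() *CtxOnce {
	return &CtxOnce{sem: make(chan struct{}, 1)}
}

func (o *CtxOnce) Do(ctx context.Context, f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil
	}
	select {
	case o.sem <- struct{}{}: // acquired the lock
	case <-ctx.Done():
		return ctx.Err() // deadline hit while waiting: fail fast, no leak
	}
	defer func() { <-o.sem }()
	if atomic.LoadUint32(&o.done) == 1 {
		return nil // another caller initialized while we waited
	}
	if err := f(); err != nil {
		return err // leave done unset so a later caller can retry
	}
	atomic.StoreUint32(&o.done, 1)
	return nil
}

With something like this, each Get in the repro below would return ctx.Err() after 200ms instead of queuing indefinitely.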

Steps to Reproduce

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// Point the client at an unreachable backend (a TEST-NET-3 address and
	// a closed port), so every connection attempt fails.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"203.0.113.91:999",
		},
		DialTimeout:  time.Second,
		ReadTimeout:  time.Second,
		WriteTimeout: time.Second,
	})
	// Issue one Get every 200ms, each with a 200ms deadline. The deadline
	// should bound each call, but the goroutines block inside the client.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		go rdb.Get(ctx, "example")
		time.Sleep(200 * time.Millisecond)
	}
	// If the deadlines were honored, nearly all of these goroutines would
	// have exited by now.
	fmt.Println(runtime.NumGoroutine())
}
$ go run main.go 
101

This demonstrates that all 100 Get goroutines are effectively leaked (the count of 101 includes main). The numbers can be adjusted to produce a leak of any size. In our HTTP server, 4000 requests per second equates to 4000 goroutines leaked per second.

Context (Environment)

go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5

This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.
