goroutine leak when backends are down #2147

Closed
@jsha

Description

Expected Behavior

I have an HTTP server where every request triggers a query to a Redis cluster. When the backends are down, I expect each Get call to return with an error.

Current Behavior

Calls to Get block seemingly indefinitely, eventually causing an out-of-memory condition and a crash.

Possible Solution

The blocking appears to be happening because of this call to cmdsInfoCache.Get:

https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/cluster.go#L1606

That method calls internal.Once.Do: https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/command.go#L3431

internal.Once.Do locks a mutex. All other requests that come in while f is running will block on that Mutex:
https://github.com/go-redis/redis/blob/9f5aacde23f23a9c45b59b675680e7d375c27de2/internal/once.go#L50

When the backends are down, f returns an error. That unlocks the Mutex, allowing the next query to try to run f, which also fails. So all the queries stack up waiting on a single Mutex, which they take in turn.
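
For concreteness, the logic of internal.Once.Do boils down to roughly the following (a paraphrase for illustration, not the exact library source):

type Once struct {
	m    sync.Mutex
	done uint32
}

func (t *Once) Do(f func() error) error {
	if atomic.LoadUint32(&t.done) == 1 {
		return nil // fast path: already initialized
	}
	t.m.Lock() // every concurrent caller queues here, with no way out
	defer t.m.Unlock()
	if t.done == 0 {
		if err := f(); err != nil {
			return err // done stays 0, so the next waiter retries f
		}
		atomic.StoreUint32(&t.done, 1)
	}
	return nil
}

With a DialTimeout of one second, each attempt at f takes about a second, while new callers arrive faster than that, so the queue behind the Mutex only ever grows.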

Some options to fix:

  • internal.Once could respect a Context, so the waiters would at least time out rather than blocking indefinitely (see the sketch after this list).
  • cmdsInfoCache could be treated as best-effort: If the cache is not yet available, queries would go ahead and execute without it. This would result in read-only commands only being sent to masters, at least until the cache became available.
  • The func that fills the cache could distinguish between temporary and permanent errors. On permanent errors it would log the error and return nil, which would indicate to the internal.Once that initialization was complete and should not be tried again.
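
To illustrate the first option, here is a minimal sketch of a context-aware Once (a hypothetical API, assuming the usual context and sync/atomic imports; nothing like this exists in the library today):

// CtxOnce is a hypothetical context-aware variant of internal.Once.
// A buffered channel of capacity 1 acts as a cancellable mutex.
type CtxOnce struct {
	sem  chan struct{}
	done uint32
}

func NewCtxOnce() *CtxOnce {
	return &CtxOnce{sem: make(chan struct{}, 1)}
}

func (o *CtxOnce) Do(ctx context.Context, f func() error) error {
	if atomic.LoadUint32(&o.done) == 1 {
		return nil
	}
	select {
	case o.sem <- struct{}{}: // acquired the lock
	case <-ctx.Done():
		return ctx.Err() // deadline hit while waiting: fail fast, no leak
	}
	defer func() { <-o.sem }()
	if atomic.LoadUint32(&o.done) == 1 {
		return nil // another caller initialized while we waited
	}
	if err := f(); err != nil {
		return err // leave done unset so a later caller can retry
	}
	atomic.StoreUint32(&o.done, 1)
	return nil
}

With something like this, each Get in the repro below would return ctx.Err() after 200ms instead of queuing indefinitely.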

Steps to Reproduce

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// Point the client at an unreachable backend (a TEST-NET-3 address and
	// a closed port), so every connection attempt fails.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"203.0.113.91:999",
		},
		DialTimeout:  time.Second,
		ReadTimeout:  time.Second,
		WriteTimeout: time.Second,
	})
	// Issue one Get every 200ms, each with a 200ms deadline. The deadline
	// should bound each call, but the goroutines block inside the client.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		go rdb.Get(ctx, "example")
		time.Sleep(200 * time.Millisecond)
	}
	// If the deadlines were honored, nearly all of these goroutines would
	// have exited by now.
	fmt.Println(runtime.NumGoroutine())
}
$ go run main.go 
101

This demonstrates that all 100 Get goroutines are effectively leaked (the count of 101 includes main). The numbers can be adjusted to produce a leak of any size. In our HTTP server, 4000 requests per second equates to 4000 goroutines leaked per second.

Context (Environment)

go version go1.18 linux/amd64
github.com/go-redis/redis/v8 v8.11.5

This is a simplified version of a problem we encountered when bringing up the OCSP responder for https://github.com/letsencrypt/boulder/ with a set of backends that could not be reached.
