
Is generation in clusterNodes.GC() too old? #3284

Closed
@DingYuan0118

Description


Discussed in #3283

Originally posted by DingYuan0118 February 19, 2025

version: github.com/go-redis/redis/v8 v8.11.5

Since the Redis cluster client reloads its state every 10 seconds (assuming traffic always exists), like below

func (c *clusterStateHolder) Get(ctx context.Context) (*clusterState, error) {
	v := c.state.Load()
	if v == nil {
		return c.Reload(ctx)
	}

	state := v.(*clusterState)
	if time.Since(state.createdAt) > 10*time.Second {
		c.LazyReload()
	}
	return state, nil
}

and runs GC one minute after the Reload(), like this:

func newClusterState(
	nodes *clusterNodes, slots []ClusterSlot, origin string,
) (*clusterState, error) {
	c := clusterState{
		nodes: nodes,

		slots: make([]*clusterSlot, 0, len(slots)),

		generation: nodes.NextGeneration(),
		createdAt:  time.Now(),
	}

	originHost, _, _ := net.SplitHostPort(origin)
	isLoopbackOrigin := isLoopback(originHost)

	for _, slot := range slots {
		// ...
	}
	sort.Sort(clusterSlotSlice(c.slots))

	time.AfterFunc(time.Minute, func() {
		nodes.GC(c.generation)
	})

	return &c, nil
}

When the GC actually runs, the generation it uses is from one minute ago. This creates a gap between clusterNode.generation and the GC generation, so activeAddrs is not updated in time. I added some logging to verify this:

// GC removes unused nodes.
func (c *clusterNodes) GC(generation uint32) {
	//nolint:prealloc
	var collected []*clusterNode

	c.mu.Lock()
	fmt.Printf("time %s: GC, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), generation)  // logs
	fmt.Printf("time %s: GC, current nodes: %d\n", time.Now().Format("2006-01-02 15:04:05"), len(c.nodes)) // logs

	c.activeAddrs = c.activeAddrs[:0]
	for addr, node := range c.nodes {
		fmt.Printf("time %s: GC, node addr: %s, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), addr, node.Generation())
		if node.Generation() >= generation {
			c.activeAddrs = append(c.activeAddrs, addr)
			if c.opt.RouteByLatency {
				go node.updateLatency()
			}
			continue
		}

		delete(c.nodes, addr)
		collected = append(collected, node)
	}
	fmt.Printf("time %s: GC, current activeAddrs: %v\n", time.Now().Format("2006-01-02 15:04:05"), c.activeAddrs) // logs
	c.mu.Unlock()

	for _, node := range collected {
		_ = node.Client.Close()
	}
}

The output looks like this:

time 2025-02-19 09:01:44: GC, generation: 1
time 2025-02-19 09:01:44: GC, current nodes: 7
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.12:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.16:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.11:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.15:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.13:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.14:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: redis.example.com:6379, generation: 0
time 2025-02-19 09:01:44: GC, current activeAddrs: [172.19.0.12:6379 172.19.0.16:6379 172.19.0.11:6379 172.19.0.15:6379 172.19.0.13:6379 172.19.0.14:6379]

We can see that the nodes' generation (6) is greater than the GC generation (1).
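
Concretely, with steady traffic the state is reloaded about every 10 seconds, so roughly six new generations are minted during the one-minute time.AfterFunc delay. A minimal, self-contained sketch (my own illustration, not go-redis code) that compresses the clock to milliseconds reproduces the lag:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var generation uint32 // stands in for clusterNodes.generation

func nextGeneration() uint32 { return atomic.AddUint32(&generation, 1) }

func main() {
	second := time.Millisecond // compressed clock: 1 "second" = 1ms

	// Simulate 7 reloads, 10 "seconds" apart; each schedules GC one
	// "minute" later with the generation it captured at reload time.
	for i := 0; i < 7; i++ {
		gen := nextGeneration() // what newClusterState captures
		time.AfterFunc(65*second, func() {
			// By the time GC fires, the live nodes are several
			// generations ahead of the value it was handed.
			fmt.Printf("GC(generation=%d), nodes are at generation %d\n",
				gen, atomic.LoadUint32(&generation))
		})
		time.Sleep(10 * second)
	}
	time.Sleep(80 * second) // let all scheduled GCs fire
}

The first line printed is "GC(generation=1), nodes are at generation 7", which matches the 1-vs-6 gap in the logs above.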

If I want to switch Redis clusters by updating the DNS record and shutting down the old cluster, it takes one minute for activeAddrs to become the new cluster's IPs. If the old cluster reboots within that minute, real connections can drift between the two clusters.
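
As an aside, a hedged mitigation sketch (my assumption, not something proposed in this issue): after switching the DNS record, one can proactively trigger a reload via ClusterClient.ReloadState(ctx) in v8. Note that this only enters the same lazy-reload path shown above, so GC of the old nodes is still deferred by a minute and the drift window is not shortened:

package main

import (
	"context"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		// redis.example.com is the DNS name from the logs above.
		Addrs: []string{"redis.example.com:6379"},
	})
	defer rdb.Close()

	// ... switch the DNS record to the new cluster, then:
	rdb.ReloadState(context.Background()) // triggers a lazy reload; GC still lags one minute
}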

I want to check: is this expected behavior, or is it meant to prevent some problem?
