Description
Discussed in #3283
Originally posted by DingYuan0118 February 19, 2025
version: github.com/go-redis/redis/v8 v8.11.5
The Redis cluster client reloads its state every 10 seconds (assuming traffic always exists), as shown below:
```go
func (c *clusterStateHolder) Get(ctx context.Context) (*clusterState, error) {
	v := c.state.Load()
	if v == nil {
		return c.Reload(ctx)
	}
	state := v.(*clusterState)
	if time.Since(state.createdAt) > 10*time.Second {
		c.LazyReload()
	}
	return state, nil
}
```
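For reference, a setup along these lines would exercise that path (a hypothetical sketch; the DNS seed address mirrors the `redis.example.com` node that appears in my logs below):

```go
package main

import (
	"context"

	"github.com/go-redis/redis/v8"
)

func main() {
	// Seed the cluster client with a DNS name rather than node IPs; on
	// each state reload go-redis discovers the real node addresses.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis.example.com:6379"},
	})
	defer rdb.Close()

	// Every command goes through clusterStateHolder.Get, so steady
	// traffic triggers LazyReload once the state is older than 10s.
	_ = rdb.Ping(context.Background()).Err()
}
```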
Each Reload() then schedules a GC for one minute later:
```go
func newClusterState(
	nodes *clusterNodes, slots []ClusterSlot, origin string,
) (*clusterState, error) {
	c := clusterState{
		nodes:      nodes,
		slots:      make([]*clusterSlot, 0, len(slots)),
		generation: nodes.NextGeneration(),
		createdAt:  time.Now(),
	}

	originHost, _, _ := net.SplitHostPort(origin)
	isLoopbackOrigin := isLoopback(originHost)

	for _, slot := range slots {
		// ...
	}

	sort.Sort(clusterSlotSlice(c.slots))

	time.AfterFunc(time.Minute, func() {
		nodes.GC(c.generation)
	})

	return &c, nil
}
```
When the GC finally runs, the generation it uses was captured one minute earlier. This creates a gap between clusterNode.generation and the GC generation, so activeAddrs is not updated in time. I added some logging to verify this:
```go
// GC removes unused nodes.
func (c *clusterNodes) GC(generation uint32) {
	//nolint:prealloc
	var collected []*clusterNode

	c.mu.Lock()
	fmt.Printf("time %s: GC, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), generation)      // logs
	fmt.Printf("time %s: GC, current nodes: %d\n", time.Now().Format("2006-01-02 15:04:05"), len(c.nodes)) // logs
	c.activeAddrs = c.activeAddrs[:0]
	for addr, node := range c.nodes {
		fmt.Printf("time %s: GC, node addr: %s, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), addr, node.Generation()) // logs
		if node.Generation() >= generation {
			c.activeAddrs = append(c.activeAddrs, addr)
			if c.opt.RouteByLatency {
				go node.updateLatency()
			}
			continue
		}

		delete(c.nodes, addr)
		collected = append(collected, node)
	}
	fmt.Printf("time %s: GC, current activeAddrs: %v\n", time.Now().Format("2006-01-02 15:04:05"), c.activeAddrs) // logs
	c.mu.Unlock()

	for _, node := range collected {
		_ = node.Client.Close()
	}
}
```
The output looks like this:
```text
time 2025-02-19 09:01:44: GC, generation: 1
time 2025-02-19 09:01:44: GC, current nodes: 7
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.12:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.16:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.11:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.15:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.13:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.14:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: redis.example.com:6379, generation: 0
time 2025-02-19 09:01:44: GC, current activeAddrs: [172.19.0.12:6379 172.19.0.16:6379 172.19.0.11:6379 172.19.0.15:6379 172.19.0.13:6379 172.19.0.14:6379]
```
We can see that the nodes' generation (6) is greater than the GC generation (1).
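To make the timing concrete, here is a minimal, self-contained sketch (my own illustration, not go-redis code; seconds stand in for the real 10-second reload interval and the one-minute GC delay):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var generation uint32

	// First reload: stamp the state with generation 1 and, as in
	// newClusterState, schedule GC with that captured value for later.
	gcGen := atomic.AddUint32(&generation, 1)
	done := make(chan struct{})
	time.AfterFunc(6*time.Second, func() { // 6s stands in for one minute
		fmt.Printf("GC runs with generation %d, but live nodes are at %d\n",
			gcGen, atomic.LoadUint32(&generation))
		close(done)
	})

	// Subsequent reloads (here every second, standing in for every 10s)
	// keep advancing the generation that live nodes are stamped with.
	for i := 0; i < 6; i++ {
		time.Sleep(time.Second)
		atomic.AddUint32(&generation, 1)
	}
	<-done
}
```

This matches the log above: by the time GC(1) fires, every reachable node has already been re-stamped with a much newer generation.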
If I want to switch Redis clusters by updating the DNS record and shutting down the old cluster, it takes one minute for activeAddrs to become the new cluster's IPs. If the old cluster reboots within that minute, connections can drift between the two clusters.
I'd like to check: is this expected behavior, or is the one-minute delay intended to prevent some other problem?