Commit ed62246

Allow setting ring heartbeat timeout to zero to disable timeout check. (cortexproject#4342)
* Allow setting ring heartbeat timeout to zero to disable timeout check.

  This change allows the various ring heartbeat timeouts to be configured to zero as a means of disabling the timeout. It is expected to be used together with a separate enhancement that allows disabling heartbeats. When the heartbeat timeout is disabled, instances always appear as healthy in the ring.

  Signed-off-by: Steve Simpson <steve.simpson@grafana.com>

* Review comments.

  Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
1 parent f8b08a3 · commit ed62246
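
In practical terms, disabling the timeout check is purely a configuration change: set the relevant `heartbeat_timeout` (or the matching CLI flag) to zero. Below is a hypothetical YAML sketch for the distributor and ingester rings, not part of this commit; the field nesting follows the config-file reference shown further down, and the `0s` value and component layout should be adapted to your own deployment.

```yaml
# Hypothetical sketch: disable the ring heartbeat timeout check (0 = never).
distributor:
  ring:
    heartbeat_timeout: 0s   # equivalent to -distributor.ring.heartbeat-timeout=0

ingester:
  lifecycler:
    ring:
      heartbeat_timeout: 0s # equivalent to -ring.heartbeat-timeout=0
```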

File tree: 14 files changed (+70, -19 lines)

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -5,6 +5,13 @@
 * [CHANGE] Querier / ruler: Change `-querier.max-fetched-chunks-per-query` configuration to limit to maximum number of chunks that can be fetched in a single query. The number of chunks fetched by ingesters AND long-term storare combined should not exceed the value configured on `-querier.max-fetched-chunks-per-query`. #4260
 * [ENHANCEMENT] Add timeout for waiting on compactor to become ACTIVE in the ring. #4262
 * [ENHANCEMENT] Reduce memory used by streaming queries, particularly in ruler. #4341
+* [ENHANCEMENT] Ring: allow experimental configuration of disabling of heartbeat timeouts by setting the relevant configuration value to zero. Applies to the following: #4342
+  * `-distributor.ring.heartbeat-timeout`
+  * `-ring.heartbeat-timeout`
+  * `-ruler.ring.heartbeat-timeout`
+  * `-alertmanager.sharding-ring.heartbeat-timeout`
+  * `-compactor.ring.heartbeat-timeout`
+  * `-store-gateway.sharding-ring.heartbeat-timeout`
 * [BUGFIX] HA Tracker: when cleaning up obsolete elected replicas from KV store, tracker didn't update number of cluster per user correctly. #4336
 
 ## 1.10.0-rc.0 / 2021-06-28
```

docs/blocks-storage/compactor.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -214,7 +214,7 @@ compactor:
   [heartbeat_period: <duration> | default = 5s]
 
   # The heartbeat timeout after which compactors are considered unhealthy
-  # within the ring.
+  # within the ring. 0 = never (timeout disabled).
   # CLI flag: -compactor.ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
```

docs/blocks-storage/store-gateway.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -237,8 +237,8 @@ store_gateway:
   [heartbeat_period: <duration> | default = 15s]
 
   # The heartbeat timeout after which store gateways are considered unhealthy
-  # within the ring. This option needs be set both on the store-gateway and
-  # querier when running in microservices mode.
+  # within the ring. 0 = never (timeout disabled). This option needs be set
+  # both on the store-gateway and querier when running in microservices mode.
   # CLI flag: -store-gateway.sharding-ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
```
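
Since this particular timeout is evaluated by both the store-gateway and the querier in microservices mode, disabling it means setting the same value on both components. A hypothetical YAML sketch follows (not from the commit); the nesting mirrors the `store_gateway.sharding_ring` block above, and the querier would be given the equivalent `-store-gateway.sharding-ring.heartbeat-timeout=0` flag.

```yaml
# Hypothetical sketch: store-gateway ring heartbeat timeout disabled.
# In microservices mode the querier must be configured with the same value.
store_gateway:
  sharding_ring:
    heartbeat_timeout: 0s  # 0 = never (timeout disabled)
```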

docs/configuration/config-file-reference.md

Lines changed: 7 additions & 6 deletions
```diff
@@ -568,7 +568,7 @@ ring:
   [heartbeat_period: <duration> | default = 5s]
 
   # The heartbeat timeout after which distributors are considered unhealthy
-  # within the ring.
+  # within the ring. 0 = never (timeout disabled).
   # CLI flag: -distributor.ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
 
@@ -662,6 +662,7 @@ lifecycler:
   [mirror_timeout: <duration> | default = 2s]
 
   # The heartbeat timeout after which ingesters are skipped for reads/writes.
+  # 0 = never (timeout disabled).
   # CLI flag: -ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
 
@@ -1585,7 +1586,7 @@ ring:
   [heartbeat_period: <duration> | default = 5s]
 
   # The heartbeat timeout after which rulers are considered unhealthy within the
-  # ring.
+  # ring. 0 = never (timeout disabled).
   # CLI flag: -ruler.ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
 
@@ -1906,7 +1907,7 @@ sharding_ring:
   [heartbeat_period: <duration> | default = 15s]
 
   # The heartbeat timeout after which alertmanagers are considered unhealthy
-  # within the ring.
+  # within the ring. 0 = never (timeout disabled).
   # CLI flag: -alertmanager.sharding-ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
 
@@ -5178,7 +5179,7 @@ sharding_ring:
   [heartbeat_period: <duration> | default = 5s]
 
   # The heartbeat timeout after which compactors are considered unhealthy within
-  # the ring.
+  # the ring. 0 = never (timeout disabled).
   # CLI flag: -compactor.ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
 
@@ -5256,8 +5257,8 @@ sharding_ring:
   [heartbeat_period: <duration> | default = 15s]
 
   # The heartbeat timeout after which store gateways are considered unhealthy
-  # within the ring. This option needs be set both on the store-gateway and
-  # querier when running in microservices mode.
+  # within the ring. 0 = never (timeout disabled). This option needs be set both
+  # on the store-gateway and querier when running in microservices mode.
   # CLI flag: -store-gateway.sharding-ring.heartbeat-timeout
   [heartbeat_timeout: <duration> | default = 1m]
```

docs/configuration/v1-guarantees.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -81,3 +81,10 @@ Currently experimental features are:
 - user config size (`-alertmanager.max-config-size-bytes`)
 - templates count in user config (`-alertmanager.max-templates-count`)
 - max template size (`-alertmanager.max-template-size-bytes`)
+- Disabling ring heartbeat timeouts
+  - `-distributor.ring.heartbeat-timeout=0`
+  - `-ring.heartbeat-timeout=0`
+  - `-ruler.ring.heartbeat-timeout=0`
+  - `-alertmanager.sharding-ring.heartbeat-timeout=0`
+  - `-compactor.ring.heartbeat-timeout=0`
+  - `-store-gateway.sharding-ring.heartbeat-timeout=0`
```

pkg/alertmanager/alertmanager_ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -77,7 +77,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
     // Ring flags
     cfg.KVStore.RegisterFlagsWithPrefix(rfprefix, "alertmanagers/", f)
     f.DurationVar(&cfg.HeartbeatPeriod, rfprefix+"heartbeat-period", 15*time.Second, "Period at which to heartbeat to the ring.")
-    f.DurationVar(&cfg.HeartbeatTimeout, rfprefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which alertmanagers are considered unhealthy within the ring.")
+    f.DurationVar(&cfg.HeartbeatTimeout, rfprefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which alertmanagers are considered unhealthy within the ring. 0 = never (timeout disabled).")
     f.IntVar(&cfg.ReplicationFactor, rfprefix+"replication-factor", 3, "The replication factor to use when sharding the alertmanager.")
     f.BoolVar(&cfg.ZoneAwarenessEnabled, rfprefix+"zone-awareness-enabled", false, "True to enable zone-awareness and replicate alerts across different availability zones.")
```

pkg/compactor/compactor_ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
     // Ring flags
     cfg.KVStore.RegisterFlagsWithPrefix("compactor.ring.", "collectors/", f)
     f.DurationVar(&cfg.HeartbeatPeriod, "compactor.ring.heartbeat-period", 5*time.Second, "Period at which to heartbeat to the ring.")
-    f.DurationVar(&cfg.HeartbeatTimeout, "compactor.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which compactors are considered unhealthy within the ring.")
+    f.DurationVar(&cfg.HeartbeatTimeout, "compactor.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which compactors are considered unhealthy within the ring. 0 = never (timeout disabled).")
 
     // Wait stability flags.
     f.DurationVar(&cfg.WaitStabilityMinDuration, "compactor.ring.wait-stability-min-duration", time.Minute, "Minimum time to wait for ring stability at startup. 0 to disable.")
```

pkg/distributor/distributor_ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -43,7 +43,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
     // Ring flags
     cfg.KVStore.RegisterFlagsWithPrefix("distributor.ring.", "collectors/", f)
     f.DurationVar(&cfg.HeartbeatPeriod, "distributor.ring.heartbeat-period", 5*time.Second, "Period at which to heartbeat to the ring.")
-    f.DurationVar(&cfg.HeartbeatTimeout, "distributor.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which distributors are considered unhealthy within the ring.")
+    f.DurationVar(&cfg.HeartbeatTimeout, "distributor.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which distributors are considered unhealthy within the ring. 0 = never (timeout disabled).")
 
     // Instance flags
     cfg.InstanceInterfaceNames = []string{"eth0", "en0"}
```

pkg/ring/model.go

Lines changed: 11 additions & 2 deletions
```diff
@@ -101,7 +101,7 @@ func (d *Desc) FindIngestersByState(state InstanceState) []InstanceDesc {
 func (d *Desc) Ready(now time.Time, heartbeatTimeout time.Duration) error {
     numTokens := 0
     for id, ingester := range d.Ingesters {
-        if now.Sub(time.Unix(ingester.Timestamp, 0)) > heartbeatTimeout {
+        if !ingester.IsHeartbeatHealthy(heartbeatTimeout, now) {
             return fmt.Errorf("instance %s past heartbeat timeout", id)
         } else if ingester.State != ACTIVE {
             return fmt.Errorf("instance %s in state %v", id, ingester.State)
@@ -136,7 +136,16 @@ func (i *InstanceDesc) GetRegisteredAt() time.Time {
 func (i *InstanceDesc) IsHealthy(op Operation, heartbeatTimeout time.Duration, now time.Time) bool {
     healthy := op.IsInstanceInStateHealthy(i.State)
 
-    return healthy && now.Unix()-i.Timestamp <= heartbeatTimeout.Milliseconds()/1000
+    return healthy && i.IsHeartbeatHealthy(heartbeatTimeout, now)
+}
+
+// IsHeartbeatHealthy returns whether the heartbeat timestamp for the ingester is within the
+// specified timeout period. A timeout of zero disables the timeout; the heartbeat is ignored.
+func (i *InstanceDesc) IsHeartbeatHealthy(heartbeatTimeout time.Duration, now time.Time) bool {
+    if heartbeatTimeout == 0 {
+        return true
+    }
+    return now.Sub(time.Unix(i.Timestamp, 0)) <= heartbeatTimeout
 }
 
 // Merge merges other ring into this one. Returns sub-ring that represents the change,
```
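
The core of the change is the new `IsHeartbeatHealthy` helper above: a zero timeout short-circuits to healthy, otherwise the last heartbeat timestamp must be no older than the timeout. A minimal standalone Go sketch of the same logic (not the Cortex type itself; the `lastHeartbeat` value is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatHealthy mirrors the logic of InstanceDesc.IsHeartbeatHealthy in the
// diff above: a zero timeout disables the check entirely, otherwise the last
// heartbeat must be within the timeout period.
func heartbeatHealthy(lastHeartbeat int64, heartbeatTimeout time.Duration, now time.Time) bool {
	if heartbeatTimeout == 0 {
		return true // timeout disabled: instance always counts as healthy
	}
	return now.Sub(time.Unix(lastHeartbeat, 0)) <= heartbeatTimeout
}

func main() {
	now := time.Now()
	stale := now.Add(-5 * time.Minute).Unix() // illustrative heartbeat, 5 minutes old

	fmt.Println(heartbeatHealthy(stale, time.Minute, now)) // false: past the 1m timeout
	fmt.Println(heartbeatHealthy(stale, 0, now))           // true: timeout disabled
}
```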

pkg/ring/model_test.go

Lines changed: 8 additions & 0 deletions
```diff
@@ -136,10 +136,18 @@ func TestDesc_Ready(t *testing.T) {
         t.Fatal("expected ready, got", err)
     }
 
+    if err := r.Ready(now, 0); err != nil {
+        t.Fatal("expected ready, got", err)
+    }
+
     if err := r.Ready(now.Add(5*time.Minute), 10*time.Second); err == nil {
         t.Fatal("expected !ready (no heartbeat from active ingester), but got no error")
     }
 
+    if err := r.Ready(now.Add(5*time.Minute), 0); err != nil {
+        t.Fatal("expected ready (no heartbeat but timeout disabled), got", err)
+    }
+
     r = &Desc{
         Ingesters: map[string]InstanceDesc{
             "ing1": {
```

pkg/ring/ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -147,7 +147,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
 func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
     cfg.KVStore.RegisterFlagsWithPrefix(prefix, "collectors/", f)
 
-    f.DurationVar(&cfg.HeartbeatTimeout, prefix+"ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which ingesters are skipped for reads/writes.")
+    f.DurationVar(&cfg.HeartbeatTimeout, prefix+"ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which ingesters are skipped for reads/writes. 0 = never (timeout disabled).")
     f.IntVar(&cfg.ReplicationFactor, prefix+"distributor.replication-factor", 3, "The number of ingesters to write to and read from.")
     f.BoolVar(&cfg.ZoneAwarenessEnabled, prefix+"distributor.zone-awareness-enabled", false, "True to enable the zone-awareness and replicate ingested samples across different availability zones.")
 }
```

pkg/ring/ring_test.go

Lines changed: 21 additions & 2 deletions
```diff
@@ -390,11 +390,11 @@ func TestRing_GetAllHealthy(t *testing.T) {
 }
 
 func TestRing_GetReplicationSetForOperation(t *testing.T) {
-    const heartbeatTimeout = time.Minute
     now := time.Now()
 
     tests := map[string]struct {
         ringInstances map[string]InstanceDesc
+        ringHeartbeatTimeout time.Duration
         ringReplicationFactor int
         expectedErrForRead error
         expectedSetForRead []string
@@ -405,6 +405,7 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
     }{
         "should return error on empty ring": {
             ringInstances: nil,
+            ringHeartbeatTimeout: time.Minute,
             ringReplicationFactor: 1,
             expectedErrForRead: ErrEmptyRing,
             expectedErrForWrite: ErrEmptyRing,
@@ -418,6 +419,21 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
                 "instance-4": {Addr: "127.0.0.4", State: ACTIVE, Timestamp: now.Add(-30 * time.Second).Unix(), Tokens: GenerateTokens(128, nil)},
                 "instance-5": {Addr: "127.0.0.5", State: ACTIVE, Timestamp: now.Add(-40 * time.Second).Unix(), Tokens: GenerateTokens(128, nil)},
             },
+            ringHeartbeatTimeout: time.Minute,
+            ringReplicationFactor: 1,
+            expectedSetForRead: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"},
+            expectedSetForWrite: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"},
+            expectedSetForReporting: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"},
+        },
+        "should succeed on instances with old timestamps but heartbeat timeout disabled": {
+            ringInstances: map[string]InstanceDesc{
+                "instance-1": {Addr: "127.0.0.1", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
+                "instance-2": {Addr: "127.0.0.2", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
+                "instance-3": {Addr: "127.0.0.3", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
+                "instance-4": {Addr: "127.0.0.4", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
+                "instance-5": {Addr: "127.0.0.5", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
+            },
+            ringHeartbeatTimeout: 0,
             ringReplicationFactor: 1,
             expectedSetForRead: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"},
             expectedSetForWrite: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"},
@@ -431,6 +447,7 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
                 "instance-4": {Addr: "127.0.0.4", State: ACTIVE, Timestamp: now.Add(-30 * time.Second).Unix(), Tokens: GenerateTokens(128, nil)},
                 "instance-5": {Addr: "127.0.0.5", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
             },
+            ringHeartbeatTimeout: time.Minute,
             ringReplicationFactor: 1,
             expectedErrForRead: ErrTooManyUnhealthyInstances,
             expectedErrForWrite: ErrTooManyUnhealthyInstances,
@@ -444,6 +461,7 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
                 "instance-4": {Addr: "127.0.0.4", State: ACTIVE, Timestamp: now.Add(-30 * time.Second).Unix(), Tokens: GenerateTokens(128, nil)},
                 "instance-5": {Addr: "127.0.0.5", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
             },
+            ringHeartbeatTimeout: time.Minute,
             ringReplicationFactor: 3,
             expectedSetForRead: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"},
             expectedSetForWrite: []string{"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"},
@@ -457,6 +475,7 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
                 "instance-4": {Addr: "127.0.0.4", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
                 "instance-5": {Addr: "127.0.0.5", State: ACTIVE, Timestamp: now.Add(-2 * time.Minute).Unix(), Tokens: GenerateTokens(128, nil)},
             },
+            ringHeartbeatTimeout: time.Minute,
             ringReplicationFactor: 3,
             expectedErrForRead: ErrTooManyUnhealthyInstances,
             expectedErrForWrite: ErrTooManyUnhealthyInstances,
@@ -474,7 +493,7 @@ func TestRing_GetReplicationSetForOperation(t *testing.T) {
 
         ring := Ring{
             cfg: Config{
-                HeartbeatTimeout: heartbeatTimeout,
+                HeartbeatTimeout: testData.ringHeartbeatTimeout,
                 ReplicationFactor: testData.ringReplicationFactor,
             },
             ringDesc: ringDesc,
```

pkg/ruler/ruler_ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -57,7 +57,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
     // Ring flags
     cfg.KVStore.RegisterFlagsWithPrefix("ruler.ring.", "rulers/", f)
     f.DurationVar(&cfg.HeartbeatPeriod, "ruler.ring.heartbeat-period", 5*time.Second, "Period at which to heartbeat to the ring.")
-    f.DurationVar(&cfg.HeartbeatTimeout, "ruler.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which rulers are considered unhealthy within the ring.")
+    f.DurationVar(&cfg.HeartbeatTimeout, "ruler.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which rulers are considered unhealthy within the ring. 0 = never (timeout disabled).")
 
     // Instance flags
     cfg.InstanceInterfaceNames = []string{"eth0", "en0"}
```

pkg/storegateway/gateway_ring.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -95,7 +95,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
     // Ring flags
     cfg.KVStore.RegisterFlagsWithPrefix(ringFlagsPrefix, "collectors/", f)
     f.DurationVar(&cfg.HeartbeatPeriod, ringFlagsPrefix+"heartbeat-period", 15*time.Second, "Period at which to heartbeat to the ring.")
-    f.DurationVar(&cfg.HeartbeatTimeout, ringFlagsPrefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which store gateways are considered unhealthy within the ring."+sharedOptionWithQuerier)
+    f.DurationVar(&cfg.HeartbeatTimeout, ringFlagsPrefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which store gateways are considered unhealthy within the ring. 0 = never (timeout disabled)."+sharedOptionWithQuerier)
     f.IntVar(&cfg.ReplicationFactor, ringFlagsPrefix+"replication-factor", 3, "The replication factor to use when sharding blocks."+sharedOptionWithQuerier)
     f.StringVar(&cfg.TokensFilePath, ringFlagsPrefix+"tokens-file-path", "", "File path where tokens are stored. If empty, tokens are not stored at shutdown and restored at startup.")
     f.BoolVar(&cfg.ZoneAwarenessEnabled, ringFlagsPrefix+"zone-awareness-enabled", false, "True to enable zone-awareness and replicate blocks across different availability zones.")
```
