Commit 4f45f5d

Merge pull request #9591 from gyuho/election
*: add --initial-election-tick-advance to configure election fast-forward on bootstrap
2 parents e81f9d8 + 2d7cb9d commit 4f45f5d

9 files changed, +146 -39 lines changed

CHANGELOG-3.3.md

+18
@@ -6,6 +6,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.3...v3.3.4) and [

 ### Metrics, Monitoring

+- Add [`etcd_server_is_leader`](https://github.com/coreos/etcd/pull/9587) Prometheus metric.
 - Fix [`etcd_debugging_server_lease_expired_total`](https://github.com/coreos/etcd/pull/9557) Prometheus metric.
 - Fix [race conditions in v2 server stat collecting](https://github.com/coreos/etcd/pull/9562).

@@ -16,6 +17,23 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.3...v3.3.4) and [
   - However, a certificate whose SAN field does [not include any domain names but only IP addresses](https://github.com/coreos/etcd/issues/9541) would request `*tls.ClientHelloInfo` with an empty `ServerName` field, thus failing to trigger the TLS reload on initial TLS handshake; this becomes a problem when expired certificates need to be replaced online.
   - Now, `(*tls.Config).Certificates` is created empty on initial TLS client handshake, first to trigger `(*tls.Config).GetCertificate`, and then to populate the rest of the certificates on every new TLS connection, even when client SNI is empty (e.g. cert only includes IPs).

+### Added: `etcd`
+
+- Add [`--initial-election-tick-advance`](https://github.com/coreos/etcd/pull/9591) flag to configure initial election tick fast-forward.
+  - By default, `--initial-election-tick-advance=true`, so the local member fast-forwards election ticks to speed up the "initial" leader election trigger.
+  - This benefits deployments with larger election ticks. For instance, a cross-datacenter deployment may require a longer election timeout of 10 seconds. If true, the local node does not need to wait up to 10 seconds. Instead, it forwards its election ticks to 8 seconds and has only 2 seconds left before leader election.
+  - The major assumptions are that the cluster has no active leader, so advancing ticks enables faster leader election, or that the cluster already has an established leader and the rejoining follower is likely to receive heartbeats from the leader after the tick advance and before the election timeout.
+  - However, when the network from the leader to the rejoining follower is congested and the follower does not receive a leader heartbeat within the remaining election ticks, a disruptive election has to happen, which affects cluster availability.
+  - Now, this can be disabled by setting `--initial-election-tick-advance=false`.
+  - Disabling this would slow down the initial bootstrap process for cross-datacenter deployments. Weigh the tradeoff when configuring `--initial-election-tick-advance`: disabling it comes at the cost of a slower initial bootstrap.
+  - A single-node cluster advances ticks regardless.
+  - Address [disruptive rejoining follower node](https://github.com/coreos/etcd/issues/9333).
+
+### Added: `embed`
+
+- Add [`embed.Config.InitialElectionTickAdvance`](https://github.com/coreos/etcd/pull/9591) to enable/disable initial election tick fast-forward.
+  - `embed.NewConfig()` returns `*embed.Config` with `InitialElectionTickAdvance` set to true by default.
+

 ## [v3.3.3](https://github.com/coreos/etcd/releases/tag/v3.3.3) (2018-03-29)

CHANGELOG-3.4.md

+12
@@ -94,6 +94,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [

 ### Metrics, Monitoring

+- Add [`etcd_server_is_leader`](https://github.com/coreos/etcd/pull/9587) Prometheus metric.
 - Add [`etcd_debugging_mvcc_db_total_size_in_use_in_bytes`](https://github.com/coreos/etcd/pull/9256) Prometheus metric.
 - Add missing [`etcd_network_peer_sent_failures_total` count](https://github.com/coreos/etcd/pull/9437).
 - Fix [`etcd_debugging_server_lease_expired_total`](https://github.com/coreos/etcd/pull/9557) Prometheus metric.
@@ -124,6 +125,15 @@ See [security doc](https://github.com/coreos/etcd/blob/master/Documentation/op-g

 ### Added: `etcd`

+- Add [`--initial-election-tick-advance`](https://github.com/coreos/etcd/pull/9591) flag to configure initial election tick fast-forward.
+  - By default, `--initial-election-tick-advance=true`, so the local member fast-forwards election ticks to speed up the "initial" leader election trigger.
+  - This benefits deployments with larger election ticks. For instance, a cross-datacenter deployment may require a longer election timeout of 10 seconds. If true, the local node does not need to wait up to 10 seconds. Instead, it forwards its election ticks to 8 seconds and has only 2 seconds left before leader election.
+  - The major assumptions are that the cluster has no active leader, so advancing ticks enables faster leader election, or that the cluster already has an established leader and the rejoining follower is likely to receive heartbeats from the leader after the tick advance and before the election timeout.
+  - However, when the network from the leader to the rejoining follower is congested and the follower does not receive a leader heartbeat within the remaining election ticks, a disruptive election has to happen, which affects cluster availability.
+  - Now, this can be disabled by setting `--initial-election-tick-advance=false`.
+  - Disabling this would slow down the initial bootstrap process for cross-datacenter deployments. Weigh the tradeoff when configuring `--initial-election-tick-advance`: disabling it comes at the cost of a slower initial bootstrap.
+  - A single-node cluster advances ticks regardless.
+  - Address [disruptive rejoining follower node](https://github.com/coreos/etcd/issues/9333).
 - Add [`--pre-vote`](https://github.com/coreos/etcd/pull/9352) flag to enable an additional Raft election phase.
   - For instance, a flaky (or rejoining) member may drop in and out, and start a campaign. This member will end up with a higher term, and ignore all incoming messages with a lower term. In this case, a new leader eventually needs to get elected, which is disruptive to cluster availability. Raft implements a Pre-Vote phase to prevent this kind of disruption. If enabled, Raft runs an additional phase of election to check if the pre-candidate can get enough votes to win an election.
   - `--pre-vote=false` by default.
@@ -155,6 +165,8 @@ See [security doc](https://github.com/coreos/etcd/blob/master/Documentation/op-g

 ### Added: `embed`

+- Add [`embed.Config.InitialElectionTickAdvance`](https://github.com/coreos/etcd/pull/9591) to enable/disable initial election tick fast-forward.
+  - `embed.NewConfig()` returns `*embed.Config` with `InitialElectionTickAdvance` set to true by default.
 - Add [`embed.Config.Logger`](https://github.com/coreos/etcd/pull/9518) to support [structured logger `zap`](https://github.com/uber-go/zap) in server-side.
 - Define [`embed.CompactorModePeriodic`](https://godoc.org/github.com/coreos/etcd/embed#pkg-variables) for `compactor.ModePeriodic`.
 - Define [`embed.CompactorModeRevision`](https://godoc.org/github.com/coreos/etcd/embed#pkg-variables) for `compactor.ModeRevision`.

embed/config.go

+35 -4
@@ -121,8 +121,38 @@ type Config struct {
 	// TickMs is the number of milliseconds between heartbeat ticks.
 	// TODO: decouple tickMs and heartbeat tick (current heartbeat tick = 1).
 	// make ticks a cluster wide configuration.
-	TickMs            uint  `json:"heartbeat-interval"`
-	ElectionMs        uint  `json:"election-timeout"`
+	TickMs     uint `json:"heartbeat-interval"`
+	ElectionMs uint `json:"election-timeout"`
+
+	// InitialElectionTickAdvance, if true, makes the local member fast-forward
+	// election ticks to speed up the "initial" leader election trigger. This
+	// benefits deployments with larger election ticks. For instance, a
+	// cross-datacenter deployment may require a longer election timeout of
+	// 10 seconds. If true, the local node does not need to wait up to
+	// 10 seconds. Instead, it forwards its election ticks to 8 seconds and
+	// has only 2 seconds left before leader election.
+	//
+	// The major assumptions are that:
+	//  - the cluster has no active leader, so advancing ticks enables faster
+	//    leader election, or
+	//  - the cluster already has an established leader, and the rejoining
+	//    follower is likely to receive heartbeats from the leader after the
+	//    tick advance and before the election timeout.
+	//
+	// However, when the network from the leader to the rejoining follower is
+	// congested and the follower does not receive a leader heartbeat within
+	// the remaining election ticks, a disruptive election has to happen,
+	// which affects cluster availability.
+	//
+	// Disabling this would slow down the initial bootstrap process for
+	// cross-datacenter deployments. Weigh the tradeoff when configuring
+	// --initial-election-tick-advance: disabling it slows initial bootstrap.
+	//
+	// A single-node cluster advances ticks regardless.
+	//
+	// See https://github.com/coreos/etcd/issues/9333 for more detail.
+	InitialElectionTickAdvance bool `json:"initial-election-tick-advance"`
+
 	QuotaBackendBytes int64 `json:"quota-backend-bytes"`
 	MaxTxnOps         uint  `json:"max-txn-ops"`
 	MaxRequestBytes   uint  `json:"max-request-bytes"`
@@ -305,8 +335,9 @@ func NewConfig() *Config {
 		GRPCKeepAliveInterval: DefaultGRPCKeepAliveInterval,
 		GRPCKeepAliveTimeout:  DefaultGRPCKeepAliveTimeout,

-		TickMs:     100,
-		ElectionMs: 1000,
+		TickMs:                     100,
+		ElectionMs:                 1000,
+		InitialElectionTickAdvance: true,

 		LPUrls: []url.URL{*lpurl},
 		LCUrls: []url.URL{*lcurl},

embed/etcd.go

+34 -33
@@ -158,39 +158,40 @@ func StartEtcd(inCfg *Config) (e *Etcd, err error) {
 	}

 	srvcfg := etcdserver.ServerConfig{
-		Name:                    cfg.Name,
-		ClientURLs:              cfg.ACUrls,
-		PeerURLs:                cfg.APUrls,
-		DataDir:                 cfg.Dir,
-		DedicatedWALDir:         cfg.WalDir,
-		SnapCount:               cfg.SnapCount,
-		MaxSnapFiles:            cfg.MaxSnapFiles,
-		MaxWALFiles:             cfg.MaxWalFiles,
-		InitialPeerURLsMap:      urlsmap,
-		InitialClusterToken:     token,
-		DiscoveryURL:            cfg.Durl,
-		DiscoveryProxy:          cfg.Dproxy,
-		NewCluster:              cfg.IsNewCluster(),
-		PeerTLSInfo:             cfg.PeerTLSInfo,
-		TickMs:                  cfg.TickMs,
-		ElectionTicks:           cfg.ElectionTicks(),
-		AutoCompactionRetention: autoCompactionRetention,
-		AutoCompactionMode:      cfg.AutoCompactionMode,
-		QuotaBackendBytes:       cfg.QuotaBackendBytes,
-		MaxTxnOps:               cfg.MaxTxnOps,
-		MaxRequestBytes:         cfg.MaxRequestBytes,
-		StrictReconfigCheck:     cfg.StrictReconfigCheck,
-		ClientCertAuthEnabled:   cfg.ClientTLSInfo.ClientCertAuth,
-		AuthToken:               cfg.AuthToken,
-		CORS:                    cfg.CORS,
-		HostWhitelist:           cfg.HostWhitelist,
-		InitialCorruptCheck:     cfg.ExperimentalInitialCorruptCheck,
-		CorruptCheckTime:        cfg.ExperimentalCorruptCheckTime,
-		PreVote:                 cfg.PreVote,
-		Logger:                  cfg.logger,
-		LoggerConfig:            cfg.loggerConfig,
-		Debug:                   cfg.Debug,
-		ForceNewCluster:         cfg.ForceNewCluster,
+		Name:                       cfg.Name,
+		ClientURLs:                 cfg.ACUrls,
+		PeerURLs:                   cfg.APUrls,
+		DataDir:                    cfg.Dir,
+		DedicatedWALDir:            cfg.WalDir,
+		SnapCount:                  cfg.SnapCount,
+		MaxSnapFiles:               cfg.MaxSnapFiles,
+		MaxWALFiles:                cfg.MaxWalFiles,
+		InitialPeerURLsMap:         urlsmap,
+		InitialClusterToken:        token,
+		DiscoveryURL:               cfg.Durl,
+		DiscoveryProxy:             cfg.Dproxy,
+		NewCluster:                 cfg.IsNewCluster(),
+		PeerTLSInfo:                cfg.PeerTLSInfo,
+		TickMs:                     cfg.TickMs,
+		ElectionTicks:              cfg.ElectionTicks(),
+		InitialElectionTickAdvance: cfg.InitialElectionTickAdvance,
+		AutoCompactionRetention:    autoCompactionRetention,
+		AutoCompactionMode:         cfg.AutoCompactionMode,
+		QuotaBackendBytes:          cfg.QuotaBackendBytes,
+		MaxTxnOps:                  cfg.MaxTxnOps,
+		MaxRequestBytes:            cfg.MaxRequestBytes,
+		StrictReconfigCheck:        cfg.StrictReconfigCheck,
+		ClientCertAuthEnabled:      cfg.ClientTLSInfo.ClientCertAuth,
+		AuthToken:                  cfg.AuthToken,
+		CORS:                       cfg.CORS,
+		HostWhitelist:              cfg.HostWhitelist,
+		InitialCorruptCheck:        cfg.ExperimentalInitialCorruptCheck,
+		CorruptCheckTime:           cfg.ExperimentalCorruptCheckTime,
+		PreVote:                    cfg.PreVote,
+		Logger:                     cfg.logger,
+		LoggerConfig:               cfg.loggerConfig,
+		Debug:                      cfg.Debug,
+		ForceNewCluster:            cfg.ForceNewCluster,
 	}
 	if e.Server, err = etcdserver.NewServer(srvcfg); err != nil {
 		return e, err

etcdmain/config.go

+1
@@ -153,6 +153,7 @@ func newConfig() *config {
 	fs.Uint64Var(&cfg.ec.SnapCount, "snapshot-count", cfg.ec.SnapCount, "Number of committed transactions to trigger a snapshot to disk.")
 	fs.UintVar(&cfg.ec.TickMs, "heartbeat-interval", cfg.ec.TickMs, "Time (in milliseconds) of a heartbeat interval.")
 	fs.UintVar(&cfg.ec.ElectionMs, "election-timeout", cfg.ec.ElectionMs, "Time (in milliseconds) for an election to timeout.")
+	fs.BoolVar(&cfg.ec.InitialElectionTickAdvance, "initial-election-tick-advance", cfg.ec.InitialElectionTickAdvance, "Whether to fast-forward initial election ticks on boot for faster election.")
 	fs.Int64Var(&cfg.ec.QuotaBackendBytes, "quota-backend-bytes", cfg.ec.QuotaBackendBytes, "Raise alarms when backend size exceeds the given quota. 0 means use the default quota.")
 	fs.UintVar(&cfg.ec.MaxTxnOps, "max-txn-ops", cfg.ec.MaxTxnOps, "Maximum number of operations permitted in a transaction.")
 	fs.UintVar(&cfg.ec.MaxRequestBytes, "max-request-bytes", cfg.ec.MaxRequestBytes, "Maximum client request size in bytes the server will accept.")

etcdmain/help.go

+2
@@ -55,6 +55,8 @@ Member:
 		Time (in milliseconds) of a heartbeat interval.
 	--election-timeout '1000'
 		Time (in milliseconds) for an election to timeout. See tuning documentation for details.
+	--initial-election-tick-advance 'true'
+		Whether to fast-forward initial election ticks on boot for faster election.
 	--listen-peer-urls 'http://localhost:2380'
 		List of URLs to listen on for peer traffic.
 	--listen-client-urls 'http://localhost:2379'

etcdserver/config.go

+33 -2
@@ -55,8 +55,38 @@ type ServerConfig struct {
 	// whose Host header value exists in this white list.
 	HostWhitelist map[string]struct{}

-	TickMs           uint
-	ElectionTicks    int
+	TickMs        uint
+	ElectionTicks int
+
+	// InitialElectionTickAdvance, if true, makes the local member fast-forward
+	// election ticks to speed up the "initial" leader election trigger. This
+	// benefits deployments with larger election ticks. For instance, a
+	// cross-datacenter deployment may require a longer election timeout of
+	// 10 seconds. If true, the local node does not need to wait up to
+	// 10 seconds. Instead, it forwards its election ticks to 8 seconds and
+	// has only 2 seconds left before leader election.
+	//
+	// The major assumptions are that:
+	//  - the cluster has no active leader, so advancing ticks enables faster
+	//    leader election, or
+	//  - the cluster already has an established leader, and the rejoining
+	//    follower is likely to receive heartbeats from the leader after the
+	//    tick advance and before the election timeout.
+	//
+	// However, when the network from the leader to the rejoining follower is
+	// congested and the follower does not receive a leader heartbeat within
+	// the remaining election ticks, a disruptive election has to happen,
+	// which affects cluster availability.
+	//
+	// Disabling this would slow down the initial bootstrap process for
+	// cross-datacenter deployments. Weigh the tradeoff when configuring
+	// --initial-election-tick-advance: disabling it slows initial bootstrap.
+	//
+	// A single-node cluster advances ticks regardless.
+	//
+	// See https://github.com/coreos/etcd/issues/9333 for more detail.
+	InitialElectionTickAdvance bool
+
 	BootstrapTimeout time.Duration

 	AutoCompactionRetention time.Duration
@@ -263,6 +293,7 @@ func (c *ServerConfig) print(initial bool) {
 			zap.String("heartbeat-interval", fmt.Sprintf("%v", time.Duration(c.TickMs)*time.Millisecond)),
 			zap.Int("election-tick-ms", c.ElectionTicks),
 			zap.String("election-timeout", fmt.Sprintf("%v", time.Duration(c.ElectionTicks*int(c.TickMs))*time.Millisecond)),
+			zap.Bool("initial-election-tick-advance", c.InitialElectionTickAdvance),
 			zap.Uint64("snapshot-count", c.SnapCount),
 			zap.Strings("advertise-client-urls", c.getACURLs()),
 			zap.Strings("initial-advertise-peer-urls", c.getAPURLs()),

etcdserver/server.go

+10
@@ -635,6 +635,16 @@ func (s *EtcdServer) adjustTicks() {
 		return
 	}

+	if !s.Cfg.InitialElectionTickAdvance {
+		if lg != nil {
+			lg.Info("skipping initial election tick advance", zap.Int("election-ticks", s.Cfg.ElectionTicks))
+		}
+		return
+	}
+	if lg != nil {
+		lg.Info("starting initial election tick advance", zap.Int("election-ticks", s.Cfg.ElectionTicks))
+	}
+
 	// retry up to "rafthttp.ConnReadTimeout", which is 5-sec
 	// until peer connection reports; otherwise:
 	// 1. all connections failed, or

integration/cluster.go

+1
@@ -593,6 +593,7 @@ func mustNewMember(t *testing.T, mcfg memberConfig) *member {
 		m.ServerConfig.PeerTLSInfo = *m.PeerTLSInfo
 	}
 	m.ElectionTicks = electionTicks
+	m.InitialElectionTickAdvance = true
 	m.TickMs = uint(tickDuration / time.Millisecond)
 	m.QuotaBackendBytes = mcfg.quotaBackendBytes
 	m.MaxTxnOps = mcfg.maxTxnOps
