Consul downtime when leader leaves #3893

Closed
holtwilkins opened this issue Feb 19, 2018 · 7 comments
Labels
type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments

holtwilkins commented Feb 19, 2018

Description of the Issue (and unexpected/desired result)

I would expect that, when the consul leader issues consul leave, clients shouldn't receive RPC failures while the cluster elects a new leader. Instead, there seems to be a window of around 15s during which clients can receive failures until consul has elected a new leader. Is there a "more correct" way to have a consul leader leave the cluster?

Reproduction steps

Issue consul leave on the cluster leader.

consul version for both Client and Server

Client: v1.0.6
Server: v1.0.6

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 5
	services = 5
build:
	prerelease =
	revision = 9a494b5f
	version = 1.0.6
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 57
	max_procs = 2
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 169
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 6
	member_time = 1774
	members = 14
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 3
build:
	prerelease =
	revision = 9a494b5f
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.2.210.100:8300
	server = true
raft:
	applied_index = 19809408
	commit_index = 19809408
	fsm_pending = 0
	last_contact = 0
	last_log_index = 19809408
	last_log_term = 393
	last_snapshot_index = 19802623
	last_snapshot_term = 383
	latest_configuration = [{Suffrage:Voter ID:9859a83e-e11e-aff2-6165-fb151922c4dc Address:10.2.210.100:8300} {Suffrage:Voter ID:d4a2a349-43f0-d5e4-521a-55d9ebd2a265 Address:10.2.210.201:8300} {Suffrage:Voter ID:f776da6d-3164-2b26-ca70-9fc72ce6d655 Address:10.2.211.100:8300}]
	latest_configuration_index = 19808688
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 393
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 191
	max_procs = 4
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 169
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 2
	member_time = 1774
	members = 10
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04 in AWS.

Log Fragments or Link to gist

Feb 19 01:50:29 ip-10-2-211-212 consul[3175]:     2018/02/19 01:50:29 [INFO] serf: EventMemberJoin: server-use1-0-10-2-210-100 10.2.210.100
Feb 19 01:50:29 ip-10-2-211-212 consul[3175]:     2018/02/19 01:50:29 [INFO] consul: adding server server-use1-0-10-2-210-100 (Addr: tcp/10.2.210.100:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]:     2018/02/19 01:51:54 [INFO] serf: EventMemberFailed: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]:     2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]:     2018/02/19 01:51:54 [INFO] serf: EventMemberLeave (forced): server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]:     2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]:     2018/02/19 01:55:16 [INFO] serf: EventMemberJoin: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]:     2018/02/19 01:55:16 [INFO] consul: adding server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:07 [INFO] serf: EventMemberLeave: server-use1-2-10-2-211-100 10.2.211.100
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:07 [INFO] consul: removing server server-use1-2-10-2-211-100 (Addr: tcp/10.2.211.100:8300) (DC: us-east-1)
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.211.100:8300: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: EOF from=127.0.0.1:48420
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:47800
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:44304
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:20 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:20 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: No cluster leader from=127.0.0.1:49212
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: No cluster leader from=127.0.0.1:49653
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: No cluster leader from=127.0.0.1:49652
Feb 19 01:57:23 ip-10-2-211-212 consul[3175]:     2018/02/19 01:57:23 [INFO] consul: New leader elected: server-use1-0-10-2-210-100

EDIT: I failed to note before that server-use1-2-10-2-211-100 was the cluster leader at the beginning of the log snippet above. So, in the snippet, we see it leave the consul cluster, then the RPC errors, and then consul electing server-use1-0-10-2-210-100 as the new leader.

@preetapan (Contributor)

@holtwilkins We made some changes in #3514 to make RPC handling during leader elections more robust. Config options under performance can be used to mitigate 500s during server leaves and leader elections.

Looking at your logs, it appears that the default values for leave_drain_time and rpc_hold_timeout may not be enough for your environment. I suggest bumping them both up: the first allows in-flight RPCs to the leader to finish when you issue consul leave, and the second allows other internal RPCs to be retried during a leader election.
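
Both options live under the agent's performance stanza. As a rough sketch (the values here are purely illustrative, not a recommendation for any particular environment), bumping them in a JSON agent config file would look something like this:

{
  "performance": {
    "leave_drain_time": "10s",
    "rpc_hold_timeout": "15s"
  }
}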

preetapan added the type/question label Feb 19, 2018
@holtwilkins (Author)

Thanks @preetapan, I had missed these options being added in the 1.0 release notes. I'm curious why you're suggesting increasing those two values as opposed to lowering raft_multiplier down to 1? And if I should increase those two timeouts, should rpc_hold_timeout always be some percentage longer than leave_drain_time, or something like that?

Just in general, it seems like these defaults could be tuned a bit, given that I'm running the defaults on rather beefy consul servers and getting outages during leader elections...

banks (Member) commented Feb 19, 2018

Just in general, it seems like these defaults could be tuned a bit, given that I'm running the defaults on rather beefy consul servers and getting outages during leader elections...

@holtwilkins yeah, the defaults are intentionally tuned for minimal, low-power setups, as noted in https://www.consul.io/docs/guides/performance.html#minimum-server-requirements and in the specific link @preetapan gave before.

The link above explains why the defaults are chosen as they are; it's expected, and hopefully clearly documented, that "production" installations will likely need to tune them.

Setting [raft_multiplier] to a value of 1 will configure Raft to its highest-performance mode, equivalent to the default timing of Consul prior to 0.7, and is recommended for production Consul servers.

Not sure if Preetha assumed you had already followed the docs for a production setup, but it seems you should try that first and then increase the timeouts if you still see issues in your environment.

Hope this helps :)

@holtwilkins (Author)

Yup @banks, I see the notes about "production" clusters now. I dropped raft_multiplier to 1 and it helped a lot, but I still got a 500 after a few leaves, so I've also set leave_drain_time to 15s and rpc_hold_timeout to 20s. I'm still not clear on the relationship between these two values, though, or why their defaults differ from each other.
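
In config terms, that works out to a performance stanza along these lines (a sketch in JSON form, not copied verbatim from my config files):

{
  "performance": {
    "raft_multiplier": 1,
    "leave_drain_time": "15s",
    "rpc_hold_timeout": "20s"
  }
}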

@guidoiaquinti (Contributor)

@banks @preetapan can you please expand on @holtwilkins's request? We are currently facing the same issue in our environment.

banks (Member) commented May 2, 2018

Hi @guidoiaquinti @holtwilkins, sorry for the delay; I missed the mention here.

but I'm still not clear on the relationship between these two values and why their defaults are different from each other?

The simplest difference is that leave_drain_time is how long the server waits during a graceful leave (a typical graceful-shutdown timeout, as with most HTTP/RPC servers):

// Start refusing RPCs now that we've left the LAN pool. It's important
// to do this *after* we've left the LAN pool so that clients will know
// to shift onto another server if they perform a retry. We also wake up
// all queries in the RPC retry state.
s.logger.Printf("[INFO] consul: Waiting %s to drain RPC traffic", s.config.LeaveDrainTime)
close(s.leaveCh)
time.Sleep(s.config.LeaveDrainTime)

While rpc_hold_timeout is how long the client (and any server that forwards an RPC to the leader) will wait while RPC retries are made:

Client:

	// This is subtle but we start measuring the time on the client side
	// right at the time of the first request, vs. on the first retry as
	// is done on the server side inside forward(). This is because the
	// servers may already be applying the RPCHoldTimeout up there, so by
	// starting the timer here we won't potentially double up the delay.
	// TODO (slackpad) Plumb a deadline here with a context.
	firstCheck := time.Now()

TRY:
	server := c.routers.FindServer()
	if server == nil {
		return structs.ErrNoServers
	}

	// Enforce the RPC limit.
	metrics.IncrCounter([]string{"consul", "client", "rpc"}, 1)
	metrics.IncrCounter([]string{"client", "rpc"}, 1)
	if !c.rpcLimiter.Allow() {
		metrics.IncrCounter([]string{"consul", "client", "rpc", "exceeded"}, 1)
		metrics.IncrCounter([]string{"client", "rpc", "exceeded"}, 1)
		return structs.ErrRPCRateExceeded
	}

	// Make the request.
	rpcErr := c.connPool.RPC(c.config.Datacenter, server.Addr, server.Version, method, server.UseTLS, args, reply)
	if rpcErr == nil {
		return nil
	}

	// Move off to another server, and see if we can retry.
	c.logger.Printf("[ERR] consul: %q RPC failed to server %s: %v", method, server.Addr, rpcErr)
	c.routers.NotifyFailedServer(server)
	if retry := canRetry(args, rpcErr); !retry {
		return rpcErr
	}

	// We can wait a bit and retry!
	if time.Since(firstCheck) < c.config.RPCHoldTimeout {
		jitter := lib.RandomStagger(c.config.RPCHoldTimeout / jitterFraction)

And on servers forwarding to the leader:

consul/agent/consul/rpc.go, lines 242 to 248 (at 06f9800):

RETRY:
	// Gate the request until there is a leader
	if firstCheck.IsZero() {
		firstCheck = time.Now()
	}
	if time.Since(firstCheck) < s.config.RPCHoldTimeout {
		jitter := lib.RandomStagger(s.config.RPCHoldTimeout / jitterFraction)
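
To tie the two (truncated) snippets above together, here is a minimal, self-contained sketch of the pattern they illustrate: keep retrying a failing call with a small jittered wait until the hold timeout has elapsed since the first attempt. This is not Consul's actual code; the function, constant, and timeout values below are made up for illustration:

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// jitterFraction mirrors the idea of the jitterFraction constant in the
// snippets above: each wait is a random slice of (holdTimeout / jitterFraction).
const jitterFraction = 16

// rpcWithHold keeps retrying a failing call, sleeping a small random
// stagger between attempts, until holdTimeout has elapsed since the
// first attempt. It returns the last error once the budget is spent.
func rpcWithHold(call func() error, holdTimeout time.Duration) error {
	firstCheck := time.Now()
	for {
		err := call()
		if err == nil {
			return nil
		}
		if time.Since(firstCheck) >= holdTimeout {
			return err // hold budget exhausted, surface the error
		}
		// Wait a jittered fraction of the hold timeout, then retry
		// (in Consul this is where a different server may be picked).
		time.Sleep(time.Duration(rand.Int63n(int64(holdTimeout / jitterFraction))))
	}
}

func main() {
	attempts := 0
	err := rpcWithHold(func() error {
		attempts++
		if attempts < 3 {
			// Simulate the error seen in the logs above until a "leader" appears.
			return errors.New("rpc error making call: No cluster leader")
		}
		return nil
	}, 7*time.Second) // e.g. a 7s hold budget
	fmt.Println("attempts:", attempts, "err:", err)
}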

It's typical to set the hold timeout a little higher than the drain time to account for the time taken by the initial and final round trips, scheduling delays, etc.

Those retry/leave drain times need to be set somewhat higher than the slowest leader election might take in your environment to ensure you never see a 500. Of course there is a tradeoff: if they get too long, clients might eventually give up on the response anyway rather than wait 20+ seconds for a success.

The raft_multiplier controls how quickly leader elections can happen. We recommend 1 for production, but it's 5 by default, per the docs, because many people first spin up Consul on very small VMs with limited network and CPU performance and complained that they saw leader instability.

There is a tradeoff between leader stability (false-positive failures caused by high CPU or network variability delaying heartbeats by a few hundred milliseconds) and how long it takes to detect a real leader failure. We expect production users to have well-provisioned machines for their load, so raft_multiplier = 1 should not cause too much "flapping" of leadership while still allowing quick detection of real leader failures.

I realise I've not given a super clear answer on exactly what to set, and that's because this is all fairly subtle and variable stuff: it depends on your hardware, workload, topology, etc. I hope that helps build a mental model for how those parameters interact.

@ChipV223 (Contributor)

@holtwilkins: Closing this due to inactivity; later versions of Consul have also made improvements to how the cluster functions when a node gracefully leaves. Feel free to re-open this with more information should you come across this again!
