Consul downtime when leader leaves #3893
Comments
@holtwilkins We made some changes in #3514 to make RPC handling during leader elections more robust. Config options under performance can be used to mitigate 500s during server leaves and leader elections. Looking at your logs, it appears that the default values for those options are still in effect.
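(For readers following along: below is a minimal sketch of the performance stanza being referred to, written in HCL; the keys are the documented rpc_hold_timeout and leave_drain_time options, but the values shown are purely illustrative, not recommendations.)

```hcl
# Sketch only: agent configuration fragment showing the "performance"
# stanza discussed above. Values are illustrative; tune per environment.
performance {
  # How long client agents and follower servers hold and retry internal
  # RPCs during a leader election before returning an error.
  rpc_hold_timeout = "10s"

  # How long a server dwells during a graceful leave so in-flight
  # requests can be retried against the remaining servers.
  leave_drain_time = "7s"
}
```

The same settings can be expressed as JSON in the agent config file; the HCL form is used here only for the inline comments.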
Thanks @preetapan, I had missed these options being added in the 1.0 notes then. I'm curious why you're suggesting increasing those two values as opposed to lowering the raft_multiplier down to 1? If I should increase those two timeouts, what values would be reasonable? Just in general, it seems like these defaults could be tuned a bit, given that I'm running the defaults on rather beefy consul servers and getting outages during leader elections...
@holtwilkins yeah, the defaults are intentionally tuned for minimal, low-power setups as noted in https://www.consul.io/docs/guides/performance.html#minimum-server-requirements and the specific link @preetapan gave before. The guide explains why the defaults are chosen as they are; it's expected, and hopefully clearly documented, that "production" installations are likely to need to tune them.
Not sure if Preetha assumed you had already followed the docs for a production setup, but yeah, it seems you should try that first and then increase the timeouts if you still see issues in your environment. Hope this helps :)
Yup @banks, I see the notes about "production" clusters now. I dropped …
@banks @preetapan can you please expand on @holtwilkins's request? We are currently facing the same issue in our environment.
Hi @guidoiaquinti @holtwilkins, sorry for the delay, I missed the mention here.
The simplest difference is visible in the code: lines 760 to 766 in 3c96d64.

While on the client side: lines 237 to 275 in b797f46.

And for servers forwarding to the leader: lines 242 to 248 in 06f9800.
It's typical to set the hold timeout a little higher to account for the time taken for initial and final round trips, scheduling delays, etc. Those retry/leave drain times need to be set somewhat higher than the slowest leader election might take in your environment to ensure you never see a 500. Of course there is a tradeoff there: if they get too long, then eventually clients might give up on the response anyway rather than wait 20+ seconds for a success.

The raft_multiplier is a different kind of knob. There is a tradeoff between leader stability (false-positive failures caused by high CPU or network variability causing heartbeats to be delayed by a few hundred milliseconds) and how long it takes to detect a real leader failure. We expect production users to have well-provisioned machines for their load, so having a lower raft_multiplier is usually fine there.

I realise I've not given a super clear answer on exactly what to set, and that's because this is all fairly subtle and variable stuff - it depends on your hardware, workload, topology etc. I hope that helps build a mental model for how those parameters interact.
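(To make that concrete, here is a hedged sketch of a production-leaning stanza along the lines described above: raft_multiplier scales the base Raft timing, so the minimal-hardware default of 5 stretches elections out to several seconds, while well-provisioned servers are typically run with 1. The exact values below are illustrative, not prescriptive.)

```hcl
# Sketch of a production-leaning performance stanza (illustrative values).
performance {
  # Multiplier applied to the base Raft timing parameters. The default of 5
  # suits minimal/low-power servers; 1 detects a failed leader much faster
  # but tolerates less CPU/network jitter before a false-positive election.
  raft_multiplier = 1

  # Keep these comfortably above the slowest leader election you expect,
  # so RPCs are held and retried across the election instead of failing,
  # but not so long that callers give up waiting for a response.
  rpc_hold_timeout = "10s"
  leave_drain_time = "7s"
}
```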
@holtwilkins: Closing this due to inactivity; later versions of Consul have made improvements to how the cluster behaves when a node gracefully leaves. Feel free to re-open this with more information should you come across this again!
Description of the Issue (and unexpected/desired result)

I would expect that, if the consul leader issues consul leave, clients shouldn't receive RPC failures from consul failing to have elected a new leader. Instead, there seems to be around 15s where clients can receive failures until consul has elected a new leader. Is there a "more correct" way to have a consul leader leave the cluster?

Reproduction steps

Issue consul leave on the cluster leader.

consul version for both Client and Server

Client: v1.0.6
Server: v1.0.6

consul info for both Client and Server

Client:
Server:

Operating system and Environment details

Ubuntu 16.04 in AWS.

Log Fragments or Link to gist

EDIT: I failed to note before that, as of the time of the log snippet above, server-use1-2-10-2-211-100 was the cluster leader at the beginning of the log snippet. So, in the snippet, we see it exiting the consul cluster, then see the RPC errors, then consul elects server-use1-0-10-2-210-100 as the leader.