raft loop prober with counter #16713
base: main
Conversation
cc @ahrtr
This approach depends on the liveness probe's frequency / interval.
Overall looks good. Please feel free to reuse (it might need a minor update) the test case in #16710.
server/etcdserver/server.go (outdated)
@@ -1643,6 +1644,10 @@ func (s *EtcdServer) AppliedIndex() uint64 { return s.getAppliedIndex() }

 func (s *EtcdServer) Term() uint64 { return s.getTerm() }

+// ProbeRaftLoopProgress probes whether the etcdserver raft loop has deadlocked
+// since the last time the function was invoked.
+func (s *EtcdServer) ProbeRaftLoopProgress() bool { return s.r.safeReadTickElapsedAndClear() != 0 }
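For context, a minimal sketch of this counter-based approach (the `raftNode` type and `tickElapsed` field here are illustrative stand-ins, not the PR's exact code): the raft loop bumps an atomic counter on every tick, and the probe atomically reads and clears it, so it reports whether any tick happened since the previous probe.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// raftNode is a stand-in for etcd's internal raft node; tickElapsed counts
// raft-loop ticks observed since the last probe (hypothetical field name).
type raftNode struct {
	tickElapsed int64
}

// tick is called by the raft loop on every heartbeat tick.
func (r *raftNode) tick() {
	atomic.AddInt64(&r.tickElapsed, 1)
}

// safeReadTickElapsedAndClear atomically reads the tick counter and resets it,
// so each probe observes only the ticks since the previous probe.
func (r *raftNode) safeReadTickElapsedAndClear() int64 {
	return atomic.SwapInt64(&r.tickElapsed, 0)
}

func main() {
	r := &raftNode{}
	r.tick()
	fmt.Println(r.safeReadTickElapsedAndClear() != 0) // true: the loop ticked
	fmt.Println(r.safeReadTickElapsedAndClear() != 0) // false: no tick since the last probe
}
```

The second call returning false is exactly the failure mode discussed below: a probe interval shorter than the tick interval reads an empty counter even when the loop is healthy.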
The probe response should not be determined by the frequency of probing. If someone probes etcd more frequently than proposals are happening, this will return failure.
I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe. The implementation would be tricky with concurrency; let me know if this sounds like a good approach to you and I can help. :P
> The probe response should not be determined by the frequency of probing. If someone probes etcd more frequently than proposals are happening, this will return failure.

Thanks, in that case maybe adding a simple throttler in the etcdserver raft Node should help.

> I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe.

Yeah, a small doc comparing the pros and cons should help.
I am also interested in how apiserver implements /livez and /readyz.
> I am also interested in how apiserver implements /livez and /readyz.

https://kubernetes.io/docs/reference/using-api/health-checks/
As far as I can tell, health probes in apiserver are generally not complicated. They mostly check whether something has started or been initialized, such as https://github.com/kubernetes/kubernetes/blob/c7d270302c8de3afc9d7b01c70faf3a18407ce44/staging/src/k8s.io/client-go/tools/cache/shared_informer.go#L173
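For comparison, a minimal sketch of that apiserver-style pattern (names are illustrative, modeled loosely on the informer's HasSynced check): the health handler consults a one-way initialization flag rather than tracking ongoing progress.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// startedCheck mimics the shared_informer HasSynced style: the probe only
// reports whether initialization finished, not whether work is still flowing.
type startedCheck struct {
	started atomic.Bool // requires Go 1.19+
}

// markStarted is called once initialization completes; the flag never resets.
func (c *startedCheck) markStarted() { c.started.Store(true) }

// healthy is what a /livez or /readyz handler would consult.
func (c *startedCheck) healthy() bool { return c.started.Load() }

func main() {
	c := &startedCheck{}
	fmt.Println(c.healthy()) // false: not yet initialized
	c.markStarted()
	fmt.Println(c.healthy()) // true: and it stays true from here on
}
```

The contrast with this PR is the interesting part: a started-flag check can never detect a deadlock that happens after startup, which is precisely what the raft loop prober is for.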
Added a throttler so the raft loop prober check runs at most once every 5 seconds.
As promised, I wrote a small doc to compare the pros and cons.
Could you please take a look? @ahrtr @serathius @siyuanfoundation Thanks~
Thanks. Overall this PR looks good to me; I also added comments in the doc.
Force-pushed from 6e24d08 to a109a7e.
Signed-off-by: Chao Chen <chaochn@amazon.com>
Force-pushed from a109a7e to d126728.
Discussed during the sig-etcd triage meeting. This is still an important PR for etcd. @chaochn47, do you have some time to rebase and continue this effort?
Yeah, I will take a look and drive this to resolution next week.
@chaochn47: The following tests failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.