-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mixin: Fix alert about unhealthy sidecar #2929
Conversation
6c4a018
to
6678f08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 🥇 It needs rebasing otherwise we can merge
6678f08
to
8960536
Compare
@kakkoyun thank you for the approval. I have just rebased it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized a minor thing, could have another look at it?
@hwoarang Friendly ping. |
@kakkoyun apologies for the delay but due to holidays, connectivity and time are at a premium :) nevertheless I will try to get to it as soon as possible. |
c82af6d
to
bb529df
Compare
@kakkoyun I believe I have addressed all your concerns now |
The alert was giving the wrong information as the $value contained the number of pods that failing to send heartbeat instead of the actual number of seconds that each sidecar was being unhealthy. Also the 5 minute interval is probably too low as on large deployments prometheus could take much longer to come up online and for sidecar to become actually useful. As such, we can simply subtract the timestamp of the last heartbeat from the current time and fire if we are lagging for more than 10 minutes. Signed-off-by: Markos Chandras <markos@chandras.me>
bb529df
to
c3bdfbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@hwoarang Thanks a lot 🙏 |
The alert was giving the wrong information as the $value contained
the number of pods that failing to send heartbeat instead of the actual
number of seconds that each sidecar was being unhealthy.
Also the 5 minute interval is probably too low as on large deployments
prometheus could take much longer to come up online and for sidecar to
become actually useful.
As such, we can simply subtract the timestamp of the last heartbeat from
the current time and fire if we are lagging for more than 10 minutes.
This is also consistent with the current unit tests.
Changes
Improve alert for sidecar to provide correct information and not fire too soon
Verification
Deployed locally and did a prometheus update and it did not fire anymore