-
Notifications
You must be signed in to change notification settings - Fork 2.4k
fix: set context timeout for resolvePeers #4343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hey @grobinson-grafana 👋 |
| resolvedPeers, err := resolvePeers(context.Background(), knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty) | ||
| ctx, cancel := context.WithTimeout(context.Background(), resolveTimeout) | ||
| defer cancel() | ||
| resolvedPeers, err := resolvePeers(ctx, knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I understand. If this timesout, Alertmanager will still fail to start because the peer cannot be resolved, but at least this time it emits an error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think AM will fail to start. See https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L737-L740. It just takes the direct addr & moves on with the loop.
// inlining
ips, err := res.LookupIPAddr(ctx, host)
if err != nil {
// Assume direct address.
resolvedPeers = append(resolvedPeers, peer)
continue
}Afterwards we will see debug logs https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L173 resolved addr for peers.
l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ","))Far better than hanging app with no logs emitted, imho 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So how does Alertmanager behave if it continues without having resolved peers? I haven't had time to look into it I'm afraid so I'm asking here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like there is a test for initially failed peers. I reckon, am(alertmngr) tries to reconnect for certain amount of time and eventually drops failed peers. Nevertheless, health probes and rest would work same. Only gossip settlings wont. That is my understanding at least :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refresh method is called by with runPeriodicTask, so if alertmanager fails to resolve peers on start, it'll try again after DefaultRefreshInterval.
And I agree that alertmanager won't fail to start/crash - failing to resolve peers only writes a debug message:. Furthermore, failing to join the cluster is not treated as a fatal error at startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I agree that alertmanager won't fail to start/crash - failing to resolve peers only writes a debug message:. Furthermore, failing to join the cluster is not treated as a fatal error at startup.
Ah sorry, this is the wrong reasoning since cluster.Create calls resolvePeers and errors from that call are treated as fatal. However, as @eyazici90 already noted, resolution errors aren't bubbled up out of resolvePeers unless the initial lookup returns no results and no error and then future lookups raise an error
|
Could u help me with review @SuperQ ? |
Spaceman1701
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a more detailed read through the code and didn't find that concerns me.
I think this makes a lot of sense, and it's very bad behavior to hang on resolvePeers right now.
I am a little nervous about this change since it essentially adds a new exit condition, but I think 15 seconds is more than long enough to do a few DNS resolves. I guess if other people are also concerned, we could bump it to 30s by default, but that seems extremely long.
|
Looks like a couple tests need fixing. |
e1138e9 to
01a546b
Compare
Given resolvePeers is context aware, yet, uses background that may lead to blocking call. This is due to the fact that `LookupIPAddr` is called under the hood that does dns look up which may be blocked in some platforms. In case of dns lookups are blocked may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as readiness check to fail. Signed-off-by: emreya <e.yazici1990@gmail.com>
Signed-off-by: emreya <e.yazici1990@gmail.com>
Signed-off-by: emreya <e.yazici1990@gmail.com>
I am fine to change it as |
Fixed @SuperQ |
Context
Given resolvePeers is context aware, yet, it is used by background context that may lead to blocking call. This is due to the fact that
LookupIPAddris called under the hood that does dns look up which may be blocked in some platforms.In case of dns lookups are blocked, it may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as keep getting restarted by kube with no info in the app.
Below logs are taken from one of the alertmanager kube/pods.
Cillumpolicy was not applied to allow to dns resolve yet. This led to app hangs with no logs & keep getting restarted due to liveness probes(
/healthy,/ready) were not registered. Which was pretty painful to troubleshoot.This was due to LookupIPAddr hanging on main goroutine.
Letting the control flow continues will lead to l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ",")) that will not block the app to log as well as registering liveness probes(
/healthy,/ready).