
Conversation

@eyazici90 (Contributor) commented Apr 3, 2025

Context

Although resolvePeers is context aware, it is called with a background context, which can turn it into a blocking call: under the hood it calls LookupIPAddr, which performs a DNS lookup that may be blocked on some platforms.

When DNS lookups are blocked, the call hangs the main goroutine, leaving the application unresponsive and repeatedly restarted by kube, with no information logged by the app.
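For illustration only (this is not the alertmanager code, and the hostname is a placeholder), a minimal sketch of the failure mode and of how a deadline bounds it:

// With context.Background() the lookup has no deadline, so if DNS packets are
// silently dropped the call can block forever. Wrapping the same call in
// context.WithTimeout bounds how long the caller can hang.
package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	res := &net.Resolver{}

	// Unbounded: if DNS traffic is blackholed, this call may never return.
	// _, _ = res.LookupIPAddr(context.Background(), "alertmanager-1.example.internal")

	// Bounded: the lookup returns a context error after at most 15 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if _, err := res.LookupIPAddr(ctx, "alertmanager-1.example.internal"); err != nil {
		log.Printf("lookup failed (possibly context deadline exceeded): %v", err)
	}
}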

The logs below are taken from one of the alertmanager kube pods.

ts=2025-04-03T11:12:38.668Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=HEAD, revision=x)"
ts=2025-04-03T11:12:38.668Z caller=main.go:182 level=info build_context="(go=go1.21.7, platform=linux/amd64, user=root@x, date=20240228-11:51:20, tags=netgo)"

The Cilium policy allowing DNS resolution had not been applied yet. This led to the app hanging with no logs and being restarted over and over, because the liveness probes (/healthy, /ready) were never registered, which was pretty painful to troubleshoot.

This was due to LookupIPAddr hanging on the main goroutine.

select {
	case <-ctx.Done():
		// Our context was canceled. If we are the only
		// goroutine looking up this key, then drop the key
		// from the lookupGroup and cancel the lookup.
		// If there are other goroutines looking up this key,
		// let the lookup continue uncanceled, and let later
		// lookups with the same key share the result.
		// See issues 8602, 20703, 22724.
		if r.getLookupGroup().ForgetUnshared(lookupKey) {
			lookupGroupCancel()
			go dnsWaitGroupDone(ch, func() {})
		} else {
			go dnsWaitGroupDone(ch, lookupGroupCancel)
		}
		err := newDNSError(mapErr(ctx.Err()), host, "")
		if trace != nil && trace.DNSDone != nil {
			trace.DNSDone(nil, false, err)
		}
		return nil, err
	case r := <-ch:
		dnsWaitGroup.Done()
		lookupGroupCancel()
		err := r.Err
		if err != nil {
			if _, ok := err.(*DNSError); !ok {
				err = newDNSError(mapErr(err), host, "")
			}
		}
		if trace != nil && trace.DNSDone != nil {
			addrs, _ := r.Val.([]IPAddr)
			trace.DNSDone(ipAddrsEface(addrs), r.Shared, err)
		}
		return lookupIPReturn(r.Val, err, r.Shared)
	}

Letting the control flow continue leads to l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ",")), so the app is no longer blocked from logging or from registering its liveness probes (/healthy, /ready).

@eyazici90 (Contributor Author) commented Apr 3, 2025

Hey @grobinson-grafana 👋
Could u take a quick glance at that, lmk what u think? Based on ur response, we may push it further or not.

// Current call, using a background context (can block indefinitely):
resolvedPeers, err := resolvePeers(context.Background(), knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty)

// Proposed change, bounding peer resolution with a timeout:
ctx, cancel := context.WithTimeout(context.Background(), resolveTimeout)
defer cancel()
resolvedPeers, err := resolvePeers(ctx, knownPeers, advertiseAddr, &net.Resolver{}, waitIfEmpty)
Collaborator

OK, I think I understand. If this times out, Alertmanager will still fail to start because the peer cannot be resolved, but at least this time it emits an error message?

Contributor Author

I don't think AM will fail to start. See https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L737-L740. It just takes the direct addr & moves on with the loop.

// inlining
ips, err := res.LookupIPAddr(ctx, host)
if err != nil {
	// Assume direct address.
	resolvedPeers = append(resolvedPeers, peer)
	continue
}

Afterwards, we will see the debug log (https://github.com/prometheus/alertmanager/blob/main/cluster/cluster.go#L173) with the resolved addresses for the peers.

	l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ","))

Far better than a hanging app with no logs emitted, imho 😓

@grobinson-grafana (Collaborator) Apr 9, 2025

So how does Alertmanager behave if it continues without having resolved peers? I haven't had time to look into it, I'm afraid, so I'm asking here.

Contributor Author

Seems like there is a test for initially failed peers. I reckon AM (alertmanager) tries to reconnect for a certain amount of time and eventually drops failed peers. Nevertheless, health probes and the rest would work the same; only gossip settling won't. That is my understanding at least :D

Contributor

The refresh method is called with runPeriodicTask, so if alertmanager fails to resolve peers on start, it'll try again after DefaultRefreshInterval.

And I agree that alertmanager won't fail to start/crash - failing to resolve peers only writes a debug message. Furthermore, failing to join the cluster is not treated as a fatal error at startup.
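For context, a minimal sketch (an assumed shape, not alertmanager's actual runPeriodicTask implementation) of a periodic-task loop that keeps retrying a refresh function until its context is done:

package main

import (
	"context"
	"log"
	"time"
)

// runPeriodicTask is an illustrative stand-in: it re-runs task every interval
// and stops when ctx is cancelled. A failed run is logged and retried on the
// next tick rather than treated as fatal.
func runPeriodicTask(ctx context.Context, interval time.Duration, task func(context.Context) error) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := task(ctx); err != nil {
				log.Printf("refresh failed, will retry: %v", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	// The interval here merely stands in for DefaultRefreshInterval.
	runPeriodicTask(ctx, 15*time.Second, func(context.Context) error {
		log.Println("refreshing cluster peers...")
		return nil
	})
}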

Contributor

And I agree that alertmanager won't fail to start/crash - failing to resolve peers only writes a debug message. Furthermore, failing to join the cluster is not treated as a fatal error at startup.

Ah sorry, this is the wrong reasoning, since cluster.Create calls resolvePeers and errors from that call are treated as fatal. However, as @eyazici90 already noted, resolution errors aren't bubbled up out of resolvePeers unless the initial lookup returns no results and no error and then future lookups raise an error.

@eyazici90 eyazici90 marked this pull request as ready for review April 9, 2025 08:52
@eyazici90 (Contributor Author)

Could u help me with review @SuperQ ?

@Spaceman1701 (Contributor) left a comment

Did a more detailed read through the code and didn't find anything that concerns me.

I think this makes a lot of sense, and it's very bad behavior to hang on resolvePeers right now.

I am a little nervous about this change since it essentially adds a new exit condition, but I think 15 seconds is more than long enough to do a few DNS resolves. I guess if other people are also concerned, we could bump it to 30s by default, but that seems extremely long.
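To make that concrete, here is a rough sketch (names and hosts are placeholders, not the actual resolvePeers code) showing that all lookups share the single 15s parent deadline, so the worst-case startup delay is bounded by the timeout in total, not per peer:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveAll mimics the shape of resolvePeers: every lookup uses the same
// parent context, and a failed lookup falls back to the raw peer address.
func resolveAll(ctx context.Context, res *net.Resolver, peers []string) []string {
	out := make([]string, 0, len(peers))
	for _, peer := range peers {
		ips, err := res.LookupIPAddr(ctx, peer)
		if err != nil {
			// Assume direct address, as the upstream snippet above does.
			out = append(out, peer)
			continue
		}
		for _, ip := range ips {
			out = append(out, ip.String())
		}
	}
	return out
}

func main() {
	// One 15-second budget covers the whole loop, however many peers there are.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	fmt.Println(resolveAll(ctx, &net.Resolver{}, []string{"am-0.example.internal", "am-1.example.internal"}))
}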

@SuperQ (Member) commented Nov 6, 2025

Looks like a couple tests need fixing.

@eyazici90 eyazici90 force-pushed the main branch 2 times, most recently from e1138e9 to 01a546b Compare November 6, 2025 18:26
Although resolvePeers is context aware, it uses a background context, which can turn it into a blocking call: under the hood it calls `LookupIPAddr`, which performs a DNS lookup that may be blocked on some platforms.

When DNS lookups are blocked, the call hangs the main goroutine, leaving the application unresponsive and causing the readiness check to fail.

Signed-off-by: emreya <e.yazici1990@gmail.com>
Signed-off-by: emreya <e.yazici1990@gmail.com>
Signed-off-by: emreya <e.yazici1990@gmail.com>
@eyazici90 (Contributor Author)

I guess if other people are also concerned, we could bump it to 30s by default, but that seems extremely long.

I am fine with changing the default to 30s; however, imo 15s is already enough as a default.

@eyazici90 (Contributor Author)

Looks like a couple tests need fixing.

Fixed @SuperQ

@SuperQ SuperQ merged commit 1fe77cf into prometheus:main Nov 6, 2025
7 checks passed
@SoloJacobs SoloJacobs mentioned this pull request Nov 24, 2025