alertmanager messages lost due to high availability mode #4179

Open
tianshimoyi opened this issue Dec 23, 2024 · 2 comments
Comments

@tianshimoyi

What did you do?

Alertmanager intermittently loses resolved (recovery) notifications.

What did you expect to see?
Alertmanager notifications should not be lost.

What did you see instead? Under which circumstances?
[Screenshot 2024-12-23 17:36:42: Alertmanager debug logs, reproduced below]

I deployed a cluster of four Alertmanager nodes. The node whose logs appear in the screenshot has cluster position 3, so it waits 45s before flushing and sending the notification. As the logs show, while it was waiting to send the resolved notification, a new firing alert entry with the same labels arrived. Alertmanager node 0 sent the notification for that new alert entry and gossiped its notification log to the other nodes. By the time node 3 flushed, it had already received the new alert, but it still sent the old resolved (recovery) notification and gossiped that stale entry to the other nodes, so the resolved notification for the new alert entry was never sent.
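For context, here is a minimal sketch of the position-based wait I am describing (my understanding of the HA design, not a quote of Alertmanager's code): each peer delays its notification by its cluster position multiplied by the peer timeout (--cluster.peer-timeout, 15s by default), which is how a node at position 3 ends up waiting 45s.

package main

import (
	"fmt"
	"time"
)

// peerWait mirrors the position-based delay used in HA mode: higher-position
// peers wait longer so that lower-position peers can notify first and gossip
// the result. Names here are illustrative, not Alertmanager's exact code.
func peerWait(position int, peerTimeout time.Duration) time.Duration {
	return time.Duration(position) * peerTimeout
}

func main() {
	const peerTimeout = 15 * time.Second // --cluster.peer-timeout default
	for pos := 0; pos < 4; pos++ {
		fmt.Printf("node at position %d waits %s before sending\n", pos, peerWait(pos, peerTimeout))
	}
	// A node at position 3 waits 45s, matching the delay seen in the logs above.
}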

  • Alertmanager version:
    v0.24.0

  • Logs:

ts=2024-12-21T06:13:49.106Z caller=dispatch.go:549 level=debug component=dispatcher aggrGroup="{}/{alertIndex!=\"\",alertObject!=\"\",business=\"tcloud-openplatform\",repeattime=\"ten_minutes\",serviceType!=\"\",tmplId!=\"\"}:{alertIndex=\"0\", alertgroup=\"cloudapp.biz-recsys.feature-updater.wg-binlog-live-qingmiao.327dbe35-1f5b.prod\", alertname=\"error_log\", serviceType=\"cloudApp\"}" msg=flushing alerts=[error_log[f6c3f84][resolved]]
ts=2024-12-21T06:13:49.106Z caller=dispatch.go:169 level=debug component=dispatcher msg="Received alert" alert=error_log[f6c3f84][active]
ts=2024-12-21T06:13:49.138Z caller=dispatch.go:169 level=debug component=dispatcher msg="Received alert" alert=error_log[f6c3f84][active]
ts=2024-12-21T06:14:49.105Z caller=dispatch.go:169 level=debug component=dispatcher msg="Received alert" alert=error_log[f6c3f84][resolved]
@tianshimoyi
Copy link
Author

Should we also include the start time of the alert entry in the hashAlert function, to distinguish whether it is a new alert?

func hashAlert(a *types.Alert) uint64 {
	const sep = '\xff'

	hb := hashBuffers.Get().(*hashBuffer)
	defer hashBuffers.Put(hb)
	b := hb.buf[:0]

	names := make(model.LabelNames, 0, len(a.Labels))

	for ln := range a.Labels {
		names = append(names, ln)
	}
	sort.Sort(names)

	for _, ln := range names {
		b = append(b, string(ln)...)
		b = append(b, sep)
		b = append(b, string(a.Labels[ln])...)
		b = append(b, sep)
	}
	// Add the start time to distinguish whether it is a new alert?
	b = append(b, sep)
	b = append(b, a.StartsAt.String()...)
	hash := xxhash.Sum64(b)

	return hash
}
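To illustrate what this would change, here is a standalone sketch (not Alertmanager's actual code; hashWithStart is a hypothetical helper and the labels are taken from the logs above): with StartsAt mixed into the hash, a new alert that reuses the same label set no longer produces the same hash as the old resolved entry, so the two would no longer be deduplicated against each other.

package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/cespare/xxhash/v2"
)

// hashWithStart is a simplified stand-in for the modified hashAlert above:
// it hashes the sorted label pairs and then the alert's StartsAt.
func hashWithStart(labels map[string]string, startsAt time.Time) uint64 {
	const sep = '\xff'

	names := make([]string, 0, len(labels))
	for ln := range labels {
		names = append(names, ln)
	}
	sort.Strings(names)

	var b []byte
	for _, ln := range names {
		b = append(b, ln...)
		b = append(b, sep)
		b = append(b, labels[ln]...)
		b = append(b, sep)
	}
	b = append(b, sep)
	b = append(b, startsAt.String()...)

	return xxhash.Sum64(b)
}

func main() {
	labels := map[string]string{"alertname": "error_log", "alertIndex": "0"}
	oldHash := hashWithStart(labels, time.Date(2024, 12, 21, 5, 0, 0, 0, time.UTC))
	newHash := hashWithStart(labels, time.Date(2024, 12, 21, 6, 13, 49, 0, time.UTC))
	// Same label set, different start times: the hashes differ, so the old
	// resolved entry would no longer shadow the new alert.
	fmt.Println(oldHash == newHash) // false
}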

@tianshimoyi
Author

@grobinson-grafana Could you please help me? Thank you very much.
