docs: add comprehensive troubleshooting section to README #4711
base: main
- Add troubleshooting section with common issues and solutions
- Include cluster connectivity problems and DNS resolution timeouts
- Add guidance for alerts/notifications not working
- Include memory usage and configuration reload issues
- Provide practical examples and commands for debugging

This helps users quickly resolve common operational issues without needing to search through multiple documentation sources.

Signed-off-by: abhishek-dbz <abhibro936@gmail.com>
In 92ecf8b, `silence_bench_test.go` was left behind because it is not run automatically, and it started failing. Fix by passing a new registry when creating Silences.

Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: abhishek-dbz <abhibro936@gmail.com>
5f4d4ab to 4fbc391
ultrotter left a comment
Thanks, that's useful! It might also be worth adding information about which metrics on Alertmanager itself to put in a dashboard or monitor.
**Solutions:**
- Check for alert storms - large number of unique alert groups
- Review `group_by` labels in routing configuration
- Consider using more specific grouping to reduce alert group count
Would this better read "broader", since it sounds like if you go for more specific, you'll get more groups, not fewer?
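To make the point about "broader" versus "more specific" grouping concrete, here is a minimal sketch in Python. It assumes, as a simplification, that the number of alert groups within a single route equals the number of distinct value combinations of the `group_by` labels. The alert labels and the `count_groups` helper are hypothetical and are not part of Alertmanager.

```python
from itertools import product


def count_groups(alerts, group_by):
    """Count distinct combinations of the group_by label values.

    Simplified model: one group per unique tuple of label values.
    A missing label is treated as an empty string.
    """
    return len({tuple(alert.get(label, "") for label in group_by) for alert in alerts})


# Hypothetical firing alerts: 2 alert names x 3 clusters x 4 instances.
alerts = [
    {"alertname": name, "cluster": cluster, "instance": instance}
    for name, cluster, instance in product(
        ["HighLatency", "DiskFull"],
        ["eu", "us", "ap"],
        ["node-1", "node-2", "node-3", "node-4"],
    )
]

# Broader grouping (fewer labels) yields fewer groups.
broad = count_groups(alerts, ["alertname"])
# More specific grouping (more labels) yields more groups.
specific = count_groups(alerts, ["alertname", "cluster", "instance"])

print(f"group_by=[alertname]: {broad} groups")
print(f"group_by=[alertname, cluster, instance]: {specific} groups")
```

Under this model, adding labels to `group_by` can only keep the group count the same or increase it, which supports wording the README advice as "broader" grouping.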
**Solutions:**
- Check for alert storms - large number of unique alert groups
- Review `group_by` labels in routing configuration
We could remove this line, which doesn't say how to review them, and merge it with the one below.
I'd suggest this move to the |
#### Cluster peers not connecting
**Symptoms:** Alertmanager instances cannot discover each other in cluster mode.
It may make sense to add a sentence on how to detect that this is the case, e.g. from the logs or from the peer list on the status page.
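One possible way to detect this from the peer list is sketched below in Python. It assumes the status payload has the shape returned by Alertmanager's `GET /api/v2/status` endpoint, with a `cluster.status` field and a `cluster.peers` list of objects that carry `name` and `address`. That shape should be checked against the Alertmanager version in use. The `cluster_peer_report` helper, the sample payload, and the expected peer count are all hypothetical.

```python
def cluster_peer_report(status, expected_peers):
    """Summarize cluster health from a status payload (assumed /api/v2/status shape)."""
    cluster = status.get("cluster", {})
    peers = cluster.get("peers") or []
    cluster_status = cluster.get("status")
    return {
        "cluster_status": cluster_status,
        "peer_count": len(peers),
        "peer_addresses": sorted(peer.get("address", "") for peer in peers),
        # Healthy only if the cluster reports ready and every expected peer is listed.
        "healthy": cluster_status == "ready" and len(peers) >= expected_peers,
    }


# Hypothetical payload from one instance of a three-node cluster that only sees itself.
sample_status = {
    "cluster": {
        "name": "01EXAMPLE",
        "status": "ready",
        "peers": [{"name": "01EXAMPLE", "address": "10.0.0.1:9094"}],
    }
}

report = cluster_peer_report(sample_status, expected_peers=3)
print(report)
```

In this sample the instance lists only itself as a peer, so the report flags the cluster as unhealthy even though the cluster status reads `ready`.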