@ABHISHEK-DBZ

  • Add troubleshooting section with common issues and solutions
  • Include cluster connectivity problems and DNS resolution timeouts
  • Add guidance for alerts/notifications not working
  • Include memory usage and configuration reload issues
  • Provide practical examples and commands for debugging

This helps users quickly resolve common operational issues without needing to search through multiple documentation sources.

ABHISHEK-DBZ and others added 2 commits November 7, 2025 22:00
- Add troubleshooting section with common issues and solutions
- Include cluster connectivity problems and DNS resolution timeouts
- Add guidance for alerts/notifications not working
- Include memory usage and configuration reload issues
- Provide practical examples and commands for debugging

This helps users quickly resolve common operational issues without
needing to search through multiple documentation sources.

Signed-off-by: abhishek-dbz <abhibro936@gmail.com>

In 92ecf8b, silence_bench_test.go was left behind since it's not run
automatically, and it started failing.

Fix by passing a new registry when creating Silences.

Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: abhishek-dbz <abhibro936@gmail.com>
@ABHISHEK-DBZ force-pushed the docs/add-troubleshooting-section branch from 5f4d4ab to 4fbc391 on November 7, 2025 16:31
Contributor

@ultrotter ultrotter left a comment


Thanks, that's useful! It might also be worth adding information about which of Alertmanager's own metrics to put on a dashboard or monitor.

**Solutions:**
- Check for alert storms - large number of unique alert groups
- Review `group_by` labels in routing configuration
- Consider using more specific grouping to reduce alert group count

Would this better read "broader", since it sounds like if you go for more specific, you'll get more groups, not fewer?
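
To make the point above concrete, here is a minimal sketch of how the number of alert groups changes with the `group_by` labels. The alerts, label names, and the `count_groups` helper are all hypothetical and only illustrate the counting; this is not Alertmanager's actual grouping code.

```python
# Hypothetical alerts; label names and values are illustrative only.
alerts = [
    {"alertname": "HighCPU", "cluster": "a", "instance": "n1"},
    {"alertname": "HighCPU", "cluster": "a", "instance": "n2"},
    {"alertname": "HighCPU", "cluster": "b", "instance": "n3"},
    {"alertname": "DiskFull", "cluster": "a", "instance": "n1"},
]


def count_groups(alerts, group_by):
    """Count the distinct label-value tuples produced by the group_by labels."""
    keys = {tuple(alert.get(label, "") for label in group_by) for alert in alerts}
    return len(keys)


# Broader grouping: fewer labels, so more alerts share a group.
broad = count_groups(alerts, ["alertname"])

# More specific grouping: more labels, so alerts split into more groups.
specific = count_groups(alerts, ["alertname", "cluster", "instance"])

print(f"broad={broad} specific={specific}")
```

With this sample data the broader grouping yields fewer groups than the more specific one, which supports wording the advice as "broader" grouping.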


**Solutions:**
- Check for alert storms - large number of unique alert groups
- Review `group_by` labels in routing configuration

We could possibly remove this line, which doesn't specify how to review them, and merge it with the one below.

@siavashs
Contributor

I'd suggest this move to the docs and not the README.
Then it can be part of https://prometheus.io/docs/guides/


#### Cluster peers not connecting

**Symptoms:** Alertmanager instances cannot discover each other in cluster mode.

Maybe it makes sense to add a sentence on how to detect that this is the case, e.g. from the logs or from the status-page peer list.
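
As a rough sketch of the peer-list check suggested above, the snippet below inspects a status payload shaped like the response of Alertmanager's `/api/v2/status` endpoint (a `cluster` object with a `peers` list). The sample payload, the expected peer count, and the `missing_peers` helper are assumptions for illustration, and the sketch assumes the peer list includes the local instance.

```python
import json


def missing_peers(status_json, expected_peers):
    """Return how many expected cluster peers are absent from the status payload."""
    status = json.loads(status_json)
    cluster = status.get("cluster") or {}
    peers = cluster.get("peers") or []
    return max(expected_peers - len(peers), 0)


# Hypothetical payload: a three-node cluster where this instance only sees itself.
sample = json.dumps(
    {
        "cluster": {
            "status": "ready",
            "peers": [{"name": "peer-1", "address": "10.0.0.1:9094"}],
        }
    }
)

# A non-zero result suggests that peers are not connecting.
print(missing_peers(sample, expected_peers=3))
```

In practice the payload would come from each instance's status endpoint, and the expected count from the deployment's replica count.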
