Auto-silent on insignificant alarms #346

Closed
stkonst opened this issue Mar 30, 2020 · 2 comments · Fixed by #373
Labels
automation database docs enhancement New feature or request frontend help wanted Extra attention is needed logging p/medium Medium priority

Comments


stkonst commented Mar 30, 2020

Hi guys,

We see quite a few alarms on Artemis where peers_seen <= 2 and AS_affected <= 3.

Those alarms are insignificant and we would never bother chasing them. Would it be possible to add a feature to Artemis so that such alarms are skipped, auto-ignored, or disappear automatically? I am looking forward to your feedback and can provide examples if needed.

Thanks,

Stavros
AMS-IX NOC

@stkonst stkonst changed the title Auto-silent low criticality of alarms Auto-silent on insignificant alarms Mar 30, 2020
@vkotronis vkotronis self-assigned this Apr 5, 2020
@vkotronis vkotronis added this to the release-1.4.1 milestone Apr 5, 2020
@vkotronis
Member

@stkonst This is a good idea!

However, let's discuss the logic a bit so that we can integrate this into the tool properly. According to the hijack states wiki page, a hijack alert goes through a set of defined states (i.e., a "life-cycle"). Do you mean that if the seen peers and/or the infected ASes stay below user-defined thresholds (e.g., 2 and 3 respectively) for a user-defined interval (e.g., 1 hour), the alert should enter a non-active state, e.g., dormant or insignificant? The user-provided parameters could be supplied via the .env file (like the rest of them). To implement this we would need some kind of cleaner robot program (we have done something similar for deleting old BGP updates and making hijacks dormant) that goes through the DB-stored alerts and re-characterizes them. Maybe we could use the "ignored" tag.

My problem with this is that if another BGP update related to the ignored (or auto-silenced) alert comes in we would have to generate a new alert and repeat the process.

Could you provide some examples here for the 3 parameters (seen peers, infected ASes and no-change interval), without sharing any private information if possible? It would be interesting to see how long after detection you keep receiving BGP updates, even though they point to the same (or a similar) small number of infected ASes and/or seen peers.

We could also continue this discussion on bgpartemis.slack.com for more details.
Thanks for reporting this; I think we could implement something like this, but it would be best to coordinate on potential test cases (and see a few practical examples) before attempting to alter the life-cycle of new alerts. We should also keep in mind that alerts follow a "ramp up - peak - ramp down" pattern; we should not auto-silence alerts that start slow but become critical afterwards (or generate more than one alert in that case).
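
To make the ramp-up concern concrete, here is a minimal Python sketch of one possible bookkeeping scheme. It is not Artemis code: `AlertState`, `on_bgp_update`, `on_clock_tick` and the state names are hypothetical, the thresholds are treated as inclusive (matching the `<= 2` / `<= 3` example in the request), and it assumes an auto-ignored alert can be reactivated in place instead of generating a second alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical per-alert bookkeeping; not an Artemis data structure."""
    peers_seen: int = 0
    ases_infected: int = 0
    quiet_since: Optional[float] = None  # time of the last update while insignificant
    state: str = "ongoing"


def is_insignificant(alert: AlertState, max_peers: int, max_ases: int) -> bool:
    # Assumption: both counters must be at or below their thresholds.
    return alert.peers_seen <= max_peers and alert.ases_infected <= max_ases


def on_bgp_update(alert: AlertState, peers_seen: int, ases_infected: int,
                  now: float, max_peers: int, max_ases: int) -> None:
    """Apply a new BGP update to the alert (counters never shrink here)."""
    alert.peers_seen = max(alert.peers_seen, peers_seen)
    alert.ases_infected = max(alert.ases_infected, ases_infected)
    if is_insignificant(alert, max_peers, max_ases):
        # Any update restarts the no-change interval.
        alert.quiet_since = now
    else:
        # The alert ramped up: stop the timer and reactivate it if it was silenced.
        alert.quiet_since = None
        alert.state = "ongoing"


def on_clock_tick(alert: AlertState, now: float, interval: float) -> None:
    """Silence an ongoing alert that stayed insignificant and unchanged for `interval`."""
    if (alert.state == "ongoing"
            and alert.quiet_since is not None
            and now - alert.quiet_since >= interval):
        alert.state = "ignored"
```

With this scheme a slow starter is silenced only after a full quiet interval, and a later update that pushes it over either threshold brings the same alert back instead of creating a duplicate.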


vkotronis commented May 4, 2020

Most viable solution, I think, given the current requirements:

  1. New .env variables:
    AUTO_IGNORE_NUM_ASES_INFECTED (threshold for number of infected ASes, default=0)
    AUTO_IGNORE_NUM_PEERS_SEEN (threshold for number of seen peers, default=0)
    AUTO_IGNORE_INTERVAL (when the thresholds will be verified and auto-ignore will happen)
  2. Implement something similar to what is done here; however, set the alerts to ignored and also clean up redis (see the ignore workflow in the DB module). If the redis cleanup cannot be implemented at the postgres entrypoint, we can use the clock/scheduler microservice and do this periodically (e.g., every minute).
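
As an illustration of item 1, the three variables could be read roughly as follows. This is only a sketch: the variable names come from the list above, but `AutoIgnoreConfig`, `load_auto_ignore_config`, the choice of seconds as the unit for `AUTO_IGNORE_INTERVAL`, its default of 0, and the reading of 0 as "feature disabled" are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AutoIgnoreConfig:
    """Hypothetical container for the proposed auto-ignore settings."""
    num_ases_infected: int
    num_peers_seen: int
    interval_seconds: int  # assumption: the interval is given in seconds

    @property
    def enabled(self) -> bool:
        # Assumption: all-zero thresholds or a zero interval switch the feature off.
        return self.interval_seconds > 0 and (
            self.num_ases_infected > 0 or self.num_peers_seen > 0
        )


def _non_negative_int(env: Mapping[str, str], name: str, default: int = 0) -> int:
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


def load_auto_ignore_config(env: Optional[Mapping[str, str]] = None) -> AutoIgnoreConfig:
    """Read the three proposed variables from the environment (or a given mapping)."""
    env = os.environ if env is None else env
    return AutoIgnoreConfig(
        num_ases_infected=_non_negative_int(env, "AUTO_IGNORE_NUM_ASES_INFECTED"),
        num_peers_seen=_non_negative_int(env, "AUTO_IGNORE_NUM_PEERS_SEEN"),
        interval_seconds=_non_negative_int(env, "AUTO_IGNORE_INTERVAL"),
    )
```

Passing a plain dict instead of `os.environ` keeps the parsing easy to test without touching the real environment.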

Workflow: set as ignored all alerts for which either the infected ASes or the seen peers have been below the respective thresholds for longer than the auto-ignore interval (clean up redis and update DB). I will draft this after I discuss with Stavros.

So changes are needed in the following places:
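
The selection step of this workflow might look like the sketch below. It works on plain dictionaries rather than the real DB, and the field names (`key`, `state`, `num_asns_inf`, `num_peers_seen`, `time_last`) are assumed for illustration. It also approximates "below the thresholds for longer than the interval" by the time elapsed since the alert's last update, and uses a strict comparison so that the default threshold of 0 never matches anything. The redis cleanup and the DB update are left out.

```python
from typing import Iterable, List, Mapping, Union

Alert = Mapping[str, Union[str, int, float]]


def select_alerts_to_ignore(alerts: Iterable[Alert],
                            ases_threshold: int,
                            peers_threshold: int,
                            interval: float,
                            now: float) -> List[str]:
    """Return the keys of ongoing alerts that qualify for auto-ignore."""
    keys = []
    for alert in alerts:
        if alert["state"] != "ongoing":
            continue  # already ignored, resolved, dormant, ...
        # "Either ... or", as in the workflow description; strict "<" means
        # a threshold of 0 disables that criterion.
        below = (alert["num_asns_inf"] < ases_threshold
                 or alert["num_peers_seen"] < peers_threshold)
        # Assumption: no update since `time_last` means the counts did not change.
        quiet_for = now - alert["time_last"]
        if below and quiet_for > interval:
            keys.append(alert["key"])
    return keys
```

A periodic job (for example one driven by the scheduler's clock signal) could call this function and then mark the returned keys as ignored.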

  • scheduler (send clock signal)
  • database (receive clock signal, clean up redis, update DB)
  • env (plus k8s)
  • wiki (to explain the new env vars and their utility)
