delay alerting/"autoignore" until thresholds met #426

smith8917 · 2020-08-17T16:22:08Z

We could make use of an "autoignore" feature that reverses the current logic. Instead of "alert-but-ignore-if-not-critical-after-ramp-up", we'd like to see an "ignore-first, alert-if-critical-after-ramp-up" option.

The implementation would look much like a traditional monitoring system's threshold setup, where the alerts (Slack, email, log, dashboard, etc.) are only triggered once the thresholds for "peers seen" and "ASes seen" have been reached. You could "and" or "or" the two thresholds (or even better, make that an option for the user) to decide when to alert.

This would also render the "interval" variable unnecessary, since you're not using a timer to determine whether ARTEMIS should alert. You might be able to make use of this variable though: by setting it to "0", a user could enable this feature.
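To make the request concrete, here is a minimal Python sketch of the threshold check described above. It is not ARTEMIS code: the class `DelayedAlertRule`, its field names, and the `combine` option are hypothetical and only illustrate the "and"/"or" choice and the idea of using an interval of 0 to select the threshold-driven behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayedAlertRule:
    """Hypothetical rule; field names are illustrative, not ARTEMIS config keys."""

    thres_peers_seen: int
    thres_ases_seen: int
    combine: str = "and"  # "and": both thresholds must be met; "or": either one
    interval: int = 0     # 0 selects the threshold-driven (no timer) behavior


def should_alert(rule: DelayedAlertRule, peers_seen: int, ases_seen: int) -> bool:
    """Return True once the observed counts satisfy the rule's thresholds."""
    if rule.interval != 0:
        # Timer-based autoignore is deliberately not modeled in this sketch.
        raise ValueError("timer-based rules are out of scope for this sketch")
    peers_ok = peers_seen >= rule.thres_peers_seen
    ases_ok = ases_seen >= rule.thres_ases_seen
    if rule.combine == "and":
        return peers_ok and ases_ok
    if rule.combine == "or":
        return peers_ok or ases_ok
    raise ValueError(f"unknown combine mode: {rule.combine!r}")
```

With `DelayedAlertRule(3, 2)`, an event seen by 5 peers but only 1 AS would stay silent, while the same counts under `combine="or"` would alert.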

slowr · 2020-08-23T18:19:05Z

Hey @smith8917, thanks for the feature request!

@vkotronis what do you think about this?

vkotronis · 2020-08-24T14:26:35Z

@smith8917 thanks for reporting this!
@slowr regarding implementation, ARTEMIS was initially designed/implemented (versions 1.X, X=0,1,...) according to the following objectives:

1. Detect a hijack as soon as possible (real time)
2. Upon hijack detection, alert the operator ==> Hijack event == Hijack alert.

Therefore, holding back a hijack alert will require some redesign of what ARTEMIS detection does. Namely, we currently conflate the hijack events we store in PG with alerts. We would need separate tables for the events and the alerts; an event would accumulate data (as now), and the alert would only be generated when the corresponding thresholds are met. So in that case we still target objective number 1, but we significantly modify objective number 2. We need to think about whether we should do this for the next 1.Y version or for 2.0 (I would also welcome your input on this). The more cumbersome parts are the frontend changes (at least as I see them) and the fact that the current hijacks table carries a lot of information that we need to move to the alerting table (this needs to be designed carefully). Assuming that this support is in place, I do not see any significant issues with implementing this logical change on the backend.

I think this feature request is valuable and results in better behavior for ARTEMIS; we also avoid the synchronous (periodic) timer checks and make alerting async based on the ignore criteria. In case you have cycles right now, please feel free to check it out; otherwise I will take a spin on it when I manage to get some holiday time out of my military service.
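As a rough illustration of the event/alert split (a sketch only, not the actual ARTEMIS data model), the Python below keeps accumulating data on an event and produces a separate alert record at most once, at the moment the thresholds are met. `HijackEvent`, `Alert`, `on_update`, and all field names are hypothetical, and what exactly counts as an "AS seen" is left as an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class HijackEvent:
    """Accumulates observations; hypothetical stand-in for a row in an events table."""

    key: str
    peers_seen: Set[int] = field(default_factory=set)
    ases_seen: Set[int] = field(default_factory=set)
    alerted: bool = False


@dataclass(frozen=True)
class Alert:
    """Hypothetical stand-in for a row in a separate alerts table."""

    event_key: str
    num_peers_seen: int
    num_ases_seen: int


def on_update(
    event: HijackEvent,
    peer_asn: int,
    seen_asns: Iterable[int],
    thres_peers: int,
    thres_ases: int,
) -> Optional[Alert]:
    """Record one observation; return an Alert only when thresholds are first met."""
    # The event always accumulates data, whether or not an alert exists yet.
    event.peers_seen.add(peer_asn)
    event.ases_seen.update(seen_asns)
    if event.alerted:
        return None
    # Evaluated on every update, so no periodic timer check is needed.
    if len(event.peers_seen) >= thres_peers and len(event.ases_seen) >= thres_ases:
        event.alerted = True
        return Alert(event.key, len(event.peers_seen), len(event.ases_seen))
    return None
```

Detection stays real time in this sketch (objective 1), while the alert is decoupled from the event and driven by incoming updates instead of a timer.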

vkotronis · 2020-08-24T14:35:32Z

Suggested workflow:

1. Decide what info should be on alerts and what on hijack events
2. Create a separate table for alerts and adjust the hijacks table
3. Migrate data from hijacks to alerts
4. Adjust the ARTEMIS backend to use the alerts table for anything related to notifications for the user
5. Adjust the ARTEMIS frontend (incl. hasura gql calls) to display alerts and events
6. Adjust the detection mechanism to alert only when the ignore criteria do not hold (directly by default, i.e., in real time)
7. Adjust all our tests to take into account this change (e2e testing + unit tests)
8. Deploy this on a user's installation for testing for some time (a few days)
9. Verify that the updated alerting mechanism works fine.
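The table split and data migration steps could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely as a self-contained stand-in for PG, and the schema is hypothetical and heavily simplified (the real hijacks table carries far more columns). The one assumption taken from the current design is that every existing hijack event was also an alert at detection time, so each existing row gets exactly one alert with the same timestamp.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create simplified, hypothetical hijacks and alerts tables."""
    conn.executescript(
        """
        CREATE TABLE hijacks (
            key            TEXT PRIMARY KEY,
            prefix         TEXT NOT NULL,
            num_peers_seen INTEGER NOT NULL,
            num_ases_seen  INTEGER NOT NULL,
            time_detected  TEXT NOT NULL
        );
        CREATE TABLE alerts (
            id           INTEGER PRIMARY KEY,
            hijack_key   TEXT NOT NULL UNIQUE REFERENCES hijacks(key),
            time_alerted TEXT NOT NULL
        );
        """
    )


def migrate_hijacks_to_alerts(conn: sqlite3.Connection) -> int:
    """Backfill one alert per existing hijack; return the total number of alerts.

    INSERT OR IGNORE plus the UNIQUE constraint on hijack_key makes the
    migration safe to re-run.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT OR IGNORE INTO alerts (hijack_key, time_alerted)
            SELECT key, time_detected FROM hijacks
            """
        )
    return conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0]
```

After the backfill, new alerts would only be inserted by the detection path once the thresholds are met, while hijack events keep being written as before.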

@slowr if you have more ideas on this feel free to extend/adjust this proposed workflow.

slowr added the enhancement New feature or request label Aug 23, 2020

vkotronis added alerting detection database automation backend frontend labels Aug 24, 2020

vkotronis mentioned this issue Aug 24, 2020 in System-wide "autoignore" #424 (Open)
