delay alerting/"autoignore" until thresholds met #426

Open
smith8917 opened this issue Aug 17, 2020 · 3 comments

Comments

@smith8917

We could make use of an "autoignore" feature that is implemented in a way that reverses the current logic. Instead of "alert-but-ignore-if-not-critical-after-ramp-up", we'd like to see an "ignore-first, alert-if-critical-after-ramp-up" option.

The implementation would look much like a traditional monitoring system's threshold setup, where the alerts (Slack, email, log, dashboard, etc.) are only triggered once the "peers seen" and "ASes seen" thresholds have been reached. You could "and" or "or" the two thresholds (or, even better, make that an option for the user) to decide when to alert.

This would also render the "interval" variable unnecessary, since you're not using a timer to determine whether ARTEMIS should alert. You might be able to make use of this variable though: by setting it to "0", a user could enable this feature.
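
To make the request concrete, here is a minimal sketch of the threshold check described above. All names (`AlertThresholds`, `min_peers_seen`, `min_ases_seen`, `mode`, `should_alert`) are hypothetical and are not existing ARTEMIS configuration keys or functions; the sketch only illustrates the "and"/"or" combination of the two thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    # Hypothetical settings, not real ARTEMIS configuration keys.
    min_peers_seen: int
    min_ases_seen: int
    mode: str = "and"  # "and": both thresholds required; "or": either one suffices


def should_alert(peers_seen: int, ases_seen: int, thresholds: AlertThresholds) -> bool:
    """Return True once the configured thresholds are met."""
    peers_ok = peers_seen >= thresholds.min_peers_seen
    ases_ok = ases_seen >= thresholds.min_ases_seen
    if thresholds.mode == "and":
        return peers_ok and ases_ok
    if thresholds.mode == "or":
        return peers_ok or ases_ok
    raise ValueError(f"unknown mode: {thresholds.mode!r}")
```

With this shape, no timer is involved: the check would simply be re-evaluated whenever the counters for a hijack change.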

@slowr added the "enhancement" (New feature or request) label on Aug 23, 2020
@slowr
Member

slowr commented Aug 23, 2020

Hey @smith8917, thanks for the feature request!

@vkotronis what do you think about this?

@vkotronis
Member

vkotronis commented Aug 24, 2020

@smith8917 thanks for reporting this!
@slowr regarding implementation, ARTEMIS was initially designed/implemented (versions 1.X, X=0,1,...) according to the following objectives:

  1. Detect a hijack as soon as possible (real time)
  2. Upon hijack detection, alert the operator ==> Hijack event == Hijack alert.

Therefore, holding back a hijack alert will require some redesign of what ARTEMIS detection is doing. Namely, we currently conflate the hijack events we store in PG with alerts. We would need separate tables for the events and the alerts; an event would accumulate data (as now), and the alert would only be generated when the corresponding thresholds are met. So in that case we still target objective number 1, but we significantly modify objective number 2. We need to think about (I would also welcome your input on this) whether we should do this for the next 1.Y version or for 2.0. What is more cumbersome are the frontend changes (at least as I see them) and the fact that the current hijacks table carries a lot of information that we need to move to the alerting table (this needs to be designed carefully). Assuming that this support is there, I do not see any significant issues on the backend in implementing this logical change.
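
A rough in-memory sketch of that split, assuming the event keeps accumulating peers and ASes while the alert is produced exactly once, at the moment the thresholds are first met. The class and function names (`HijackEvent`, `HijackAlert`, `record_update`) and the fields are hypothetical, chosen for illustration; they do not reflect the actual ARTEMIS data model.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class HijackEvent:
    # Hypothetical event record: keeps accumulating data for its whole lifetime.
    prefix: str
    peers_seen: Set[str] = field(default_factory=set)
    ases_seen: Set[int] = field(default_factory=set)
    alerted: bool = False


@dataclass(frozen=True)
class HijackAlert:
    # Hypothetical alert record: created only once per event.
    prefix: str
    num_peers_seen: int
    num_ases_seen: int


def record_update(
    event: HijackEvent,
    peer: str,
    asns: Iterable[int],
    min_peers: int,
    min_ases: int,
) -> Optional[HijackAlert]:
    """Add one observation to the event; return an alert only on first threshold crossing."""
    event.peers_seen.add(peer)
    event.ases_seen.update(asns)
    if event.alerted:
        return None  # the event keeps accumulating, but no duplicate alert
    if len(event.peers_seen) >= min_peers and len(event.ases_seen) >= min_ases:
        event.alerted = True
        return HijackAlert(event.prefix, len(event.peers_seen), len(event.ases_seen))
    return None
```

The alert decision here is driven by incoming updates rather than by a periodic check, which is the async behavior this change is after.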

I think this feature request is valuable and results in better behavior for ARTEMIS; we also avoid the synchronous (periodic) timer checks and make alerting async, based on the ignore criteria. In case you have cycles right now, please feel free to check it out; otherwise I will take a spin on it when I manage to get some holiday time out of my military service.

@vkotronis
Member

Suggested workflow:

  1. Decide what info should be on alerts and what on hijack events
  2. Create separate table for alerts and adjust hijacks table
  3. Migrate data from hijacks to alerts
  4. Adjust the ARTEMIS backend to use the alerts table for anything related to notifications for the user
  5. Adjust the ARTEMIS frontend (inc. hasura gql calls) to display alerts and events
  6. Adjust the detection mechanism to alert only when the ignore criteria do not hold (directly by default, i.e., in real time)
  7. Adjust all our tests to take into account this change (e2e testing + unit tests)
  8. Deploy this on a user's installation for testing for some time (a few days)
  9. Verify that the updated alerting mechanism works fine.
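
As a starting point for steps 2 and 3, here is a small sketch that uses an in-memory SQLite database as a stand-in for PG. The table and column names (`hijacks.key`, `hijacks.time_detected`, `alerts.hijack_key`, `alerts.time_alerted`) and the function name are assumptions made up for illustration, not the real ARTEMIS schema. Since every existing hijack was alerted at detection time (event == alert so far), the migration creates one alert row per existing hijack.

```python
import sqlite3


def split_alerts_from_hijacks(conn: sqlite3.Connection) -> int:
    """Create a hypothetical alerts table and backfill it from hijacks.

    Returns the number of rows in the alerts table afterwards.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS alerts (
            id INTEGER PRIMARY KEY,
            hijack_key TEXT NOT NULL UNIQUE REFERENCES hijacks(key),
            time_alerted TEXT NOT NULL
        )
        """
    )
    # Existing hijacks were alerted as soon as they were detected,
    # so their alert time is taken to be the detection time.
    conn.execute(
        """
        INSERT OR IGNORE INTO alerts (hijack_key, time_alerted)
        SELECT key, time_detected FROM hijacks
        """
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0]
```

The `UNIQUE` constraint together with `INSERT OR IGNORE` makes the backfill safe to re-run; the equivalent in PostgreSQL would need its own syntax (for example `ON CONFLICT DO NOTHING`).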

@slowr if you have more ideas on this feel free to extend/adjust this proposed workflow.


3 participants