[FEA] Digital Fingerprinting Pipeline Updates for new DS Features #208
Closed
Description
- An anomaly score is calculated every time a new log becomes available.
- The derived features are updated every time a new log becomes available.
- We will train a model for each user on all of that user's data. We will then run inference on the training data only, and store the mean and the standard deviation of the reconstruction loss for that user. These stored values will be used in inference: every time there is a new reconstruction loss, we calculate a z-score with the stored mean and std, and that z-score is the anomaly metric we use for each log. Some of the features will be used as they are; others will be derived from the logs or from other columns.
- Features will be scaled with the standard scaler.
- For each user, logs are processed in the order they were generated, so that the derived features reflect the correct history.
- If the inference has to be run in batches (instead of real-time streaming), we need to make sure the derived features have already been updated for each user for every log before they are used.
- Users should be able to use two thresholds for DUO and two thresholds for Azure logs.
- When the lower threshold** is exceeded (potentially due to non-actionable true positives), the SOC team or analysts can choose to trigger retraining for that user automatically.
- When the higher threshold*** is exceeded, the SOC team can choose to investigate for potential actionable true positives.
- As an alternative, the users can monitor the top K**** anomalies in a rolling window, such as the past hour or the past 24 hours*****.
- For new joiners, a generic model should be trained for that user's manager or org, and until the user accumulates a number of logs******, they keep using the generic/aggregate model.
- Granularity should be customisable.
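A minimal sketch of the scoring and triage flow described above, using only the Python standard library. All names here (`UserBaseline`, `fit_baseline`, `anomaly_score`, `triage`, `top_k`) are hypothetical and not part of the pipeline. The sketch also makes assumptions the description does not settle: it uses the population standard deviation, treats "exceeded" as strictly greater than, and takes the thresholds, K, and window length as plain parameters.

```python
import heapq
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class UserBaseline:
    """Per-user statistics of the reconstruction loss on the training data."""
    loss_mean: float
    loss_std: float


def fit_baseline(training_losses):
    # Assumption: population std; the issue does not say which estimator to use.
    return UserBaseline(mean(training_losses), pstdev(training_losses))


def anomaly_score(loss, baseline, eps=1e-9):
    # z-score of a new reconstruction loss against the stored mean and std.
    # eps guards against a zero std (e.g. a user with identical training losses).
    return (loss - baseline.loss_mean) / max(baseline.loss_std, eps)


def triage(score, lower, upper):
    # Two thresholds per log source: the higher one flags a log for
    # investigation, the lower one marks the user as a retraining candidate.
    if score > upper:
        return "investigate"
    if score > lower:
        return "retrain_candidate"
    return "normal"


def top_k(scored_logs, now, window_seconds, k):
    # scored_logs: iterable of (timestamp_seconds, score) pairs.
    # Returns the k highest-scoring logs inside the rolling window,
    # as (score, timestamp) pairs sorted from highest to lowest score.
    recent = [(score, ts) for ts, score in scored_logs
              if now - window_seconds <= ts <= now]
    return heapq.nlargest(k, recent)


# Example usage with made-up numbers.
baseline = fit_baseline([1.0, 3.0])
score = anomaly_score(5.0, baseline)
label = triage(score, lower=2.0, upper=4.0)
```

In a real deployment there would be one stored `UserBaseline` per user (or per manager/org for the generic model), and a separate pair of thresholds for DUO and for Azure logs.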
Status
Done