This is the code repository for the paper 'AlertBERT: A noise-robust alert grouping framework for simultaneous attacks' (publication pending).
AlertBERT is a state-of-the-art self-supervised alert grouping method that uses masked language models to obtain alert embeddings and agglomerative clustering to group alerts under high levels of noise and simultaneous cyber attacks. For further details on the AlertBERT framework, we refer to our paper (publication pending).
To recreate the experiments reported in the paper, first run the module `alertbert.train_mlm` to train a masked language model. The only parameters to adjust in the `MaskedLangModelParams` passed to the main function are the number of attention heads and the AIT-ADS-A configuration to use. After the training of the masked language model has completed, run the module `alertbert.eval_grouping` to compute the corresponding ROC curves by providing the `model_id` to the `model_config_generator` function.
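The following is a minimal sketch of this train-and-evaluate workflow. Apart from `MaskedLangModelParams` and `model_config_generator`, which are named above, all names (entry-point functions, field names, return values, and the configuration identifier) are assumptions; please consult the module docstrings for the actual interfaces.

```python
# Hypothetical sketch of the workflow described above. Only
# MaskedLangModelParams and model_config_generator are named in this README;
# the main functions, field names, and return values below are assumptions.
from alertbert.train_mlm import MaskedLangModelParams, main as train_main
from alertbert.eval_grouping import model_config_generator, main as eval_main

# Adjust only the number of attention heads and the AIT-ADS-A configuration.
params = MaskedLangModelParams(
    num_attention_heads=4,          # assumed field name
    aitads_config="simul-attacks",  # assumed field name and configuration id
)
model_id = train_main(params)       # assumed: training returns a model_id

# Compute the ROC curves for the trained model.
eval_main(model_config_generator(model_id))
```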
To view the results, please refer to the plots of ROC curves provided in the notebooks in the `roc_results` directory. These results should then look something like the following ROC plot obtained on the simul-attacks configuration of AIT-ADS-A.
The individual modules in alertbert have the following purposes:
- `alertbert.aitads` provides the dataset,
- `alertbert.eval_grouping` implements the evaluation of alert groupings,
- `alertbert.eval_mlm` implements the evaluation of the training of masked language models,
- `alertbert.model_eval_utils` provides evaluation utilities,
- `alertbert.models` implements the alert grouping models,
- `alertbert.preprocessing` implements vocabularies and tokenisation of the data,
- `alertbert.train_mlm` performs the training of masked language models, and
- `alertbert.utils` provides general utilities.
To run the experiments, please follow these steps:
To set up the Python environment, run `pip install -r requirements.txt`.
Furthermore, it is necessary to install the graph-tool library, which is not available through pip.
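As a quick, hedged sanity check that the environment is complete, you can verify that graph-tool imports correctly (its Python package is named `graph_tool`):

```python
# Sanity check (sketch): confirm that graph-tool is installed and importable.
# The Python package of the graph-tool library is named "graph_tool".
import graph_tool

print("graph-tool version:", graph_tool.__version__)
```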
For our experiments we use the AIT-ADS-A dataset, which is an augmented version of the AIT Alert Dataset.
The files to construct this dataset are part of this repository and should be present in the `aitads_augmented/data` directory.
If this is the case, the setup is complete and the remaining steps are not necessary!
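Optionally, here is a quick check that the dataset files are in place (a minimal sketch; the exact file names inside `aitads_augmented/data` are not specified here, so it only lists the directory contents):

```python
# Hedged check: list the contents of aitads_augmented/data so you can confirm
# the AIT-ADS-A files are present before running the experiments.
from pathlib import Path

data_dir = Path("aitads_augmented/data")
files = sorted(p.name for p in data_dir.iterdir()) if data_dir.is_dir() else []
print(f"{len(files)} files found in {data_dir}: {files}")
```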
To build the dataset from source, follow these steps:
- Download and unzip the three datasets into their respective directories listed below. After this step, the `alerts_json` directory should contain the files `scenario_aminer.json` and `scenario_wazuh.json` for each of the eight scenarios of AIT-ADS.
  - AIT Alert Dataset ➡️ `alerts_json`
  - AIT Log Dataset V2.0 ➡️ `aitldsv2`
  - AIT Netflow Dataset ➡️ `aitnds`
- Run `preprocess.py`. This script will read the information of the three datasets, use it to assign the labels to the alerts in AIT-ADS, and save the labels to the files `alerts_csv/scenario_alerts.csv` for each scenario.

  At this point, for each scenario we have the following situation: the files `alerts_json/scenario_wazuh.json` and `alerts_json/scenario_aminer.json` contain the alert data sorted by timestamp, but separately for the two IDSs, while the files `alerts_csv/scenario_alerts.csv` contain the labels for the alerts, ordered as if they belonged to a concatenation of `alerts_json/scenario_wazuh.json` and `alerts_json/scenario_aminer.json`. Thus, the next step:
- Run `unite_alerts_labels.py`. This script will simplify the situation described above by combining all the alerts and their labels, sorted by timestamp, into the files `alerts_json/scenario.json` for each scenario. Additionally, the script will create the files `alerts_json/scenario_light.json`, which have the same contents except for the raw alert data and can be used to speed up loading the data if the raw alerts are not required (see the loading sketch after this list).
- Finally, to create the files necessary for building AIT-ADS-A, run `build_augment_files.py`. For further information regarding the setup of AIT-ADS-A, please refer to the README file in the `aitads_augmented` directory.
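To illustrate the result of these steps, the following is a minimal loading sketch. It assumes that the united files are plain JSON arrays and that `scenario` stands for one of the eight AIT-ADS scenario names (here `fox` is used as an assumed example); adjust the path and name to your local setup.

```python
# Hedged sketch: load the united, timestamp-sorted alerts of one scenario.
# Assumptions: the files are standard JSON arrays and "fox" is a valid
# scenario name; both may differ in the actual dataset.
import json
from pathlib import Path

scenario = "fox"  # assumed scenario name
light_file = Path("alerts_json") / f"{scenario}_light.json"

with light_file.open() as f:
    alerts = json.load(f)

print(f"Loaded {len(alerts)} alerts (without raw data) for scenario '{scenario}'")
```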
- Modules `alertbert/module.py` have to be run via `python -m alertbert.module`.
- The documentation of the code can be found in the respective docstrings of functions, classes, and modules.
