Asset Clustering using Windows Event Logs

Use Case

Cluster assets into various groups based on Windows Event Logs data.

Version

1.0

Model Overview

The model is a clustering algorithm that assigns each host in the dataset to a cluster, based on features aggregated and derived from that host's Windows Event Logs.

Model Architecture

There are two clustering algorithms available:

  • DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise.
  • KMeans

Input features to the model are derived from the Windows Event Logs: various facets of logon events, such as the type of logon event and the number of usernames associated with a host, are aggregated per host (see the sketch below).
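
The per-host aggregation can be sketched roughly as follows. This is a minimal illustration using pandas, with hypothetical column names; the project's own preprocessing derives a richer feature set from the raw wls records.

import pandas as pd

# Hypothetical flattened event records; column names are illustrative only.
events = pd.DataFrame({
    "LogHost":   ["Comp1", "Comp1", "Comp2", "Comp2", "Comp2"],
    "EventID":   [4624, 4625, 4624, 4624, 4634],
    "LogonType": [2, 3, 10, 2, 2],
    "UserName":  ["alice", "alice", "bob", "carol", "bob"],
})

# Aggregate per host: event volume, distinct usernames, and logon-type counts.
per_host = events.groupby("LogHost").agg(
    total_events=("EventID", "size"),
    uniq_usernames=("UserName", "nunique"),
)
logon_types = pd.crosstab(events["LogHost"], events["LogonType"]).add_prefix("logon_type_")
features = per_host.join(logon_types)
print(features)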

Requirements

A Rapids-based environment is required to run the provided scripts and Python notebook. On top of that, the additional requirements can be installed into the environment via the supplied requirements file:

pip install -r requirements.txt

Training

Training data

In this project we use the publicly available Unified Host and Network Data Set [1] from the Advanced Research team in Cyber Systems of the Los Alamos National Laboratory (LANL) to demonstrate various aspects of clustering assets in a given network. The LANL dataset consists of netflow and Windows Event Log (wls) files covering 90 days. For this project we focus solely on the Windows Event Log files, which use the naming convention wls_day-01.bz2, wls_day-02.bz2, ..., wls_day-90.bz2. The training data uses the first ten days, i.e. wls_day-01.bz2, ..., wls_day-10.bz2. Note that, for purposes of scale and quick reproducibility, we experiment with only the first ten days of data; more data can easily be used by changing the input file suffix. Refer to experiment.ipynb for more details. These ten days of data are pre-processed and the features are aggregated. The resulting dataset contains 14044 hosts and is saved in datasets/host_agg_data_day-01_day-10.csv.
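
As a quick sanity check, the aggregated dataset can be loaded and inspected. A minimal sketch assuming pandas; in a Rapids environment, cudf.read_csv offers the same interface.

import pandas as pd

# Load the aggregated per-host features produced by preprocessing.
df = pd.read_csv("datasets/host_agg_data_day-01_day-10.csv")
print(df.shape)              # expected: (14044, n_features) -- one row per host
print(df.columns.tolist())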

Training parameters

The following parameters are chosen in training for the DBSCAN algorithm:

  • $\epsilon=0.0005$
  • Manhattan distance as the metric, i.e. Minkowski distance with $p=1$ (see the sketch below).
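
These parameters map directly onto a standard DBSCAN call. The sketch below uses scikit-learn's API for illustration (cuml.DBSCAN on Rapids is the drop-in analogue); min_samples is left at its library default here and may differ from the project's choice.

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in feature matrix; in the project this is the aggregated per-host data.
rng = np.random.default_rng(0)
X = rng.random((100, 8)) * 0.001

# Manhattan distance is Minkowski distance with p=1.
model = DBSCAN(eps=0.0005, metric="minkowski", p=1)
labels = model.fit_predict(X)
print(np.unique(labels))  # label -1 marks noise samples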

Model accuracy

Clusters found = 9 (+1 cluster for the noisy samples)
Silhouette score = 0.975
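
The silhouette score can be computed with the standard scikit-learn metric. A sketch on synthetic data; excluding noise points (label -1) before scoring is one common convention, shown here as an assumption rather than the project's exact procedure.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Three tight synthetic blobs stand in for the real per-host features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.01, size=(50, 4)) for c in (0.0, 1.0, 2.0)])

labels = DBSCAN(eps=0.1, metric="minkowski", p=1).fit_predict(X)

# Score only the clustered (non-noise) points.
mask = labels != -1
print(f"Silhouette score: {silhouette_score(X[mask], labels[mask]):.3f}")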

Training script

To train the model, run the following script from the working directory:

cd ${MORPHEUS_EXPERIMENTAL_ROOT}/asset-clustering/training-tuning-inference
# Run training script and save models
python train.py --model dbscan

This saves the trained model files under the ../models directory, from which the inference script can later load them.
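
The exact serialization is defined by train.py itself; a typical save/load pattern with pickle, shown purely as an assumption (the file name dbscan.pkl is hypothetical), looks like this:

import os
import pickle

from sklearn.cluster import DBSCAN

os.makedirs("../models", exist_ok=True)
model = DBSCAN(eps=0.0005, metric="minkowski", p=1)

# Training side: persist the model (fitted, in the real script).
with open("../models/dbscan.pkl", "wb") as f:
    pickle.dump(model, f)

# Inference side: load it back for clustering new data.
with open("../models/dbscan.pkl", "rb") as f:
    model = pickle.load(f)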

Inference Input

python inference.py --model dbscan

When the above command is executed, DBSCAN clustering is performed on the Windows Event Log data from days 11 to 15. This data is pre-processed and aggregated into a validation dataset, which can be found at datasets/host_agg_data_day-11_day-15.csv and contains a total of 12606 hosts. Inference can similarly be run using the KMeans clustering model:

python inference.py --model kmeans

Inference Output

Clustering of the 12606 hosts is performed, and the size of each cluster is printed to stdout.
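
The per-cluster counts amount to a simple tally over the label array. A sketch, assuming labels shaped like the output of fit_predict:

import numpy as np

# Example label array; -1 marks DBSCAN noise samples.
labels = np.array([0, 0, 1, 1, 1, -1, 2])

clusters, counts = np.unique(labels, return_counts=True)
for c, n in zip(clusters, counts):
    print(f"cluster {c}: {n} hosts")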

Ethical considerations

N/A

References

[1] M. Turcotte, A. Kent, and C. Hash, "Unified Host and Network Data Set," in Data Science for Cyber-Security, November 2018, pp. 1-22.