The model uses a graph clustering approach [1] that assigns each host in the dataset to a cluster based on:
- Aggregated and derived features from the sFlow logs of that host
- The host's connectivity to adjacent assets in the graph representation (derived from sFlow logs)
[1] H. Zhang, P. Li, R. Zhang, and X. Li, "Embedding Graph Auto-Encoder for Graph Clustering," IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2022.3158654.
The model architecture was proposed in the EGAE paper [1]. The inputs to EGAE consist of two parts: a graph and node features. The encoder module maps the data into a latent feature space. There are two decoder modules:
- Decoder for clustering: relaxed k-means is embedded into the GAE to induce it to generate preferable embeddings.
- Decoder for graph: minimizes the reconstruction error.
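The encoder and the two decoder objectives can be illustrated with a small NumPy sketch. This is not the released model's implementation: the layer sizes, the random weights, the sigmoid inner-product graph decoder with a cross-entropy loss, and all function names are assumptions chosen for illustration. The clustering term uses the standard spectral relaxation of k-means, where the relaxed indicator matrix F is taken as the top-k left singular vectors of the embedding Z.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as commonly used by GCN layers."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(adj_norm, features, w1, w2):
    """Two graph-convolution layers mapping node features to latent embeddings Z."""
    hidden = np.maximum(adj_norm @ features @ w1, 0.0)  # ReLU
    return adj_norm @ hidden @ w2

def reconstruction_loss(adj, z):
    """Graph decoder (assumed form): cross-entropy between A and sigmoid(Z Z^T)."""
    probs = 1.0 / (1.0 + np.exp(-(z @ z.T)))
    eps = 1e-9
    return float(-np.mean(adj * np.log(probs + eps) + (1 - adj) * np.log(1 - probs + eps)))

def relaxed_kmeans_loss(z, n_clusters):
    """Relaxed k-means: tr(Z^T Z) - tr(F^T Z Z^T F) subject to F^T F = I."""
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    f = u[:, :n_clusters]                        # relaxed cluster indicators
    loss = np.trace(z.T @ z) - np.trace(f.T @ z @ z.T @ f)
    return float(loss), f

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by one edge. Hypothetical data.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 5))               # 5 made-up features per node
w1 = rng.normal(scale=0.1, size=(5, 8))          # untrained weights, illustration only
w2 = rng.normal(scale=0.1, size=(8, 4))

z = encode(normalize_adjacency(adj), features, w1, w2)
recon = reconstruction_loss(adj, z)
cluster, f = relaxed_kmeans_loss(z, n_clusters=2)
print(z.shape, recon, cluster)
```

In training, the two losses would be combined (for example as a weighted sum) and minimized with respect to the encoder weights. The weighting and optimizer are left out here because the source does not specify them.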
Architecture Type:
- Graph Neural Network
Network Architecture:
- Graph autoencoder with 2 layers
- The input is sFlow data from ~3000 devices, together with Armis device and application data
Input Parameters:
- None
Input Format:
- CSV format
Other Properties Related to Output:
- None
- Clustering information and cluster membership
Output Parameters:
- None
Output Format:
- CSV
Runtime(s):
- cupy
Supported Hardware Platform(s):
- Ampere/Turing
Supported Operating System(s):
- Linux
1.0
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- The dataset uses sFlow data to build a graph representation in which each node is an asset. Because sFlow data is directional, the 'source' of each flow is used as the target asset. The feature matrix for each asset is created from derived and aggregated features of the sFlow and Armis data. The adjacency matrix is derived from the graph representation of the devices in the sFlow data. Each row in the resulting dataset is an asset and can be uniquely identified by its MAC address. All information in the sFlow data is obfuscated to remove any private information.
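As a rough illustration of the construction described above, the sketch below derives an adjacency matrix and a per-asset feature matrix from a handful of flow records. The field names (`src_mac`, `dst_mac`, `bytes`), the three aggregated features, and the choice to treat edges as undirected are all assumptions made for this example; the actual feature engineering and the Armis enrichment are not specified in this document.

```python
from collections import defaultdict

# Hypothetical, already-obfuscated flow records.
flows = [
    {"src_mac": "mac-a", "dst_mac": "mac-b", "bytes": 100},
    {"src_mac": "mac-a", "dst_mac": "mac-c", "bytes": 300},
    {"src_mac": "mac-b", "dst_mac": "mac-a", "bytes": 50},
    {"src_mac": "mac-c", "dst_mac": "mac-a", "bytes": 20},
    {"src_mac": "mac-a", "dst_mac": "mac-b", "bytes": 200},
]

def build_graph_inputs(flows):
    """Return (assets, adjacency, features) with one row per asset."""
    assets = sorted({f["src_mac"] for f in flows} | {f["dst_mac"] for f in flows})
    index = {mac: i for i, mac in enumerate(assets)}
    n = len(assets)

    adjacency = [[0] * n for _ in range(n)]
    flow_count = defaultdict(int)
    total_bytes = defaultdict(int)
    peers = defaultdict(set)

    for f in flows:
        src, dst = f["src_mac"], f["dst_mac"]
        if src != dst:
            # Assumption: connectivity is treated as undirected.
            adjacency[index[src]][index[dst]] = 1
            adjacency[index[dst]][index[src]] = 1
        # Features are aggregated on the source side of each flow.
        flow_count[src] += 1
        total_bytes[src] += f["bytes"]
        peers[src].add(dst)

    features = [[flow_count[m], total_bytes[m], len(peers[m])] for m in assets]
    return assets, adjacency, features

assets, adjacency, features = build_graph_inputs(flows)
for mac, row in zip(assets, features):
    print(mac, row)
```

Each row of `features` lines up with the same row of `adjacency`, so the asset identifier (here a placeholder for the MAC address) indexes both matrices.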
Dataset License:
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- A subset of the simulated and obfuscated sFlow data
Dataset License:
Engine:
- PyTorch
Test Hardware:
- Other (Not Listed)
- Not Applicable
- Not Applicable
- Not Applicable
- English (100%)
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
- The model is primarily designed for testing purposes and serves as a small pretrained model used to evaluate and validate an asset clustering application built on a GNN.
- This model is intended for developers who want to build an asset clustering application using a GNN.
- The intended beneficiaries of this model are developers who aim to test the performance and functionality of the asset clustering application pipeline using sFlow datasets.
- This model outputs the cluster membership of devices based on their sFlow activity.
- The model architecture was proposed in the EGAE paper [1]. The inputs to EGAE consist of two parts: a graph and node features. The encoder module maps the data into a latent feature space. There are two decoder modules:
- Decoder for clustering: relaxed k-means is embedded into the GAE to induce it to generate preferable embeddings.
- Decoder for graph: minimizes the reconstruction error.
Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of:
- Not Applicable
- This model requires feature-engineered sFlow activity data along with Armis device enrichment.
- Silhouette plot and score
- Not Applicable
- None
- No
- None
- No
- Typically used to cluster Armis devices in a network based on their sFlow activity.
- The model is trained on data that follows the sFlow dataset schema, so it might not be suitable for other applications.
- No
- Not Applicable
- Not Applicable
- Not Applicable
- No
- No
- No
- Neither
- Not applicable. The synthetic data used in this model is generated using the Faker Python package. The device information field is generated by Faker, which pulls items from its own dataset of fictitious values (located in the linked repo). There are no privacy concerns or PII involved in this synthetic data generation process.
Protected classes used to create this model? (The following were used in the model's training:)
- Not applicable
- The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes.
- No (as the dataset is fully synthetic)
- Not Applicable (no PII collected)
- No
- No
- Yes at (Dataset)
- Not applicable
Is data compliant with data subject requests for data correction or removal, if such a request was made?
- Not applicable