Azure Active Directory (Azure AD) is an identity and access management service that helps users access external and internal resources such as Office 365 and SaaS applications. The sign-in logs in Azure AD identify who the user is, which application was used for access, and the target accessed by the identity. At a given time 𝑡, a service 𝑠 is requested by user 𝑢 from device 𝑑 using authentication mechanism 𝑎, and the request is either allowed or blocked. For a detailed explanation, refer to the Microsoft Azure sign-in documentation linked below.
Related work on anomalous authentication detection includes applying black-box ML models to handcrafted features extracted from authentication logs, as well as rule-based models. This workflow follows the success of heterogeneous GNN embeddings in cyber applications such as fraud detection [2,5] and cyber-attack detection on provenance data [3]. Unlike earlier models, this work uses a heterogeneous graph to model authentication events and a relational GNN embedding to capture relations among different entities. This lets the model exploit relations among users and services while avoiding a separate feature-engineering phase. As a result, the model learns from both the structural identity and the unique feature identity of individual users.
- https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-sign-ins
- Liu, Ziqi, et al. "Heterogeneous Graph Neural Networks for Malicious Account Detection." CIKM 2018. https://doi.org/10.1145/3269206.3272010
- Lv, Mingqi, et al. “A Heterogeneous Graph Learning Model for Cyber-Attack Detection.” arXiv [cs.CR], 16 Dec. 2021, http://arxiv.org/abs/2112.08986. arXiv.
- Schlichtkrull, Michael, et al. "Modeling relational data with graph convolutional networks." European semantic web conference. Springer, Cham, 2018 https://arxiv.org/abs/1703.06103
- Rao, Susie Xi, et al. "xFraud: Explainable Fraud Transaction Detection." Proceedings of the VLDB Endowment 15.3 (2021). https://www.vldb.org/pvldb/vol15/p427-rao.pdf
- Powell, Brian A. "Detecting malicious logins as graph anomalies." Journal of Information Security and Applications 54 (2020): 102557
The model uses a heterogeneous graph representation as input to an RGCN. Since the input graph is heterogeneous, an embedding for the target node type "authentication" is used to train the RGCN classifier. The model is trained as a binary classifier whose task is to output "success" or "failure" for each authentication embedding.
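The relational message passing that an RGCN layer performs can be sketched in plain NumPy. This is a minimal illustration of the Schlichtkrull et al. [4] update rule, not the actual DGL/PyTorch implementation; the relation names, sizes, and weights below are illustrative, and node-type heterogeneity is flattened for brevity.

```python
import numpy as np

def rgcn_layer(h, adj_by_rel, W_rel, W_self):
    """One RGCN layer: for each node, sum degree-normalized messages
    from neighbors under every relation, add a self-loop transform,
    then apply ReLU."""
    out = h @ W_self  # self-loop term
    for rel, A in adj_by_rel.items():
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
        out += (A / deg) @ (h @ W_rel[rel])
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 4, 8  # 8-dim output embedding, as in this card
h = rng.normal(size=(n, d_in))
rels = ["requested_by", "requested_from"]  # illustrative relation names
adj_by_rel = {r: rng.integers(0, 2, size=(n, n)).astype(float) for r in rels}
W_rel = {r: rng.normal(size=(d_in, d_out)) for r in rels}
W_self = rng.normal(size=(d_in, d_out))

emb = rgcn_layer(h, adj_by_rel, W_rel, W_self)
print(emb.shape)  # (5, 8)
```

Stacking two such layers (as this model does) lets each authentication embedding aggregate information from neighbors two hops away, e.g. other authentications sharing the same user or device.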
Architecture Type:
- Graph Neural Network
Network Architecture:
- 2-layer RGCN with an 8-dimensional output embedding
Input:
- Authentication data with nodes including user, authentication, device, and service.
Input Parameters:
- None
Input Format:
- JSON format
Other Properties Related to Input:
- None
Output:
- An anomaly score for each authentication, representing the probability that the authentication is anomalous. A threshold (e.g., 0.49) can be applied to the score to label an authentication as "benign" or "fraudulent".
Output Parameters:
- None
Output Format:
- CSV (scores & authenticationId)
Other Properties Related to Output:
- None
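The CSV output of scores and authentication IDs can be produced with the standard library alone. The 0.49 threshold follows this card; the column and variable names below are illustrative, not a fixed schema.

```python
import csv
import io

def write_scores(rows, threshold=0.49):
    """Write (authenticationId, anomaly score) pairs to CSV, attaching
    a label obtained by thresholding the score."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["authenticationId", "score", "label"])
    for auth_id, score in rows:
        label = "fraudulent" if score > threshold else "benign"
        writer.writerow([auth_id, f"{score:.4f}", label])
    return buf.getvalue()

csv_text = write_scores([("auth-001", 0.12), ("auth-002", 0.87)])
print(csv_text)
```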
Runtime(s):
- PyTorch
- DGL
Supported Hardware Platform(s):
- Ampere/Turing
Supported Operating System(s):
- Linux
1.0
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- The training data consists of 1,992 authentication events, each with a label indicating either failure or success. The dataset is simulated to resemble Azure AD sign-in events.
Dataset License:
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- The evaluation data consists of 235 authentication events, each with a label indicating either failure or success.
Dataset License:
Engine:
- PyTorch
Test Hardware:
- Other (Not Listed)
- Not Applicable
- Not Applicable
- Not Applicable
- English: 100%
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
Individuals from the following adversely impacted (protected classes) groups participate in model design and testing.
- Not Applicable
- Not Applicable
- The model is primarily designed for testing purposes and serves as a small pretrained model specifically used to evaluate and validate the RGCN model. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing.
- This model is intended for developers that want to build and/or customize Relational graph neural network (RGCN) for authentication detection.
- The intended beneficiaries of this model are developers who aim to test the performance and functionality of the RGCN pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world Azure-log analysis.
- This model outputs an anomaly score for each authentication, representing the probability that the authentication is anomalous. A threshold (e.g., 0.49) can be applied to the score to label an authentication as "benign" or "fraudulent".
- An Azure AD sign-in dataset is used for modeling; it includes four node types: authentication, user, device, and service application. This model demonstrates an application of graph neural networks to anomalous authentication detection in Azure AD sign-ins using a heterogeneous graph. A relational graph convolutional network (RGCN) is used to identify anomalous authentications.
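The four node types above can be wired together as per-relation edge lists keyed by (source type, relation, destination type) triples, which mirrors how a DGL heterograph is declared. This is a stdlib-only sketch; the relation names and event fields are illustrative, not the pipeline's actual schema.

```python
from collections import defaultdict

def build_auth_graph(events):
    """Group sign-in events into per-relation edge lists keyed by
    canonical (src type, relation, dst type) triples, with the
    authentication node linking user, device, and service."""
    edges = defaultdict(list)
    for ev in events:
        a = ev["authenticationId"]
        edges[("user", "initiates", "authentication")].append((ev["user"], a))
        edges[("device", "used_in", "authentication")].append((ev["device"], a))
        edges[("authentication", "targets", "service")].append((a, ev["service"]))
    return dict(edges)

events = [
    {"authenticationId": "a1", "user": "u1", "device": "d1", "service": "Office365"},
    {"authenticationId": "a2", "user": "u1", "device": "d2", "service": "Office365"},
]
g = build_auth_graph(events)
print(len(g))  # 3 relation types
```

Because each authentication is its own node, the RGCN can classify authentications directly while still aggregating context from the user, device, and service they connect.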
Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of:
- Not Applicable
- This model version is trained on a simulated Azure AD sign-on log schema, with the entities (user, service, device, authentication) and "statsFlag" as required fields. Data lacking the required features, or requiring a different feature set, may not be compatible with the model.
- The model is evaluated using the area under the ROC curve (AUC) and accuracy on authentications.
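Both metrics named above can be sketched with the standard library: AUC via the Mann-Whitney rank statistic (the probability that a random positive outscores a random negative), and accuracy by thresholding scores. The labels and scores below are made-up illustration data.

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.49):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))   # 0.75
print(accuracy(labels, scores))  # 0.75
```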
- None
- None
- No
- Not Applicable
- Not Applicable (synthetically generated)
- Anomalous azure authentication detection
- This model version requires the Azure AD sign-on log schema, with the entities (user, service, device, authentication) and "statsFlag" as required fields. The primary application of this model is testing the pipeline.
- No
- None
- No
- No
- No
- No
- No
- Neither
- The synthetic data used for this model is generated using the Faker Python package. The user-agent field is generated by Faker, which pulls items from its own dataset of fictitious values (located in the linked repo). Similarly, the event-source field is chosen randomly from a list of event names provided in the Azure log dataset. No privacy concerns or PII are involved in this synthetic data generation process.
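The generation process above can be sketched with the standard library's `random` module (the actual pipeline uses Faker's providers instead). The value pools and field names below are illustrative; only "statsFlag" and the 1,992-event training-set size come from this card.

```python
import random

# Illustrative pools; the real pipeline draws user agents from Faker's
# built-in provider and event sources from the Azure log dataset.
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0)", "Mozilla/5.0 (X11; Linux x86_64)"]
EVENT_SOURCES = ["Office365", "SharePoint", "Teams"]

def make_event(rng, i):
    """Generate one fictitious sign-in event; no real PII is involved."""
    return {
        "authenticationId": f"auth-{i:04d}",
        "userAgent": rng.choice(USER_AGENTS),
        "eventSource": rng.choice(EVENT_SOURCES),
        "statsFlag": rng.choice(["success", "failure"]),
    }

rng = random.Random(42)  # seeded for reproducibility
events = [make_event(rng, i) for i in range(1992)]  # training-set size from this card
print(len(events))
```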
Protected classes used to create this model? (The following were used in the model's training:)
- Not applicable
- The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes.
- No (as the dataset is fully synthetic)
- Not Applicable (no PII collected)
- No
- No
- No
- No
- Yes, training dataset
- Not Applicable
Is data compliant with data subject requests for data correction or removal, if such a request was made?
- Not Applicable (as data is synthetic)