String Resemblance Grouping (SRG) is designed to find a subset of representative strings within a large collection of messages. These representative strings create groupings with which to categorize the messages for further exploration or triage.
1.0
SRG requires an environment set up to use RAPIDS.
- Problem Background
- Use Case
- Technique Overview
- Model Overview
- Training Input
- Inference Input
- Inference Output
- Future Work
- References
When approaching the problem of categorizing computer logs into groups with an assigned representative, there are two major considerations: the run time of the algorithm and hyperparameter selection. Data sets of millions of log entries, a large fraction of them unique, are typically analyzed reactively: a problem emerges in the network and the data is searched for relevant information to resolve the issue. What is proposed here is a way to approach the data proactively, for situational awareness and for uncovering problems in the network that current heuristics and approaches have not discovered. The large volume of these logs necessitates a run time complexity below O(n²), since pairwise comparison of millions of messages is infeasible.
The second consideration is one of hyperparameters. In most clustering approaches, the number of clusters, k, must be chosen before the algorithm runs, and a good value is rarely obvious for a large, heterogeneous collection of logs.
These two considerations drive many of the design decisions of the SRG approach.
SRG is agnostic to log type. It can be trained over a single log source or multiple log sources in a single data set. A model can be trained and fit over the same data set to provide immediate insight into that set, or alternatively trained and saved to categorize an ongoing flow of log messages.
The breadth of literature on string resemblance provides a good starting point for the problem at hand, so the primary focuses become time complexity and hyperparameter selection, as discussed in the Problem Background. The approach explored here therefore tries to balance time complexity against data-driven hyperparameter selection. A large number of clustering algorithms require the number of clusters, k, to be supplied up front; SRG instead derives it from the data.
In order to keep the time complexity low when selecting the number of clusters, SRG works by subsetting the logs on successively finer parameters, such as message length. The number of resulting disjoint subsets becomes the number of representatives, k.
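As a minimal sketch of this subsetting idea, here is length-based bucketing in plain Python (the actual implementation operates on GPU dataframes, and the bin width here is an illustrative assumption, not a documented default):

```python
from collections import defaultdict

def bucket_by_length(logs, bin_width=16):
    """Partition logs into disjoint buckets by character length.

    bin_width is a hypothetical tuning knob: logs whose lengths fall
    in the same bin_width-wide interval land in the same bucket.
    """
    buckets = defaultdict(list)
    for log in logs:
        buckets[len(log) // bin_width].append(log)
    return dict(buckets)

logs = [
    "user login ok",
    "user logout ok",
    "ERROR: connection to database server 10.0.0.5 timed out after 30 seconds",
]
buckets = bucket_by_length(logs)
# the two short messages share a bucket; the long error message is separated
```

Each bucket can then be processed independently, which keeps the per-bucket work small and parallelizable.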
Next, the strings are shingled by either words or character n-grams.
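The shingling step can be illustrated with a short sketch (plain Python for illustration; the word- and character-level variants below are generic implementations, not the library's exact API):

```python
def word_shingles(s, k=2):
    """Set of k-word shingles (word n-grams) from a string."""
    words = s.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def char_shingles(s, k=4):
    """Set of overlapping k-character shingles from a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

msg = "failed login for user admin"
word_shingles(msg)   # pairs of adjacent words
char_shingles(msg)   # overlapping 4-character substrings
```

Shingle sets make resemblance measurable: two strings are similar when their shingle sets overlap heavily, regardless of small edits.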
The last piece of subsetting can apply domain knowledge about the specific network logs being analyzed, such as pre-grouping HTTP URLs by returned status code. Metadata associated with a log is typically correlated with its content, and logs with similar metadata can be more similar to each other than to logs with different metadata. When such domain knowledge can be leveraged, it provides more focused situational awareness and better clustering.
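For example, the HTTP status-code pre-grouping mentioned above might look like this (a hedged sketch; the record layout and values are hypothetical):

```python
from collections import defaultdict

# hypothetical parsed HTTP log records: (status_code, url)
http_logs = [
    (200, "/index.html"),
    (404, "/missing/page"),
    (200, "/api/v1/users"),
    (500, "/api/v1/orders"),
]

# pre-group by status code before any string-resemblance clustering;
# each status group is then clustered independently
by_status = defaultdict(list)
for status, url in http_logs:
    by_status[status].append(url)
```

This keeps, say, 404s from being grouped with 200s just because their URLs look alike.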
A benefit to this approach is that, instead of fixing the number of groups in advance, the number of groups emerges from the structure of the data itself.
The model stores the representatives and the means for assigning new messages to a representative group.
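A minimal sketch of how a stored model might assign a new message to a representative group, using Jaccard similarity over character shingles (an illustrative choice of distance; the representatives and messages below are made up):

```python
def shingle(s, k=4):
    """Overlapping k-character shingles of a string."""
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# hypothetical representatives stored by a trained model
representatives = ["user login succeeded", "connection timed out"]

def assign(message):
    """Return the representative most similar to the message."""
    m = shingle(message)
    return max(representatives, key=lambda r: jaccard(m, shingle(r)))

assign("user login failed")  # closest to "user login succeeded"
```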
A collection of logs. These can be in a text file to be loaded into a RAPIDS cuDF dataframe, or an already loaded collection.
A single log instance or collection of logs or text files containing logs.
The representative and a corresponding numeric label for the log instance, or a dataframe containing the original logs with their assigned representatives and numeric groups.
See this notebook for an example of building an SRG model and running inference with it.
Currently SRG representatives and groups are output as the final result. Future work will instead leverage these representatives as initial "centroids" in a k-means-style refinement of the groups.
Further work will look into bootstrapping the length and FastMap 1-D groupings using ensemble clustering, an approach that finds a meta-cluster label for data that has multiple clustering labels assigned to each point. The benefit, especially for the FastMap grouping, is smoothing out the variance in the 1-D groupings. The impact on runtime is offset by the fact that all of the 1-D groupings can be run simultaneously, so the largest addition to runtime comes from the ensemble clustering algorithm itself.
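The FastMap 1-D grouping referenced above can be sketched as follows (a generic FastMap projection using Jaccard distance on character shingles; the pivot-selection heuristic and distance choice are illustrative assumptions, not the library's exact method):

```python
def shingle(s, k=3):
    """Overlapping k-character shingles of a string."""
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def jaccard_dist(s, t):
    """Jaccard distance between the shingle sets of two strings."""
    a, b = shingle(s), shingle(t)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def fastmap_1d(strings, dist):
    """Project strings onto one axis with the FastMap heuristic:
    pick two far-apart pivots, then place every string by the
    law of cosines relative to the pivot pair."""
    a = strings[0]
    b = max(strings, key=lambda s: dist(a, s))   # farthest from a
    a = max(strings, key=lambda s: dist(b, s))   # farthest from b
    d_ab = dist(a, b)
    if d_ab == 0.0:
        return [0.0] * len(strings)
    return [
        (dist(a, s) ** 2 + d_ab ** 2 - dist(b, s) ** 2) / (2 * d_ab)
        for s in strings
    ]

logs = [
    "user login succeeded",
    "user login failed",
    "connection timed out",
    "connection reset by peer",
]
coords = fastmap_1d(logs, jaccard_dist)
# similar strings receive nearby 1-D coordinates and can be grouped by binning
```

Because each projection is independent, several such 1-D groupings can run in parallel, which is what keeps the ensemble-clustering runtime cost dominated by the ensemble step itself.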
- https://dl.acm.org/doi/abs/10.1145/568271.223812
- https://repository.upenn.edu/cgi/viewcontent.cgi?article=1615&context=statistics_papers
String Resemblance Grouping (SRG) is designed to find a subset of representative strings within a large collection of messages. These representative strings create groupings with which to categorize the messages for further exploration or triage. This particular model was built using Windows log data.
- https://dl.acm.org/doi/abs/10.1145/568271.223812
- https://repository.upenn.edu/cgi/viewcontent.cgi?article=1615&context=statistics_papers
Architecture Type:
- Not Applicable (N/A)
Network Architecture:
- None
Input Format:
- String
Input Parameters:
- None
Other Properties Related to Input:
- None
Output Format:
- Cluster label and cluster representative
Output Parameters:
- None
Other Properties Related to Output:
- None
Runtime(s):
- Not Applicable (N/A)
Supported Hardware Platform(s):
- All
Supported Operating System(s):
- Linux
- 20230627
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- A collection of 114535 Windows logs
Dataset License:
- Owned and hosted by Zenodo
Link:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
- A collection of 114535 Windows logs
Dataset License:
- Owned and hosted by Zenodo
Engine:
- Other (Not Listed)
Test Hardware:
- Other (Not Listed)
- Not Applicable
- Not Applicable
- Not Applicable
- English: 100%
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
- Not Applicable
Individuals from the following adversely impacted (protected classes) groups participated in model design and testing:
- Not Applicable
- Not Applicable
- This model is intended to be used to syntactically cluster Windows logs.
- This model is intended for developers that want to build and/or customize syntactic clusters or groupings of a collection of logs.
- This model is intended for anyone that wants to syntactically cluster Windows logs for data insight or triage.
- This model outputs a cluster label and the corresponding cluster representative.
Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of:
- Not Applicable
- Windows log files that are too syntactically different from the training data or from different versions of Windows from the training set.
- Cluster spread (mean and standard deviation)
- None
- Familiarity with clustering techniques
- No
- N/A
- No
- This model is intended to be used to syntactically cluster Windows logs for data insight and triage.
- The model can only be used with Windows log data.
- No
- None
- No
- Yes
- No
- No
- No
- Neither
- N/A
Protected classes used to create this model? (The following were used in the model's training:)
- None of the Above
- Annually
- Yes
- N/A
- No
- Yes
- Yes
- Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?
- No