add root cause localization transformer #4925
@@ -0,0 +1,47 @@
At Microsoft, we developed a decision-tree-based root cause localization method that helps find the root causes of an anomaly incident incrementally.

## Multi-Dimensional Root Cause Localization

It is common for one measure to be collected with many dimensions (*e.g.*, Province, ISP) whose values are categorical (*e.g.*, Beijing or Shanghai for the dimension Province). When a measure's value deviates from its expected value, the measure encounters an anomaly. In such cases, operators want to localize the root cause dimension combinations rapidly and accurately. Multi-dimensional root cause localization is critical for troubleshooting and mitigating such incidents.
## Algorithm

The decision-tree-based root cause localization method is unsupervised, which means a training step is not needed. It consists of the following major steps:

(1) Find the best dimension, which divides the anomalous and regular data, based on a decision tree according to entropy gain and entropy gain ratio.

(2) Find the top anomaly points for the selected best dimension.
### Decision Tree

The [decision tree](https://en.wikipedia.org/wiki/Decision_tree) algorithm chooses the split with the highest information gain to construct a decision tree. We use it to choose the dimension which contributes the most to the anomaly. Below are some concepts used in decision trees.
#### Information Entropy

Information [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) is a measure of disorder or uncertainty. You can also think of it as a measure of purity: the lower the value, the purer the data $D$.

$$Ent(D) = - \sum_{k=1}^{|y|} p_k\log_2(p_k) $$

where $p_k$ represents the probability of an element in the dataset belonging to class $k$, and $|y|$ is the number of classes. In our case, there are only two classes: the anomaly points and the regular points.
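As an illustrative sketch (not the ML.NET implementation), the entropy of a set of anomaly/regular labels can be computed like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy Ent(D) of a list of class labels."""
    n = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / n
        ent -= p * math.log2(p)
    return ent

# A perfectly mixed set has maximal entropy (1 bit for two classes);
# a pure set has zero entropy.
print(entropy(["anomaly", "regular", "anomaly", "regular"]))  # 1.0
print(entropy(["regular", "regular", "regular", "regular"]))  # 0.0
```

Here a "label" is simply whether a point is an anomaly, matching the two classes described above.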
#### Information Gain

[Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) is a metric that measures the reduction of this disorder in our target class given additional information about it. Mathematically it can be written as:

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{|V|} \frac{|D^v|}{|D|} Ent(D^v) $$

where $Ent(D^v)$ is the entropy of the set of points in $D$ for which dimension $a$ is equal to $v$, $|D|$ is the total number of points in dataset $D$, and $|D^v|$ is the number of points in $D$ for which dimension $a$ is equal to $v$.

For all aggregated dimensions, we calculate the information gain for each dimension. The greater the reduction in this uncertainty, the more information is gained about $D$ from dimension $a$.
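Continuing the sketch above (again illustrative only, with hypothetical toy data, not the ML.NET implementation), information gain splits the points on one dimension and measures how much the entropy drops:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy Ent(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(points, dimension):
    """Gain(D, a): entropy reduction from splitting `points` on `dimension`.
    Each point is a pair (dims_dict, is_anomaly)."""
    labels = [is_anomaly for _, is_anomaly in points]
    total = len(points)
    remainder = 0.0
    for v in {dims[dimension] for dims, _ in points}:
        subset = [is_anomaly for dims, is_anomaly in points if dims[dimension] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data: DataCenter separates anomalies perfectly, DeviceType does not.
points = [
    ({"DeviceType": "Laptop", "DataCenter": "DC1"}, True),
    ({"DeviceType": "Mobile", "DataCenter": "DC1"}, True),
    ({"DeviceType": "Laptop", "DataCenter": "DC2"}, False),
    ({"DeviceType": "Mobile", "DataCenter": "DC2"}, False),
]
print(information_gain(points, "DataCenter"))  # 1.0 (perfect split)
print(information_gain(points, "DeviceType"))  # 0.0 (no information)
```

The dimension with the larger gain (DataCenter here) is the better candidate for explaining the anomaly.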
#### Entropy Gain Ratio

Information gain is biased toward variables with a large number of distinct values. A modification is the [information gain ratio](https://en.wikipedia.org/wiki/Information_gain_ratio), which reduces this bias.

$$Ratio(D, a) = \frac{Gain(D,a)} {IV(a)} $$

where the intrinsic value ($IV$) is the entropy of the split (with respect to the dimension $a$ in focus).

$$IV(a) = -\sum_{v=1}^{|V|}\frac{|D^v|} {|D|} \log_2 \frac{|D^v|} {|D|} $$

In our strategy, for all the aggregated dimensions, we first loop over all the dimensions to find those whose entropy gain is above the mean entropy gain ratio; then, from the filtered dimensions, we select the dimension with the highest entropy gain ratio as the best dimension. Meanwhile, dimensions for which the anomaly value count is only one are included in the calculation.
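The selection strategy can be sketched as follows. This is an illustrative reading, not the ML.NET implementation: it assumes, for simplicity, that each dimension's gain is filtered against the mean gain before the highest gain ratio wins.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy Ent(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(points, dim):
    """Return (Gain(D, a), Ratio(D, a)) for dimension `dim`.
    Each point is a pair (dims_dict, is_anomaly)."""
    total = len(points)
    labels = [y for _, y in points]
    remainder, iv = 0.0, 0.0
    for v in {dims[dim] for dims, _ in points}:
        subset = [y for dims, y in points if dims[dim] == v]
        w = len(subset) / total
        remainder += w * entropy(subset)
        iv -= w * math.log2(w)  # intrinsic value IV(a)
    gain = entropy(labels) - remainder
    return gain, (gain / iv if iv > 0 else 0.0)

def best_dimension(points, dims):
    """Keep dimensions whose gain is above the mean gain, then pick the
    one with the highest gain ratio."""
    stats = {d: gain_and_ratio(points, d) for d in dims}
    mean_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = [d for d, (g, _) in stats.items() if g >= mean_gain]
    return max(candidates, key=lambda d: stats[d][1])

points = [
    ({"DeviceType": "Laptop", "DataCenter": "DC1"}, True),
    ({"DeviceType": "Mobile", "DataCenter": "DC1"}, True),
    ({"DeviceType": "Laptop", "DataCenter": "DC2"}, False),
    ({"DeviceType": "Mobile", "DataCenter": "DC2"}, False),
]
print(best_dimension(points, ["DeviceType", "DataCenter"]))  # DataCenter
```

On this toy data DataCenter splits perfectly (gain 1.0, ratio 1.0) while DeviceType carries no information, so DataCenter is selected as the best dimension.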
> [!Note]
> 1. Our algorithm depends on the data you input, so if the input points are incorrect or incomplete, the calculated result will be unexpected.
> 2. Currently, the algorithm localizes the root cause incrementally, which means at most one dimension with its values is detected per call. If you want to find all the dimensions that contribute to the anomaly, you can call this API recursively: update the anomaly entrance with the dimension and dimension value obtained from the last call, and stop when the result is no longer updated or no root cause can be found.
@@ -0,0 +1,159 @@
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.TimeSeries;

namespace Samples.Dynamic
{
    public static class LocalizeRootCause
    {
        // AGG_SYMBOL marks an aggregated dimension value. A point whose
        // dimension value is AGG_SYMBOL (e.g. DeviceType = "##SUM##")
        // represents the aggregation (here the sum, matching the
        // AggregateType.Sum passed below) over all values of that dimension,
        // rather than a concrete value such as "DC1".
        private static string AGG_SYMBOL = "##SUM##";

        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The 'RootCauseLocalization' API does not
            // require training data as the estimator ('RootCauseLocalizationEstimator')
            // created by the 'RootCauseLocalization' API is not a trainable estimator. The
            // empty list is only needed to pass the input schema to the pipeline.
            var emptySamples = new List<RootCauseLocalizationData>();

            // Convert the sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for localizing the root cause.
            var localizePipeline = mlContext.Transforms.LocalizeRootCause(nameof(RootCauseLocalizationTransformedData.RootCause), nameof(RootCauseLocalizationData.Input));

            // Fit to data.
            var localizeTransformer = localizePipeline.Fit(emptyDataView);

            // Create the prediction engine to get the root cause result from the input data.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<RootCauseLocalizationData,
                RootCauseLocalizationTransformedData>(localizeTransformer);

            // Call the prediction API.
            DateTime timestamp = GetTimestamp();
            var data = new RootCauseLocalizationData(timestamp, GetAnomalyDimension(), new List<MetricSlice>() { new MetricSlice(timestamp, GetPoints()) }, AggregateType.Sum, AGG_SYMBOL);

            var prediction = predictionEngine.Predict(data);

            // Print the localization result.
            int count = 0;
            foreach (RootCauseItem item in prediction.RootCause.Items)
            {
                count++;
                Console.WriteLine($"Root cause item #{count} ...");
                Console.WriteLine($"Score: {item.Score}, Path: {String.Join(" ", item.Path)}, Direction: {item.Direction}, Dimension:{String.Join(" ", item.Dimension)}");
            }

            // Expected output:
            // Root cause item #1 ...
            // Score: 0.26670448876705927, Path: DataCenter, Direction: Up, Dimension:[Country, UK] [DeviceType, ##SUM##] [DataCenter, DC1]
        }

        private static List<Point> GetPoints()
        {
            List<Point> points = new List<Point>();

            Dictionary<string, Object> dic1 = new Dictionary<string, Object>();
            dic1.Add("Country", "UK");
            dic1.Add("DeviceType", "Laptop");
            dic1.Add("DataCenter", "DC1");
            points.Add(new Point(200, 100, true, dic1));

            Dictionary<string, Object> dic2 = new Dictionary<string, Object>();
            dic2.Add("Country", "UK");
            dic2.Add("DeviceType", "Mobile");
            dic2.Add("DataCenter", "DC1");
            points.Add(new Point(1000, 100, true, dic2));

            Dictionary<string, Object> dic3 = new Dictionary<string, Object>();
            dic3.Add("Country", "UK");
            dic3.Add("DeviceType", AGG_SYMBOL);
            dic3.Add("DataCenter", "DC1");
            points.Add(new Point(1200, 200, true, dic3));

            Dictionary<string, Object> dic4 = new Dictionary<string, Object>();
            dic4.Add("Country", "UK");
            dic4.Add("DeviceType", "Laptop");
            dic4.Add("DataCenter", "DC2");
            points.Add(new Point(100, 100, false, dic4));

            Dictionary<string, Object> dic5 = new Dictionary<string, Object>();
            dic5.Add("Country", "UK");
            dic5.Add("DeviceType", "Mobile");
            dic5.Add("DataCenter", "DC2");
            points.Add(new Point(200, 200, false, dic5));

            Dictionary<string, Object> dic6 = new Dictionary<string, Object>();
            dic6.Add("Country", "UK");
            dic6.Add("DeviceType", AGG_SYMBOL);
            dic6.Add("DataCenter", "DC2");
            points.Add(new Point(300, 300, false, dic6));

            Dictionary<string, Object> dic7 = new Dictionary<string, Object>();
            dic7.Add("Country", "UK");
            dic7.Add("DeviceType", AGG_SYMBOL);
            dic7.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(1500, 500, true, dic7));

            Dictionary<string, Object> dic8 = new Dictionary<string, Object>();
            dic8.Add("Country", "UK");
            dic8.Add("DeviceType", "Laptop");
            dic8.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(300, 200, true, dic8));

            Dictionary<string, Object> dic9 = new Dictionary<string, Object>();
            dic9.Add("Country", "UK");
            dic9.Add("DeviceType", "Mobile");
            dic9.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(1200, 300, true, dic9));

            return points;
        }

        private static Dictionary<string, Object> GetAnomalyDimension()
        {
            Dictionary<string, Object> dim = new Dictionary<string, Object>();
            dim.Add("Country", "UK");
            dim.Add("DeviceType", AGG_SYMBOL);
            dim.Add("DataCenter", AGG_SYMBOL);

            return dim;
        }

        private static DateTime GetTimestamp()
        {
            return new DateTime(2020, 3, 23, 0, 0, 0);
        }

        private class RootCauseLocalizationData
        {
            [RootCauseLocalizationInputType]
            public RootCauseLocalizationInput Input { get; set; }

            public RootCauseLocalizationData()
            {
                Input = null;
            }

            public RootCauseLocalizationData(DateTime anomalyTimestamp, Dictionary<string, Object> anomalyDimensions, List<MetricSlice> slices, AggregateType aggregateType, string aggregateSymbol)
            {
                Input = new RootCauseLocalizationInput(anomalyTimestamp, anomalyDimensions, slices, aggregateType,
                    aggregateSymbol);
            }
        }

        private class RootCauseLocalizationTransformedData
        {
            [RootCauseType()]
            public RootCause RootCause { get; set; }

            public RootCauseLocalizationTransformedData()
            {
                RootCause = null;
            }
        }
    }
}
@@ -143,9 +143,28 @@ public static SsaSpikeEstimator DetectSpikeBySsa(this TransformsCatalog catalog,
        /// </format>
        /// </example>
        public static SrCnnAnomalyEstimator DetectAnomalyBySrCnn(this TransformsCatalog catalog, string outputColumnName, string inputColumnName,
            int windowSize = 64, int backAddWindowSize = 5, int lookaheadWindowSize = 5, int averageingWindowSize = 3, int judgementWindowSize = 21, double threshold = 0.3)
            => new SrCnnAnomalyEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, windowSize, backAddWindowSize, lookaheadWindowSize, averageingWindowSize, judgementWindowSize, threshold, inputColumnName);
        /// <summary>
        /// Create <see cref="RootCauseLocalizationEstimator"/>, which localizes root causes using a decision tree algorithm.
        /// </summary>
        /// <param name="catalog">The transform's catalog.</param>
        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
        /// The column data is an instance of <see cref="Microsoft.ML.TimeSeries.RootCause"/>.</param>
        /// <param name="inputColumnName">Name of the input column.
        /// The column data is an instance of <see cref="Microsoft.ML.TimeSeries.RootCauseLocalizationInput"/>.</param>
        /// <param name="beta">The weight parameter used when scoring a root cause item. The range of the parameter should be in [0,1].</param>
        /// <example>
        /// <format type="text/markdown">
        /// <![CDATA[
        /// [!code-csharp[LocalizeRootCause](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/TimeSeries/LocalizeRootCause.cs)]
        /// ]]>
        /// </format>
        /// </example>
        public static RootCauseLocalizationEstimator LocalizeRootCause(this TransformsCatalog catalog, string outputColumnName, string inputColumnName = null, double beta = 0.5)
            => new RootCauseLocalizationEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, inputColumnName ?? outputColumnName, beta);
        /// <summary>
        /// Singular Spectrum Analysis (SSA) model for univariate time-series forecasting.
        /// For the details of the model, refer to http://arxiv.org/pdf/1206.6910.pdf.