Commit 82db42f: Add metrics in docs (parent 7b47366)

8 files changed: +338 -2 lines

docs/mkdocs/docs/glossary/metric-definitions.md

Lines changed: 186 additions & 0 deletions
# Metric Definitions

## Descriptive statistics

### Missing values
The missing values metric counts, per feature, the number of missing values. Missing values include `NaN` in numeric arrays, `NaN` or `None` in object arrays, and `NaT` in datetime-like arrays.

### Non-Missing values
The non-missing values metric counts, per feature, the number of non-missing values, i.e. all values other than `NaN` in numeric arrays, `NaN` or `None` in object arrays, and `NaT` in datetime-like arrays.
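
A minimal sketch of both counts, assuming the metrics mirror pandas' handling of `NaN`, `None` and `NaT` (the example DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],                                  # NaN in a numeric array
    "city": ["Athens", None, "Berlin"],                           # None in an object array
    "visit": pd.to_datetime(["2023-01-01", None, "2023-02-01"]),  # NaT in a datetime-like array
})

missing_count = df.isna().sum()       # missing values per feature: age 1, city 1, visit 1
non_missing_count = df.notna().sum()  # non-missing values per feature: age 2, city 2, visit 2
```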

### Mean or Average value
Returns the average value per feature, excluding `NaN` and `null` values.

### Minimum value
Returns the minimum value per feature.

### Maximum value
Returns the maximum value per feature.

### Summary
Returns the sum of the values per feature, excluding `NaN` and `null` values during the calculation.
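
A minimal pandas sketch of the four aggregates above; pandas skips `NaN`/null values by default (`skipna=True`), matching the definitions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, np.nan]})

print(df.mean())  # x: 2.0, y: 3.0 (NaN excluded)
print(df.min())   # x: 1.0, y: 2.0
print(df.max())   # x: 3.0, y: 4.0
print(df.sum())   # x: 4.0, y: 6.0
```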

### Standard Deviation
Returns the sample standard deviation per feature, normalized by N-1 and excluding `NaN` and `null` values during the calculation. Formula:

$$
\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N - 1}}
$$
29+
30+
### Variance
31+
Returns the unbiased variance per feature normalized by N-1 excluding `NaN` and `null` values during calculations. Formula:
32+
33+
$$
34+
σ^2 = {Σ(x_i-μ)^2 \over Ν-1}
35+
$$
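
Both formulas use the sample (N-1) normalization. A minimal sketch: pandas uses `ddof=1` by default, while NumPy defaults to `ddof=0` and needs it set explicitly:

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(x.std())            # sample standard deviation (ddof=1)
print(x.var())            # unbiased variance (ddof=1)
print(np.std(x, ddof=1))  # same result as x.std()
```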

## Evaluation metrics

### Confusion Matrix
Returns the number of TP, TN, FP and FN. In the case of `multi-class classification`, it returns the number of TP, TN, FP and FN per class.

A typical example for `binary classification` can be seen below, in which:

- 20 observations were correctly classified as positive.
- 10 observations were incorrectly classified as negative while they were actually positive.
- 5 observations were incorrectly classified as positive while they were actually negative.
- 75 observations were correctly classified as negative.

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 20 *(TP)*          | 10 *(FN)*          |
| Actual Negative | 5 *(FP)*           | 75 *(TN)*          |

A typical example for `multi-class classification` can be seen below, in which:

- 15 observations were correctly classified as Class A.
- 5 observations were incorrectly classified as Class B while they were actually Class A.
- 2 observations were incorrectly classified as Class C while they were actually Class A.
- 4 observations were incorrectly classified as Class A while they were actually Class B.
- 20 observations were correctly classified as Class B.
- 3 observations were incorrectly classified as Class C while they were actually Class B.
- 2 observations were incorrectly classified as Class A while they were actually Class C.
- 8 observations were incorrectly classified as Class B while they were actually Class C.
- 25 observations were correctly classified as Class C.

|                | Predicted Class A | Predicted Class B | Predicted Class C |
|----------------|-------------------|-------------------|-------------------|
| Actual Class A | 15 *(TP_A)*       | 5                 | 2                 |
| Actual Class B | 4                 | 20 *(TP_B)*       | 3                 |
| Actual Class C | 2                 | 8                 | 25 *(TP_C)*       |
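
A minimal sketch with scikit-learn that reproduces the binary example above (the label arrays are constructed to give 20 TP, 10 FN, 5 FP, 75 TN); for the multi-class case, `confusion_matrix` returns the full K-by-K table directly:

```python
from sklearn.metrics import confusion_matrix

y_true = [1] * 30 + [0] * 80                       # 30 actual positives, 80 actual negatives
y_pred = [1] * 20 + [0] * 10 + [1] * 5 + [0] * 75  # the corresponding predictions

# rows = actual, columns = predicted; ravel() order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 20 10 5 75
```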

### Accuracy
Returns the accuracy classification score. In `multi-class classification`, this computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in `y_true`. Formula:

$$
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$
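
A minimal sketch; `accuracy_score` implements the subset accuracy described above and works for binary and multi-class labels alike (the label arrays are hypothetical):

```python
from sklearn.metrics import accuracy_score

y_true = ["A", "A", "B", "C", "C"]
y_pred = ["A", "B", "B", "C", "A"]

print(accuracy_score(y_true, y_pred))  # 3 of 5 labels match -> 0.6
```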

### Precision
Returns the precision classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
precision = \frac{TP}{TP + FP}
$$
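
A minimal sketch of the three averaging modes with scikit-learn (the label arrays are hypothetical):

```python
from sklearn.metrics import precision_score

y_true = ["A", "A", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "A"]

for average in ("micro", "macro", "weighted"):
    print(average, precision_score(y_true, y_pred, average=average))
```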

### Recall
Returns the recall classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
recall = \frac{TP}{TP + FN}
$$

### F1 score
Returns the F1 classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}
$$
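
`recall_score` and `f1_score` accept the same averaging modes as `precision_score`, so a single sketch covers both (same hypothetical labels as above):

```python
from sklearn.metrics import f1_score, recall_score

y_true = ["A", "A", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "A"]

for average in ("micro", "macro", "weighted"):
    print(average,
          recall_score(y_true, y_pred, average=average),
          f1_score(y_true, y_pred, average=average))
```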

## Statistical tests and techniques

### Kolmogorov-Smirnov Two Sample test
When there are two datasets, the K-S two-sample test can be used to test the agreement between their distributions. The null hypothesis states that there is no difference between the two distributions. Formula:

$$
D = \max_X \left| F_a(X) - F_b(X) \right|
$$

where:

- $a$ = observations from the first dataset.
- $b$ = observations from the second dataset.
- $F_n(X)$ = the observed cumulative frequency distribution of a random sample of $n$ observations.
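
A minimal sketch with SciPy; `ks_2samp` returns the $D$ statistic above together with a p-value for the null hypothesis (the sample data are hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)  # observations from the first dataset
b = rng.normal(loc=0.3, scale=1.0, size=500)  # observations from the second dataset

statistic, p_value = ks_2samp(a, b)
print(statistic, p_value)  # reject the null at the 0.95 level if p_value < 0.05
```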

### Chi-squared test
A chi-squared test is a statistical test used to compare two datasets. Its purpose is to determine whether a difference between the two datasets is due to chance, or due to a relationship between the variables being studied. Formula:

$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
$$

where:

- $\chi^2$ = the chi-squared statistic
- $O_i$ = the observed values (1st dataset)
- $E_i$ = the expected values (2nd dataset)
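
A minimal sketch with SciPy; `chisquare` compares observed counts ($O_i$) against expected counts ($E_i$) exactly as in the formula (the counts are hypothetical, and both sets must sum to the same total):

```python
from scipy.stats import chisquare

observed = [18, 22, 60]  # category counts in the 1st dataset (O_i)
expected = [25, 25, 50]  # category counts in the 2nd dataset (E_i)

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(statistic, p_value)
```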

### Z-score for independent proportions
The purpose of the Z-test for independent proportions is to compare two independent datasets. Formula:

$$
Z = \frac{p_1 - p_2}{\sqrt{p' q' \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
$$

where:

- $Z$ = the Z-statistic, which is compared to the standard normal deviate
- $p_1, p_2$ = the proportions of the two datasets
- $p'$ = the estimated true proportion under the null hypothesis
- $q'$ = $1 - p'$
- $n_1, n_2$ = the number of observations in the two datasets
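
A minimal sketch implementing the formula directly with NumPy (the success counts are hypothetical):

```python
import numpy as np
from scipy.stats import norm

x1, n1 = 45, 100  # successes / observations in the first dataset
x2, n2 = 60, 120  # successes / observations in the second dataset

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)  # p', estimated under the null hypothesis
q_pooled = 1.0 - p_pooled         # q'

z = (p1 - p2) / np.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
p_value = 2 * norm.sf(abs(z))     # two-sided p-value
print(z, p_value)
```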

### Wasserstein distance
The Wasserstein distance is a metric that describes the distance between the distributions of two datasets. Formula:

$$
W = \left( \int_0^1 \left| F_A^{-1}(u) - F_B^{-1}(u) \right|^2 \, du \right)^{1/2}
$$

where:

- $W$ = the Wasserstein distance
- $F_A, F_B$ = the corresponding cumulative distribution functions of the two datasets
- $F_A^{-1}, F_B^{-1}$ = the respective quantile functions
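
A minimal sketch that approximates the quantile-function integral above on an evenly spaced grid; note that `scipy.stats.wasserstein_distance` computes the first-order ($W_1$) distance, not this squared variant:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # samples from the first dataset
b = rng.normal(0.5, 1.0, 1000)  # samples from the second dataset

u = np.linspace(0.001, 0.999, 999)    # evenly spaced grid over (0, 1)
qa = np.quantile(a, u)                # F_A^{-1}(u)
qb = np.quantile(b, u)                # F_B^{-1}(u)

w = np.sqrt(np.mean((qa - qb) ** 2))  # discretized form of the integral
print(w)
```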

### Jensen–Shannon divergence
The Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. Formula:

$$
JS = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M)
$$

where:

- $JS$ = the Jensen–Shannon divergence
- $KL$ = the Kullback-Leibler divergence: $KL(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$
- $P, Q$ = the distributions of the two datasets
- $M$ = $\frac{1}{2}(P + Q)$
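
A minimal sketch implementing $JS$ from the $KL$ definition above for discrete distributions ($P$ and $Q$ are hypothetical); for reference, `scipy.spatial.distance.jensenshannon` returns the square root of this quantity:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(P || Q) for discrete distributions."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
m = 0.5 * (p + q)  # the mixture M

js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print(js)
```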

docs/mkdocs/docs/mathjax-config.js

Lines changed: 27 additions & 0 deletions
```js
/* mathjax-loader.js file */
/* ref: http://facelessuser.github.io/pymdown-extensions/extensions/arithmatex/ */
(function (win, doc) {
  win.MathJax = {
    config: ["MMLorHTML.js"],
    extensions: ["tex2jax.js"],
    jax: ["input/TeX"],
    tex2jax: {
      inlineMath: [ ["\\(", "\\)"] ],
      displayMath: [ ["\\[", "\\]"] ]
    },
    TeX: {
      TagSide: "right",
      TagIndent: ".8em",
      MultLineWidth: "85%",
      equationNumbers: {
        autoNumber: "AMS",
      },
      unicode: {
        fonts: "STIXGeneral,'Arial Unicode MS'"
      }
    },
    displayAlign: 'center',
    showProcessingMessages: false,
    messageStyle: 'none'
  };
})(window, document);
```

docs/mkdocs/docs/metrics.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/mkdocs/docs/monitor/descriptive-statistics.md

Lines changed: 14 additions & 0 deletions
# Descriptive Statistics

The descriptive statistics are calculated per feature on any given dataset.

| Metric | Type of data |
| --------------- | ------------------|
| <a href="/glossary/metric-definitions/#missing-values" class="external-link" target="_blank">**missing_count**</a> | `numerical` & `categorical` |
| <a href="/glossary/metric-definitions/#non-missing-values" class="external-link" target="_blank">**non_missing_count**</a> | `numerical` & `categorical` |
| <a href="/glossary/metric-definitions/#mean-or-average-value" class="external-link" target="_blank">**mean**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#minimum-value" class="external-link" target="_blank">**minimum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#maximum-value" class="external-link" target="_blank">**maximum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#summary" class="external-link" target="_blank">**sum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#standard-deviation" class="external-link" target="_blank">**standard_deviation**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#variance" class="external-link" target="_blank">**variance**</a> | `numerical` |

docs/mkdocs/docs/monitor/drift-metrics.md

Lines changed: 85 additions & 0 deletions
# Drift metrics

The target of drift metrics is to calculate the drift between two datasets. Currently supported drift types are:

- Data drift
- Concept drift

## Data drift

The analysis compares the current data to the reference data by estimating the distributions of each feature in the two datasets. The schemas of both datasets should be identical.

Returns a drift summary of the following form:

```
{'timestamp': the timestamp of the report,
 'drift_summary':
    {'number_of_columns': total number of dataset columns,
     'number_of_drifted_columns': total number of drifted columns,
     'share_of_drifted_columns': 'number_of_drifted_columns' / 'number_of_columns',
     'dataset_drift': Boolean based on the criteria below,
     'drift_by_columns':
        {'column1': {'column_name': 'column1',
                     'column_type': the type of the column (e.g. num),
                     'stattest_name': the statistical test that was used,
                     'drift_score': the drift score based on the test,
                     'drift_detected': Boolean based on the criteria below,
                     'threshold': a float based on the criteria below},
         ...}
    }
}
```

The logic for choosing the appropriate statistical test (sketched below) is based on:

- the feature type: categorical or numerical
- the number of observations in the reference dataset
- the number of unique values in the feature (n_unique)

For small data with <= 1000 observations in the reference dataset:

- For numerical features (n_unique > 5): the <a href="/glossary/metric-definitions/#kolmogorov-smirnov-two-sample-test" class="external-link" target="_blank">two-sample Kolmogorov-Smirnov test</a>.
- For categorical features or numerical features with n_unique <= 5: the <a href="/glossary/metric-definitions/#chi-squared-test" class="external-link" target="_blank">chi-squared test</a>.
- For binary categorical features (n_unique <= 2): the proportion difference test for independent samples based on the <a href="/glossary/metric-definitions/#z-score-for-independent-proportions" class="external-link" target="_blank">Z-score</a>.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset:

- For numerical features (n_unique > 5): the <a href="/glossary/metric-definitions/#wasserstein-distance" class="external-link" target="_blank">Wasserstein distance</a>.
- For categorical features or numerical features with n_unique <= 5: the <a href="/glossary/metric-definitions/#jensenshannon-divergence" class="external-link" target="_blank">Jensen–Shannon divergence</a>.

All tests use a threshold of 0.1 by default.
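
As referenced above, the selection logic can be summarized in a short sketch (the function name and return labels are illustrative, not the library's actual API):

```python
def choose_stattest(feature_type: str, n_reference: int, n_unique: int) -> str:
    """Hypothetical helper mirroring the data-drift test-selection rules above."""
    if n_reference <= 1000:                      # small reference data
        if n_unique <= 2:
            return "z_test"                      # proportion difference, Z-score
        if feature_type == "num" and n_unique > 5:
            return "ks_2samp"                    # two-sample Kolmogorov-Smirnov
        return "chi_squared"                     # categorical or low-cardinality numerical
    if feature_type == "num" and n_unique > 5:   # large reference data
        return "wasserstein"
    return "jensen_shannon"

print(choose_stattest("num", 800, 42))   # ks_2samp
print(choose_stattest("cat", 5000, 3))   # jensen_shannon
```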

## Concept drift

The analysis compares the current target feature to the reference target feature.

Returns a concept drift summary of the following form:

```
{'timestamp': the timestamp of the report,
 'concept_drift_summary':
    {'column_name': 'column1',
     'column_type': the type of the column (e.g. num),
     'stattest_name': the statistical test that was used,
     'threshold': the threshold used based on the criteria below,
     'drift_score': the drift score based on the test,
     'drift_detected': Boolean based on the criteria below}
}
```

The logic for choosing the appropriate statistical test is based on:

- the number of observations in the reference dataset
- the number of unique values in the target (n_unique)

For small data with <= 1000 observations in the reference dataset:

- For a categorical target with n_unique > 2: the <a href="/glossary/metric-definitions/#chi-squared-test" class="external-link" target="_blank">chi-squared test</a>.
- For a binary categorical target (n_unique <= 2): the proportion difference test for independent samples based on the <a href="/glossary/metric-definitions/#z-score-for-independent-proportions" class="external-link" target="_blank">Z-score</a>.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset, the <a href="/glossary/metric-definitions/#jensenshannon-divergence" class="external-link" target="_blank">Jensen–Shannon divergence</a> is used with a threshold of 0.1.

docs/mkdocs/docs/monitor/evaluation-metrics.md

Lines changed: 14 additions & 0 deletions
# Evaluation Metrics

The target of evaluation metrics is to evaluate the quality of a machine learning model. Currently supported models are:

- Binary classification
- Multi-class classification

| Metric | Supported model |
| --------------- | ------------------|
| <a href="/glossary/metric-definitions/#confusion-matrix" class="external-link" target="_blank">**confusion matrix**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#accuracy" class="external-link" target="_blank">**accuracy**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#precision" class="external-link" target="_blank">**precision**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#recall" class="external-link" target="_blank">**recall**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#f1-score" class="external-link" target="_blank">**f1 score**</a> | `binary classification` & `multi-class classification` |

docs/mkdocs/docs/monitor/explainability.md

Lines changed: 2 additions & 0 deletions

# Explainability

docs/mkdocs/mkdocs.yml

Lines changed: 10 additions & 1 deletion

```diff
@@ -27,12 +27,19 @@ nav:
   - Whitebox: index.md
   - Feaures: features.md
   - ML Monitoring: ml-monitoring.md
-  - Metrics: metrics.md
+  - Monitor:
+      - monitor/descriptive-statistics.md
+      - monitor/evaluation-metrics.md
+      - monitor/drift-metrics.md
+      - monitor/explainability.md
   - Tutorial - User Guide:
       - tutorial/index.md
   - Deploymemt: deployment.md
   - SDK Documantation: sdk-docs.md
+  - Glossary:
+      - glossary/metric-definitions.md
 markdown_extensions:
+  - pymdownx.arithmatex
   - pymdownx.highlight:
       anchor_linenums: true
   - admonition
@@ -50,3 +57,5 @@ extra_css:
 extra_javascript:
   - js/termynal.js
   - js/custom.js
+  - mathjax-config.js
+  - https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML
```
