Commit 82db42f: Add metrics in docs (parent 7b47366)

8 files changed: +338 -2 lines

docs/mkdocs/docs/glossary/metric-definitions.md

Lines changed: 186 additions & 0 deletions
# Metric Definitions

## Descriptive statistics

### Missing values
The missing values metric counts, per feature, the number of missing values. Missing values include `NaN` in numeric arrays, `NaN` or `None` in object arrays, and `NaT` in datetime-like arrays.

### Non-Missing values
The non-missing values metric counts, per feature, the number of non-missing values, i.e. all values other than `NaN` in numeric arrays, `NaN` or `None` in object arrays, and `NaT` in datetime-like arrays.
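
A minimal sketch of both counts, assuming the metrics mirror pandas' handling of `NaN`, `None` and `NaT` (the example DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],                                  # NaN in a numeric array
    "city": ["Athens", None, "Berlin"],                           # None in an object array
    "visit": pd.to_datetime(["2023-01-01", None, "2023-02-01"]),  # NaT in a datetime-like array
})

missing_count = df.isna().sum()       # missing values per feature: age 1, city 1, visit 1
non_missing_count = df.notna().sum()  # non-missing values per feature: age 2, city 2, visit 2
```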

### Mean or Average value
Returns the average value per feature, excluding `NaN` and `null` values.

### Minimum value
Returns the minimum value per feature.

### Maximum value
Returns the maximum value per feature.

### Summary
Returns the sum of the values per feature, excluding `NaN` and `null` values during the calculation.
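
A minimal pandas sketch of the four aggregates above; pandas skips `NaN`/null values by default (`skipna=True`), matching the definitions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, np.nan]})

print(df.mean())  # x: 2.0, y: 3.0 (NaN excluded)
print(df.min())   # x: 1.0, y: 2.0
print(df.max())   # x: 3.0, y: 4.0
print(df.sum())   # x: 4.0, y: 6.0
```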

### Standard Deviation
Returns the sample standard deviation per feature, normalized by N-1 and excluding `NaN` and `null` values during the calculation. Formula:

$$
\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N - 1}}
$$
29+
30+
### Variance
31+
Returns the unbiased variance per feature normalized by N-1 excluding `NaN` and `null` values during calculations. Formula:
32+
33+
$$
34+
σ^2 = {Σ(x_i-μ)^2 \over Ν-1}
35+
$$
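
Both formulas use the sample (N-1) normalization. A minimal sketch: pandas uses `ddof=1` by default, while NumPy defaults to `ddof=0` and needs it set explicitly:

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(x.std())            # sample standard deviation (ddof=1)
print(x.var())            # unbiased variance (ddof=1)
print(np.std(x, ddof=1))  # same result as x.std()
```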

## Evaluation metrics

### Confusion Matrix
Returns the number of TP, TN, FP and FN. In the case of `multi-class classification`, it returns the number of TP, TN, FP and FN per class.

A typical example for `binary classification` can be seen below, in which:

- 20 observations were correctly classified as positive.
- 10 observations were incorrectly classified as negative while they were actually positive.
- 5 observations were incorrectly classified as positive while they were actually negative.
- 75 observations were correctly classified as negative.

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 20 *(TP)*          | 10 *(FN)*          |
| Actual Negative | 5 *(FP)*           | 75 *(TN)*          |

A typical example for `multi-class classification` can be seen below, in which:

- 15 observations were correctly classified as Class A.
- 5 observations were incorrectly classified as Class B while they were actually Class A.
- 2 observations were incorrectly classified as Class C while they were actually Class A.
- 4 observations were incorrectly classified as Class A while they were actually Class B.
- 20 observations were correctly classified as Class B.
- 3 observations were incorrectly classified as Class C while they were actually Class B.
- 2 observations were incorrectly classified as Class A while they were actually Class C.
- 8 observations were incorrectly classified as Class B while they were actually Class C.
- 25 observations were correctly classified as Class C.

|                | Predicted Class A | Predicted Class B | Predicted Class C |
|----------------|-------------------|-------------------|-------------------|
| Actual Class A | 15 *(TP_A)*       | 5                 | 2                 |
| Actual Class B | 4                 | 20 *(TP_B)*       | 3                 |
| Actual Class C | 2                 | 8                 | 25 *(TP_C)*       |
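
A minimal sketch with scikit-learn that reproduces the binary example above (the label arrays are constructed to give 20 TP, 10 FN, 5 FP, 75 TN); for the multi-class case, `confusion_matrix` returns the full K-by-K table directly:

```python
from sklearn.metrics import confusion_matrix

y_true = [1] * 30 + [0] * 80                       # 30 actual positives, 80 actual negatives
y_pred = [1] * 20 + [0] * 10 + [1] * 5 + [0] * 75  # the corresponding predictions

# rows = actual, columns = predicted; ravel() order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 20 10 5 75
```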

### Accuracy
Returns the accuracy classification score. In `multi-class classification`, this computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in `y_true`. Formula:

$$
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$
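
A minimal sketch; `accuracy_score` implements the subset accuracy described above and works for binary and multi-class labels alike (the label arrays are hypothetical):

```python
from sklearn.metrics import accuracy_score

y_true = ["A", "A", "B", "C", "C"]
y_pred = ["A", "B", "B", "C", "A"]

print(accuracy_score(y_true, y_pred))  # 3 of 5 labels match -> 0.6
```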

### Precision
Returns the precision classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
precision = \frac{TP}{TP + FP}
$$
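
A minimal sketch of the three averaging modes with scikit-learn (the label arrays are hypothetical):

```python
from sklearn.metrics import precision_score

y_true = ["A", "A", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "A"]

for average in ("micro", "macro", "weighted"):
    print(average, precision_score(y_true, y_pred, average=average))
```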

### Recall
Returns the recall classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
recall = \frac{TP}{TP + FN}
$$

### F1 score
Returns the F1 classification score. In `multi-class classification`, it returns the following three scores:

- `micro`: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- `macro`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `weighted`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `macro` to account for label imbalance; it can result in an F-score that is not between precision and recall.

Formula:

$$
F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}
$$
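
`recall_score` and `f1_score` accept the same averaging modes as `precision_score`, so a single sketch covers both (same hypothetical labels as above):

```python
from sklearn.metrics import f1_score, recall_score

y_true = ["A", "A", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "A"]

for average in ("micro", "macro", "weighted"):
    print(average,
          recall_score(y_true, y_pred, average=average),
          f1_score(y_true, y_pred, average=average))
```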

## Statistical tests and techniques

### Kolmogorov-Smirnov Two Sample test
When there are two datasets, the K-S two-sample test can be used to test the agreement between their distributions. The null hypothesis states that there is no difference between the two distributions. Formula:

$$
D = \max_X \left| F_a(X) - F_b(X) \right|
$$

where:

- $a$ = observations from the first dataset.
- $b$ = observations from the second dataset.
- $F_n(X)$ = the observed cumulative frequency distribution of a random sample of $n$ observations.
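
A minimal sketch with SciPy; `ks_2samp` returns the $D$ statistic above together with a p-value for the null hypothesis (the sample data are hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)  # observations from the first dataset
b = rng.normal(loc=0.3, scale=1.0, size=500)  # observations from the second dataset

statistic, p_value = ks_2samp(a, b)
print(statistic, p_value)  # reject the null at the 0.95 level if p_value < 0.05
```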

### Chi-squared test
A chi-squared test is a statistical test used to compare two datasets. Its purpose is to determine whether a difference between the two datasets is due to chance, or due to a relationship between the variables being studied. Formula:

$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
$$

where:

- $\chi^2$ = the chi-squared statistic
- $O_i$ = the observed values (1st dataset)
- $E_i$ = the expected values (2nd dataset)
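
A minimal sketch with SciPy; `chisquare` compares observed counts ($O_i$) against expected counts ($E_i$) exactly as in the formula (the counts are hypothetical, and both sets must sum to the same total):

```python
from scipy.stats import chisquare

observed = [18, 22, 60]  # category counts in the 1st dataset (O_i)
expected = [25, 25, 50]  # category counts in the 2nd dataset (E_i)

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(statistic, p_value)
```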

### Z-score for independent proportions
The purpose of the Z-test for independent proportions is to compare two independent datasets. Formula:

$$
Z = \frac{p_1 - p_2}{\sqrt{p' q' \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
$$

where:

- $Z$ = the Z-statistic, which is compared to the standard normal deviate
- $p_1, p_2$ = the proportions of the two datasets
- $p'$ = the estimated true proportion under the null hypothesis
- $q'$ = $1 - p'$
- $n_1, n_2$ = the number of observations in the two datasets
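
A minimal sketch implementing the formula directly with NumPy (the success counts are hypothetical):

```python
import numpy as np
from scipy.stats import norm

x1, n1 = 45, 100  # successes / observations in the first dataset
x2, n2 = 60, 120  # successes / observations in the second dataset

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)  # p', estimated under the null hypothesis
q_pooled = 1.0 - p_pooled         # q'

z = (p1 - p2) / np.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
p_value = 2 * norm.sf(abs(z))     # two-sided p-value
print(z, p_value)
```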

### Wasserstein distance
The Wasserstein distance is a metric that describes the distance between the distributions of two datasets. Formula:

$$
W = \left( \int_0^1 \left| F_A^{-1}(u) - F_B^{-1}(u) \right|^2 \, du \right)^{1/2}
$$

where:

- $W$ = the Wasserstein distance
- $F_A, F_B$ = the corresponding cumulative distribution functions of the two datasets
- $F_A^{-1}, F_B^{-1}$ = the respective quantile functions
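
A minimal sketch that approximates the quantile-function integral above on an evenly spaced grid; note that `scipy.stats.wasserstein_distance` computes the first-order ($W_1$) distance, not this squared variant:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)  # samples from the first dataset
b = rng.normal(0.5, 1.0, 1000)  # samples from the second dataset

u = np.linspace(0.001, 0.999, 999)    # evenly spaced grid over (0, 1)
qa = np.quantile(a, u)                # F_A^{-1}(u)
qb = np.quantile(b, u)                # F_B^{-1}(u)

w = np.sqrt(np.mean((qa - qb) ** 2))  # discretized form of the integral
print(w)
```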

### Jensen–Shannon divergence
The Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. Formula:

$$
JS = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M)
$$

where:

- $JS$ = the Jensen–Shannon divergence
- $KL$ = the Kullback-Leibler divergence: $KL(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$
- $P, Q$ = the distributions of the two datasets
- $M$ = $\frac{1}{2}(P + Q)$
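
A minimal sketch implementing $JS$ from the $KL$ definition above for discrete distributions ($P$ and $Q$ are hypothetical); for reference, `scipy.spatial.distance.jensenshannon` returns the square root of this quantity:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(P || Q) for discrete distributions."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
m = 0.5 * (p + q)  # the mixture M

js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print(js)
```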

docs/mkdocs/docs/mathjax-config.js

Lines changed: 27 additions & 0 deletions
```js
/* mathjax-loader.js file */
/* ref: http://facelessuser.github.io/pymdown-extensions/extensions/arithmatex/ */
(function (win, doc) {
  win.MathJax = {
    config: ["MMLorHTML.js"],
    extensions: ["tex2jax.js"],
    jax: ["input/TeX"],
    tex2jax: {
      inlineMath: [ ["\\(", "\\)"] ],
      displayMath: [ ["\\[", "\\]"] ]
    },
    TeX: {
      TagSide: "right",
      TagIndent: ".8em",
      MultLineWidth: "85%",
      equationNumbers: {
        autoNumber: "AMS",
      },
      unicode: {
        fonts: "STIXGeneral,'Arial Unicode MS'"
      }
    },
    displayAlign: 'center',
    showProcessingMessages: false,
    messageStyle: 'none'
  };
})(window, document);
```

docs/mkdocs/docs/metrics.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/mkdocs/docs/monitor/descriptive-statistics.md

Lines changed: 14 additions & 0 deletions
# Descriptive Statistics

The descriptive statistics are calculated per feature on any given dataset.

| Metric | Type of data |
| --------------- | ------------------|
| <a href="/glossary/metric-definitions/#missing-values" class="external-link" target="_blank">**missing_count**</a> | `numerical` & `categorical` |
| <a href="/glossary/metric-definitions/#non-missing-values" class="external-link" target="_blank">**non_missing_count**</a> | `numerical` & `categorical` |
| <a href="/glossary/metric-definitions/#mean-or-average-value" class="external-link" target="_blank">**mean**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#minimum-value" class="external-link" target="_blank">**minimum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#maximum-value" class="external-link" target="_blank">**maximum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#summary" class="external-link" target="_blank">**sum**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#standard-deviation" class="external-link" target="_blank">**standard_deviation**</a> | `numerical` |
| <a href="/glossary/metric-definitions/#variance" class="external-link" target="_blank">**variance**</a> | `numerical` |

docs/mkdocs/docs/monitor/drift-metrics.md

Lines changed: 85 additions & 0 deletions
# Drift metrics

The target of drift metrics is to calculate the drift between two datasets. Currently supported drift types are:

- Data drift
- Concept drift

## Data drift

The analysis compares the current data to the reference data by estimating the distributions of each feature in the two datasets. The schemas of both datasets should be identical.

Returns a drift summary of the following form:

```
{'timestamp': the timestamp of the report,
 'drift_summary':
    {'number_of_columns': total number of dataset columns,
     'number_of_drifted_columns': total number of drifted columns,
     'share_of_drifted_columns': 'number_of_drifted_columns' / 'number_of_columns',
     'dataset_drift': Boolean based on the criteria below,
     'drift_by_columns':
        {'column1': {'column_name': 'column1',
                     'column_type': the type of the column (e.g. num),
                     'stattest_name': the statistical test that was used,
                     'drift_score': the drift score based on the test,
                     'drift_detected': Boolean based on the criteria below,
                     'threshold': a float based on the criteria below},
         ...}
    }
}
```

The logic for choosing the appropriate statistical test (sketched below) is based on:

- the feature type: categorical or numerical
- the number of observations in the reference dataset
- the number of unique values in the feature (n_unique)

For small data with <= 1000 observations in the reference dataset:

- For numerical features (n_unique > 5): the <a href="/glossary/metric-definitions/#kolmogorov-smirnov-two-sample-test" class="external-link" target="_blank">two-sample Kolmogorov-Smirnov test</a>.
- For categorical features or numerical features with n_unique <= 5: the <a href="/glossary/metric-definitions/#chi-squared-test" class="external-link" target="_blank">chi-squared test</a>.
- For binary categorical features (n_unique <= 2): the proportion difference test for independent samples based on the <a href="/glossary/metric-definitions/#z-score-for-independent-proportions" class="external-link" target="_blank">Z-score</a>.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset:

- For numerical features (n_unique > 5): the <a href="/glossary/metric-definitions/#wasserstein-distance" class="external-link" target="_blank">Wasserstein distance</a>.
- For categorical features or numerical features with n_unique <= 5: the <a href="/glossary/metric-definitions/#jensenshannon-divergence" class="external-link" target="_blank">Jensen–Shannon divergence</a>.

All tests use a threshold of 0.1 by default.
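
As referenced above, the selection logic can be summarized in a short sketch (the function name and return labels are illustrative, not the library's actual API):

```python
def choose_stattest(feature_type: str, n_reference: int, n_unique: int) -> str:
    """Hypothetical helper mirroring the data-drift test-selection rules above."""
    if n_reference <= 1000:                      # small reference data
        if n_unique <= 2:
            return "z_test"                      # proportion difference, Z-score
        if feature_type == "num" and n_unique > 5:
            return "ks_2samp"                    # two-sample Kolmogorov-Smirnov
        return "chi_squared"                     # categorical or low-cardinality numerical
    if feature_type == "num" and n_unique > 5:   # large reference data
        return "wasserstein"
    return "jensen_shannon"

print(choose_stattest("num", 800, 42))   # ks_2samp
print(choose_stattest("cat", 5000, 3))   # jensen_shannon
```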

## Concept drift

The analysis compares the current target feature to the reference target feature.

Returns a concept drift summary of the following form:

```
{'timestamp': the timestamp of the report,
 'concept_drift_summary':
    {'column_name': 'column1',
     'column_type': the type of the column (e.g. num),
     'stattest_name': the statistical test that was used,
     'threshold': the threshold used based on the criteria below,
     'drift_score': the drift score based on the test,
     'drift_detected': Boolean based on the criteria below}
}
```

The logic for choosing the appropriate statistical test is based on:

- the number of observations in the reference dataset
- the number of unique values in the target (n_unique)

For small data with <= 1000 observations in the reference dataset:

- For a categorical target with n_unique > 2: the <a href="/glossary/metric-definitions/#chi-squared-test" class="external-link" target="_blank">chi-squared test</a>.
- For a binary categorical target (n_unique <= 2): the proportion difference test for independent samples based on the <a href="/glossary/metric-definitions/#z-score-for-independent-proportions" class="external-link" target="_blank">Z-score</a>.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset, the <a href="/glossary/metric-definitions/#jensenshannon-divergence" class="external-link" target="_blank">Jensen–Shannon divergence</a> is used with a threshold of 0.1.

docs/mkdocs/docs/monitor/evaluation-metrics.md

Lines changed: 14 additions & 0 deletions
# Evaluation Metrics

The target of evaluation metrics is to evaluate the quality of a machine learning model. Currently supported models are:

- Binary classification
- Multi-class classification

| Metric | Supported model |
| --------------- | ------------------|
| <a href="/glossary/metric-definitions/#confusion-matrix" class="external-link" target="_blank">**confusion matrix**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#accuracy" class="external-link" target="_blank">**accuracy**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#precision" class="external-link" target="_blank">**precision**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#recall" class="external-link" target="_blank">**recall**</a> | `binary classification` & `multi-class classification` |
| <a href="/glossary/metric-definitions/#f1-score" class="external-link" target="_blank">**f1 score**</a> | `binary classification` & `multi-class classification` |

docs/mkdocs/docs/monitor/explainability.md

Lines changed: 2 additions & 0 deletions

# Explainability

docs/mkdocs/mkdocs.yml

Lines changed: 10 additions & 1 deletion

```diff
@@ -27,12 +27,19 @@ nav:
   - Whitebox: index.md
   - Feaures: features.md
   - ML Monitoring: ml-monitoring.md
-  - Metrics: metrics.md
+  - Monitor:
+      - monitor/descriptive-statistics.md
+      - monitor/evaluation-metrics.md
+      - monitor/drift-metrics.md
+      - monitor/explainability.md
   - Tutorial - User Guide:
       - tutorial/index.md
   - Deploymemt: deployment.md
   - SDK Documantation: sdk-docs.md
+  - Glossary:
+      - glossary/metric-definitions.md
 markdown_extensions:
+  - pymdownx.arithmatex
   - pymdownx.highlight:
       anchor_linenums: true
   - admonition
@@ -50,3 +57,5 @@ extra_css:
 extra_javascript:
   - js/termynal.js
   - js/custom.js
+  - mathjax-config.js
+  - https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML
```
