ML-based Anomaly Detection for data content #990
base: main
Conversation
…ssue. UC Mlflow fully qualified name
…pytest marker for anomaly to remove warnings.
…ference
- Updated the `REF_NAME` environment variable in GitHub Actions to support both PRs and push events, improving flexibility for end-to-end tests.
- Refactored the `library_ref` fixture in `conftest.py` to automatically detect the current git branch for local testing, enhancing usability and consistency across environments.
mwojtyczka
left a comment
The complete guide doc must be restructured. It's too technical. We only need info on how to train and configure the check, and how to analyze the results. All the rest can go to Reference.
@register_rule("dataset")
def has_no_anomalies(
Public functions should be at the top of the file for easier reading.
## Customizing result columns

- By default, DQX appends `_error` and `_warning` result columns to the output DataFrame or Table to flag quality issues.
+ By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.
Suggested change:
- By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.
+ By default, DQX appends `_errors`, `_warnings` result columns to the output DataFrame or Table to flag quality issues. For certain checks, DQX also stores additional metadata produced during quality evaluation in a `_dq_info` column.
Let's call it `_dq_info`; it needs to be updated in the other places as well.
|----------------|---------|-------------------|
| `_errors` | Array of critical quality check failures | `errors` |
| `_warnings` | Array of warning-level quality check issues | `warnings` |
| `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |
Suggested change:
- | `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |
+ | `_info` | Structured metadata created when certain checks are used (e.g., anomaly detection) | `info` |
# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]
if "_info" in result_df.columns and info_col_name != "_info":
    result_df = result_df.withColumnRenamed("_info", info_col_name)
Let's only rename to `_dq_info`.
Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.
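For context, the core technique named in the doc line above can be sketched with plain scikit-learn. This is a minimal illustration only, not DQX's implementation: the feature engineering, explainability, and Spark-distributed scoring are omitted, and all names here are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative sketch: fit an Isolation Forest on "typical" rows, then score
# candidate rows. In DQX the scoring side would run distributed over Spark;
# here we score locally just to show the shape of the approach.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # well-behaved training data

model = IsolationForest(n_estimators=200, random_state=42).fit(train)

# score_samples: higher means "more normal", lower means "more anomalous"
typical_score = model.score_samples(np.array([[0.0, 0.0]]))[0]
outlier_score = model.score_samples(np.array([[10.0, 10.0]]))[0]
```

A point far from the training distribution (here `[10, 10]`) receives a lower score than a typical point, which is the signal a check like `has_no_anomalies` would threshold on.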
## Complements Databricks native monitoring
Suggested change:
- ## Complements Databricks native monitoring
+ ## Complements Databricks data quality monitoring
Let's avoid the word "native", as DQX may be native soon as well.
## Complements Databricks native monitoring

DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
Suggested change:
- DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
+ Databricks DQX's anomaly detection **works alongside** [Databricks data quality monitoring anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:

| **Feature** | **Databricks Native** | **DQX Anomaly Detection** |
Suggested change:
- | **Feature** | **Databricks Native** | **DQX Anomaly Detection** |
+ | **Feature** | **Databricks Data Quality Monitoring** | **DQX Anomaly Detection** |
Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.

## Complements Databricks native monitoring
We have this in the Reference doc already. No need to duplicate the same info. I would remove it from the guide and keep it in the reference.
2. Add DQX anomaly checks for critical tables (model-based)
3. Use both signals: late data alerts + row-level anomalies

## Production-Ready Defaults
Please move this to the Reference doc. This is way too technical for a guide.
The complete guide doc must be restructured. It's too technical. We only need info on how to train the model and configure the check, and how to analyze the results. All the rest can go to Reference.
if not run_config.input_config:
    raise InvalidConfigError("input_config is required to run the anomaly trainer workflow.")

if not anomaly_config.model_name:
I don't think user should be configuring this. This should be internal to the library.
@lru_cache(maxsize=1)
def _load_anomaly_check_funcs():
This should not be needed. Please follow the same approach as for PII_ENABLED. If there are circular dependencies, they must be solved.
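The PII_ENABLED pattern the reviewer refers to can be sketched roughly as follows. The import path, flag name, and extras name are illustrative assumptions, not DQX's actual layout:

```python
# Best-effort optional import guarded by a module-level flag, mirroring the
# PII_ENABLED / DATACONTRACT_ENABLED pattern mentioned in the review.
# The module path and the "[anomaly]" extra below are hypothetical.
try:
    from databricks.labs.dqx.anomaly.anomaly_engine import AnomalyEngine  # noqa: F401

    ANOMALY_ENABLED = True
except ImportError:
    ANOMALY_ENABLED = False


def require_anomaly_support() -> None:
    """Fail fast with an actionable message when the optional extra is missing."""
    if not ANOMALY_ENABLED:
        raise RuntimeError(
            "Anomaly detection requires optional dependencies; "
            "install the (hypothetical) extra: pip install 'databricks-labs-dqx[anomaly]'"
        )
```

This keeps the import cost and failure mode at module-load time, so call sites only check one boolean instead of wrapping lazy loaders in `lru_cache`.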
try:
    return isinstance(value, expected_type)
except TypeError:
    # For complex typing constructs (e.g., Callable, Protocol) that can't be validated at runtime,
Why do we need this? We should avoid silently ignoring type errors.
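For context, `isinstance()` raises `TypeError` for subscripted generics, which is presumably what the `try/except` guards against. A sketch of validating explicitly instead of swallowing the error (not DQX's code; the function name is illustrative) is to fall back to `typing.get_origin`:

```python
from typing import Any, Callable, get_origin


def matches_type(value: Any, expected_type: Any) -> bool:
    # Subscripted generics like list[str] or Callable[..., int] cannot be
    # passed to isinstance() directly, so check against their unsubscripted
    # origin instead of catching and ignoring TypeError. Note this does not
    # validate element types (list[str] vs list[int] both match list).
    origin = get_origin(expected_type)
    if origin is not None:
        return isinstance(value, origin)  # e.g. list[str] -> list
    return isinstance(value, expected_type)
```

This makes the "can't validate deeply at runtime" decision explicit and keeps genuine `TypeError`s from being hidden.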
num_trees: int = 200
max_depth: int | None = None
subsampling_rate: float | None = None
random_seed: int = 42
Do we need a seed by default?
columns: list[str] | None = None  # Auto-discovered if omitted
segment_by: list[str] | None = None  # Auto-discovered if omitted (when columns also omitted)
model_name: str | None = None
Suggested change: remove `model_name: str | None = None`.

Can we please not expose this to the user? They shouldn't be configuring it themselves. I'm also not sure they should be changing params or temporal config. We could keep it, but I wouldn't talk about it in the docs.
# Ensure the result DataFrame has the same columns as the input DataFrame + the new result column
return result_df.select(*df.columns, dest_col)

# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]
This is too late; if the user is using `_info` as a column name, this will break before reaching this point.
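One way to address this, sketched under the assumption that the reserved names are known up front (function and constant names are illustrative, not DQX's API), is to validate the input schema before any checks run:

```python
# Reject reserved result column names up front, before any checks run,
# instead of renaming `_info` after the fact. Names are illustrative.
RESERVED_RESULT_COLUMNS = frozenset({"_errors", "_warnings", "_dq_info"})


def validate_input_columns(input_columns: list[str]) -> None:
    """Raise early if the input DataFrame clashes with reserved result columns."""
    clashes = RESERVED_RESULT_COLUMNS.intersection(input_columns)
    if clashes:
        raise ValueError(
            f"Input DataFrame already contains reserved result column(s): {sorted(clashes)}"
        )
```

Failing fast here gives the user a clear error at the start of the run rather than a confusing break mid-pipeline.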
from databricks.labs.dqx.installer.logs import TaskLogger

# Optional anomaly detection support
try:
Please update the workflow installer to add the anomaly extras there:
`remote_wheels_with_extras = [f"{wheel}[llm,pii]" for wheel in remote_wheels]`
from databricks.labs.dqx.anomaly.anomaly_workflow import AnomalyTrainerWorkflow

ANOMALY_ENABLED = True
except Exception:
This will never be available; this code runs as part of the CLI installation, so the relevant libraries must be available. You could add them to the cli extras in pyproject.toml, but a much better solution would be to follow the pattern we have for data contracts (DATACONTRACT_ENABLED) or PII. We can do a best-effort import in the anomaly engine.
df = read_input_data(ctx.spark, run_config.input_config)

ws = WorkspaceClient()
Please follow the other workflows and take the workspace client from `ctx.workspace_client`.
DEFAULT_TRAIN_RATIO = 0.8


class AnomalyEngine(DQEngineBase):
Please move this to a separate module, `anomaly_engine`, for clarity.
Changes
This PR adds ML-based anomaly detection to DQX, enabling users to detect unusual patterns in their data that can't be caught by traditional rule-based checks.
Key features:
- `has_no_anomalies()` works like other DQX checks

What's included:
- `AnomalyEngine` for training models

Resolves #957
Tests