Conversation

@vb-dbrks vb-dbrks commented Jan 8, 2026

Changes

This PR adds ML-based anomaly detection to DQX, enabling users to detect unusual patterns in their data that can't be caught by traditional rule-based checks.

Key features:

  • Auto-discovery: Automatically selects relevant columns and creates segmented models when needed
  • Isolation Forest: Uses scikit-learn's Isolation Forest algorithm for fast, scalable anomaly detection
  • Explainability: SHAP-based feature contributions show why records were flagged
  • Unity Catalog integration: Models stored in UC with full lineage and versioning
  • New check function: has_no_anomalies() works like other DQX checks
  • Production defaults: Ensemble models (2x), 0.60 threshold, contributions enabled by default
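
To illustrate how the new check might be wired up, here is a hypothetical sketch of a check definition in DQX's metadata (dict) format. The argument names (`model`, `score_threshold`) and the Unity Catalog model name are illustrative assumptions, not taken from this PR:

```python
# Hypothetical check definition using DQX's metadata (dict) format.
# Argument names and the UC model name below are assumptions for illustration.
checks = [
    {
        "criticality": "warn",
        "check": {
            "function": "has_no_anomalies",
            "arguments": {
                "model": "main.dq.sales_anomaly_model",  # hypothetical UC model
                "score_threshold": 0.60,  # the PR's stated default threshold
            },
        },
    }
]
```

Records scoring above the threshold would then be flagged in the warning result column, with SHAP-based feature contributions surfaced alongside.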

What's included:

  • New AnomalyEngine for training models
  • Feature engineering for numeric, categorical, datetime, and boolean columns
  • Model registry with drift detection
  • Demo 101 notebook
  • Documentation updates

Resolves #957

Tests

  • manually tested (ran all demos on Databricks)
  • added unit tests (124 tests across 7 test files)
  • added integration tests (93+ tests covering training, scoring, ensemble, drift, etc.)
  • added end-to-end tests
  • added performance tests

…ference

- Updated the `REF_NAME` environment variable in GitHub Actions to support both PRs and push events, improving flexibility for end-to-end tests.
- Refactored the `library_ref` fixture in `conftest.py` to automatically detect the current git branch for local testing, enhancing usability and consistency across environments.
@mwojtyczka mwojtyczka left a comment


The complete guide doc must be restructured. It's too technical. We only need info on how to train and configure the check, and how to analyze the results. All the rest can go to Reference.



@register_rule("dataset")
def has_no_anomalies(

public functions should be on top of the file for easier reading

## Customizing result columns

By default, DQX appends `_error` and `_warning` result columns to the output DataFrame or Table to flag quality issues.
By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.

Suggested change
By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.
By default, DQX appends `_errors` and `_warnings` result columns to the output DataFrame or Table to flag quality issues. For certain checks, DQX also stores additional metadata produced during quality evaluation in a `_dq_info` column.

let's call it `_dq_info`; it needs to be updated in other places too

| **Column** | **Description** | **Config name** |
|----------------|---------|-------------------|
| `_errors` | Array of critical quality check failures | `errors` |
| `_warnings` | Array of warning-level quality check issues | `warnings` |
| `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |

Suggested change
| `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |
| `_info` | Structured metadata created when certain checks are used (e.g., anomaly detection) | `info` |

# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]
if "_info" in result_df.columns and info_col_name != "_info":
    result_df = result_df.withColumnRenamed("_info", info_col_name)

let's only rename to _dq_info


Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.

## Complements Databricks native monitoring

Suggested change
## Complements Databricks native monitoring
## Complements Databricks data quality monitoring

let's avoid the word "native", as DQX may be native soon as well


## Complements Databricks native monitoring

DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:

Suggested change
DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
Databricks DQX's anomaly detection **works alongside** [Databricks data quality monitoring anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:


DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:

| **Feature** | **Databricks Native** | **DQX Anomaly Detection** |

Suggested change
| **Feature** | **Databricks Native** | **DQX Anomaly Detection** |
| **Feature** | **Databricks Data Quality Monitoring** | **DQX Anomaly Detection** |


Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.

## Complements Databricks native monitoring

We have this in the Reference doc already. No need to duplicate the same info. I would remove it from the guide and keep it in the reference.

2. Add DQX anomaly checks for critical tables (model-based)
3. Use both signals: late data alerts + row-level anomalies

## Production-Ready Defaults

Please move this to the Reference doc. This is way too technical for a guide.

@mwojtyczka mwojtyczka left a comment


The complete guide doc must be restructured. It's too technical. We only need info on how to train the model and configure the check, and how to analyze the results. All the rest can go to Reference.

if not run_config.input_config:
    raise InvalidConfigError("input_config is required to run the anomaly trainer workflow.")

if not anomaly_config.model_name:
@mwojtyczka mwojtyczka Jan 22, 2026


I don't think the user should be configuring this. This should be internal to the library.



@lru_cache(maxsize=1)
def _load_anomaly_check_funcs():
@mwojtyczka mwojtyczka Jan 22, 2026


This should not be needed. Please follow the same approach as for PII_ENABLED. If there are circular dependencies, they must be solved.
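
The pattern the reviewer refers to can be sketched as a module-level best-effort import that sets a feature flag. The module being imported and the downstream function are illustrative assumptions, not the PR's actual code:

```python
# Best-effort optional import with a module-level feature flag, mirroring
# the PII_ENABLED pattern mentioned above. `sklearn` stands in for the
# optional anomaly dependencies; the function name is hypothetical.
try:
    import sklearn  # optional extra; may not be installed

    ANOMALY_ENABLED = True
except ImportError:
    ANOMALY_ENABLED = False


def run_anomaly_check() -> str:
    """Fail fast with a clear message when the optional extra is missing."""
    if not ANOMALY_ENABLED:
        raise RuntimeError("Install the 'anomaly' extra to use anomaly detection checks.")
    return "anomaly check would run here"
```

This keeps the import cost and the dependency optionality at the module boundary, so callers only pay for the extra when they actually use the feature.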

try:
    return isinstance(value, expected_type)
except TypeError:
    # For complex typing constructs (e.g., Callable, Protocol) that can't be validated at runtime,
@mwojtyczka mwojtyczka Jan 22, 2026


Why do we need this? We should avoid silently ignoring type errors.
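
For context, `isinstance()` raises `TypeError` when given a subscripted generic such as `list[int]`. A narrower alternative to catching `TypeError` broadly is to strip the type parameters first with `typing.get_origin`, so only the known case is handled; this sketch is illustrative, not the PR's implementation:

```python
from typing import get_origin


def matches_type(value, expected_type) -> bool:
    # isinstance() raises TypeError for subscripted generics like list[int],
    # so check against the bare origin type (list) instead of catching
    # TypeError around every check.
    origin = get_origin(expected_type)
    if origin is not None:
        return isinstance(value, origin)  # validates the container type only
    return isinstance(value, expected_type)
```

This still cannot validate the type parameters themselves (e.g., the `int` in `list[int]`), but it makes the limitation explicit rather than swallowing unrelated `TypeError`s.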

num_trees: int = 200
max_depth: int | None = None
subsampling_rate: float | None = None
random_seed: int = 42

do we need a seed by default?


columns: list[str] | None = None # Auto-discovered if omitted
segment_by: list[str] | None = None # Auto-discovered if omitted (when columns also omitted)
model_name: str | None = None
@mwojtyczka mwojtyczka Jan 22, 2026


Suggested change
model_name: str | None = None

Can we please not expose it to the user; they shouldn't be configuring this themselves. I'm also not sure they should be changing params or temporal config. We could keep it, but I wouldn't talk about it in the docs.

# Ensure the result DataFrame has the same columns as the input DataFrame + the new result column
return result_df.select(*df.columns, dest_col)
# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]

This is too late; if the user is using `_info` as a column name, this will break before reaching this point.
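
One way to address this is to fail fast before any checks run. The constant and function names in this sketch are illustrative, not from the PR:

```python
# Hypothetical fail-fast validation: reject inputs that already use the
# internal metadata column name, instead of renaming after the fact.
INTERNAL_INFO_COLUMN = "_info"  # assumed internal name from the snippet above


def validate_no_reserved_columns(input_columns: list[str]) -> None:
    """Raise early if the input schema collides with an internal column."""
    if INTERNAL_INFO_COLUMN in input_columns:
        raise ValueError(
            f"Input contains reserved column '{INTERNAL_INFO_COLUMN}'; "
            "rename it before applying checks."
        )
```

Calling this on `df.columns` at the start of check application would surface the collision with a clear error rather than a confusing downstream failure.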

from databricks.labs.dqx.installer.logs import TaskLogger

# Optional anomaly detection support
try:
@mwojtyczka mwojtyczka Jan 22, 2026


Please update the workflow installer to add the anomaly extras there:
remote_wheels_with_extras = [f"{wheel}[llm,pii]" for wheel in remote_wheels]
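
Extending the quoted line to cover an assumed `anomaly` extra might look like the following; the wheel filename is a placeholder and the extra name is an assumption:

```python
# Sketch: include a hypothetical 'anomaly' extra alongside the existing
# llm and pii extras when building remote wheel references.
remote_wheels = ["dqx-1.0.0-py3-none-any.whl"]  # placeholder filename
remote_wheels_with_extras = [f"{wheel}[llm,pii,anomaly]" for wheel in remote_wheels]
```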

    from databricks.labs.dqx.anomaly.anomaly_workflow import AnomalyTrainerWorkflow

    ANOMALY_ENABLED = True
except Exception:
@mwojtyczka mwojtyczka Jan 22, 2026


This will never be available; this code runs as part of the CLI installation, so the relevant libraries must be available. You could add them to the cli extras in pyproject.toml, but a much better solution would be to follow the pattern we have for data contracts (DATACONTRACT_ENABLED) or PII. We can try a best-effort import in the anomaly engine.


df = read_input_data(ctx.spark, run_config.input_config)

ws = WorkspaceClient()
@mwojtyczka mwojtyczka Jan 22, 2026


Please follow other workflows and take the workspace client from ctx.workspace_client.

DEFAULT_TRAIN_RATIO = 0.8


class AnomalyEngine(DQEngineBase):

Please move this to a separate module, anomaly_engine, for clarity.


Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)


Development

Successfully merging this pull request may close these issues.

[FEATURE]: ML-based Anomaly Detection for row-level (has_no_anomalies)

3 participants