ML-based Anomaly Detection for data content #990
base: main
Conversation
…ssue. UC Mlflow fully qualified name
…pytest marker for anomaly to remove warnings.
…ference
- Updated the `REF_NAME` environment variable in GitHub Actions to support both PRs and push events, improving flexibility for end-to-end tests.
- Refactored the `library_ref` fixture in `conftest.py` to automatically detect the current git branch for local testing, enhancing usability and consistency across environments.
mwojtyczka
left a comment
The complete guide doc must be restructured. It's too technical. We only need info on how to train and configure the check, and how to analyze the results. All the rest can go to Reference.
@register_rule("dataset")
def has_no_anomalies(
Public functions should be at the top of the file for easier reading.
## Customizing result columns

- By default, DQX appends `_error` and `_warning` result columns to the output DataFrame or Table to flag quality issues.
+ By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.
Suggested change:
- By default, DQX appends `_errors`, `_warnings`, and `_info` result columns to the output DataFrame or Table to flag quality issues.
+ By default, DQX appends `_errors`, `_warnings` result columns to the output DataFrame or Table to flag quality issues. For certain checks, DQX also stores additional metadata produced during quality evaluation in a `_dq_info` column.
Let's call it `_dq_info`; it needs to be updated in the other places as well.
|----------------|---------|-------------------|
| `_errors` | Array of critical quality check failures | `errors` |
| `_warnings` | Array of warning-level quality check issues | `warnings` |
| `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |
Suggested change:
- | `_info` | Structured metadata from dataset-level checks (e.g., anomaly detection) | `info` |
+ | `_info` | Structured metadata created when certain checks are used (e.g., anomaly detection) | `info` |
# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]
if "_info" in result_df.columns and info_col_name != "_info":
    result_df = result_df.withColumnRenamed("_info", info_col_name)
Let's only rename to `_dq_info`.
Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.
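For context, the core technique named in the doc line above can be sketched with plain scikit-learn. This is a minimal illustration only, not DQX's implementation: the feature engineering, explainability, and Spark-distributed scoring are omitted, and all names here are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative sketch: fit an Isolation Forest on "typical" rows, then score
# candidate rows. In DQX the scoring side would run distributed over Spark;
# here we score locally just to show the shape of the approach.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # well-behaved training data

model = IsolationForest(n_estimators=200, random_state=42).fit(train)

# score_samples: higher means "more normal", lower means "more anomalous"
typical_score = model.score_samples(np.array([[0.0, 0.0]]))[0]
outlier_score = model.score_samples(np.array([[10.0, 10.0]]))[0]
```

A point far from the training distribution (here `[10, 10]`) receives a lower score than a typical point, which is the signal a check like `has_no_anomalies` would threshold on.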
## Complements Databricks native monitoring
Suggested change:
- ## Complements Databricks native monitoring
+ ## Complements Databricks data quality monitoring
Let's avoid the word "native", as DQX may be native soon as well.
## Complements Databricks native monitoring

DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
Suggested change:
- DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
+ Databricks DQX's anomaly detection **works alongside** [Databricks data quality monitoring anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:
DQX's anomaly detection **works alongside** [Databricks native anomaly detection](https://learn.microsoft.com/en-gb/azure/databricks/data-quality-monitoring/anomaly-detection/) for comprehensive monitoring:

| **Feature** | **Databricks Native** | **DQX Anomaly Detection** |
Suggested change:
- | **Feature** | **Databricks Native** | **DQX Anomaly Detection** |
+ | **Feature** | **Databricks Data Quality Monitoring** | **DQX Anomaly Detection** |
Built on [scikit-learn's Isolation Forest](https://scikit-learn.org/) with distributed Spark scoring, DQX provides production-ready anomaly detection with automatic feature engineering and explainability.

## Complements Databricks native monitoring
We have this in the Reference doc already. No need to duplicate the same info. I would remove it from the guide and keep it in the reference.
2. Add DQX anomaly checks for critical tables (model-based)
3. Use both signals: late data alerts + row-level anomalies

## Production-Ready Defaults
Please move this to the Reference doc. This is way too technical for a guide.
The complete guide doc must be restructured. It's too technical. We only need info on how to train the model and configure the check, and how to analyze the results. All the rest can go to Reference.
if not run_config.input_config:
    raise InvalidConfigError("input_config is required to run the anomaly trainer workflow.")

if not anomaly_config.model_name:
I don't think user should be configuring this. This should be internal to the library.
@lru_cache(maxsize=1)
def _load_anomaly_check_funcs():
This should not be needed. Please follow the same approach as for PII_ENABLED. If there are circular dependencies, they must be solved.
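The PII_ENABLED pattern the reviewer refers to can be sketched roughly as follows. The import path, flag name, and extras name are illustrative assumptions, not DQX's actual layout:

```python
# Best-effort optional import guarded by a module-level flag, mirroring the
# PII_ENABLED / DATACONTRACT_ENABLED pattern mentioned in the review.
# The module path and the "[anomaly]" extra below are hypothetical.
try:
    from databricks.labs.dqx.anomaly.anomaly_engine import AnomalyEngine  # noqa: F401

    ANOMALY_ENABLED = True
except ImportError:
    ANOMALY_ENABLED = False


def require_anomaly_support() -> None:
    """Fail fast with an actionable message when the optional extra is missing."""
    if not ANOMALY_ENABLED:
        raise RuntimeError(
            "Anomaly detection requires optional dependencies; "
            "install the (hypothetical) extra: pip install 'databricks-labs-dqx[anomaly]'"
        )
```

This keeps the import cost and failure mode at module-load time, so call sites only check one boolean instead of wrapping lazy loaders in `lru_cache`.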
try:
    return isinstance(value, expected_type)
except TypeError:
    # For complex typing constructs (e.g., Callable, Protocol) that can't be validated at runtime,
Why do we need this? We should avoid silently ignoring type errors.
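For context, `isinstance()` raises `TypeError` for subscripted generics, which is presumably what the `try/except` guards against. A sketch of validating explicitly instead of swallowing the error (not DQX's code; the function name is illustrative) is to fall back to `typing.get_origin`:

```python
from typing import Any, Callable, get_origin


def matches_type(value: Any, expected_type: Any) -> bool:
    # Subscripted generics like list[str] or Callable[..., int] cannot be
    # passed to isinstance() directly, so check against their unsubscripted
    # origin instead of catching and ignoring TypeError. Note this does not
    # validate element types (list[str] vs list[int] both match list).
    origin = get_origin(expected_type)
    if origin is not None:
        return isinstance(value, origin)  # e.g. list[str] -> list
    return isinstance(value, expected_type)
```

This makes the "can't validate deeply at runtime" decision explicit and keeps genuine `TypeError`s from being hidden.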
num_trees: int = 200
max_depth: int | None = None
subsampling_rate: float | None = None
random_seed: int = 42
Do we need a seed by default?
columns: list[str] | None = None  # Auto-discovered if omitted
segment_by: list[str] | None = None  # Auto-discovered if omitted (when columns also omitted)
model_name: str | None = None
Suggested change: remove `model_name: str | None = None`.

Can we please not expose this to the user? They shouldn't be configuring it themselves. I'm also not sure they should be changing params or temporal config. We could keep it, but I wouldn't talk about it in the docs.
# Ensure the result DataFrame has the same columns as the input DataFrame + the new result column
return result_df.select(*df.columns, dest_col)

# Rename _info column to configured name if present (dataset-level checks like has_no_anomalies create it)
info_col_name = self._result_column_names[ColumnArguments.INFO]
This is too late; if the user is using `_info` as a column name, this will break before reaching this point.
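One way to address this, sketched under the assumption that the reserved names are known up front (function and constant names are illustrative, not DQX's API), is to validate the input schema before any checks run:

```python
# Reject reserved result column names up front, before any checks run,
# instead of renaming `_info` after the fact. Names are illustrative.
RESERVED_RESULT_COLUMNS = frozenset({"_errors", "_warnings", "_dq_info"})


def validate_input_columns(input_columns: list[str]) -> None:
    """Raise early if the input DataFrame clashes with reserved result columns."""
    clashes = RESERVED_RESULT_COLUMNS.intersection(input_columns)
    if clashes:
        raise ValueError(
            f"Input DataFrame already contains reserved result column(s): {sorted(clashes)}"
        )
```

Failing fast here gives the user a clear error at the start of the run rather than a confusing break mid-pipeline.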
from databricks.labs.dqx.installer.logs import TaskLogger

# Optional anomaly detection support
try:
Please update the workflow installer to add the anomaly extras there:
`remote_wheels_with_extras = [f"{wheel}[llm,pii]" for wheel in remote_wheels]`
from databricks.labs.dqx.anomaly.anomaly_workflow import AnomalyTrainerWorkflow

ANOMALY_ENABLED = True
except Exception:
This will never be available; this code runs as part of the CLI installation, so the relevant libraries must be available. You could add them to the cli extras in pyproject.toml, but a much better solution would be to follow the pattern we have for data contracts (DATACONTRACT_ENABLED) or PII. We can do a best-effort import in the anomaly engine.
df = read_input_data(ctx.spark, run_config.input_config)

ws = WorkspaceClient()
Please follow the other workflows and take the workspace client from `ctx.workspace_client`.
DEFAULT_TRAIN_RATIO = 0.8


class AnomalyEngine(DQEngineBase):
Please move this to a separate module, `anomaly_engine`, for clarity.
Changes
This PR adds ML-based anomaly detection to DQX, enabling users to detect unusual patterns in their data that can't be caught by traditional rule-based checks.
Key features:
- `has_no_anomalies()` works like other DQX checks

What's included:
- `AnomalyEngine` for training models

Resolves #957
Tests