Update handling of metadata columns during schema validation #1002
base: main
Conversation
✅ 507/507 passed, 2 flaky, 41 skipped, 3h43m22s total. Running from acceptance #3644.
Pull request overview
This PR adds support for handling metadata columns (result columns added by `DQEngine`) during schema validation by introducing an `ignore_columns` parameter to the `has_valid_schema` check function. The engine automatically adds result column names to the ignore list when performing schema validation, preventing false positives when result columns are included in the DataFrame schema.
Changes:
- Added `ignore_columns` parameter to the `has_valid_schema` function to allow excluding specific columns from schema validation (see the sketch after this list)
- Added `_get_checks_with_ignored_result_columns` method in `DQEngine` to automatically ignore result columns during schema validation
- Added integration test for the new `ignore_columns` parameter functionality
- Updated documentation to reflect the new parameter
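For orientation, a minimal usage sketch of the new parameter, modeled on the integration test quoted later on this page; the `spark` session and the DataFrame contents are illustrative assumptions:

```python
from databricks.labs.dqx.check_funcs import has_valid_schema

# "d" stands in for a result column added by DQEngine; data is illustrative.
df = spark.createDataFrame(
    [("x", 1, 1.0, "warn")],
    "a string, b int, c double, d string",
)

expected_schema = "a string, b int, c double"
# Excluding "d" keeps the extra result column from failing strict validation.
condition, apply_method = has_valid_schema(expected_schema, ignore_columns=["d"], strict=True)
```

The returned `condition` and `apply_method` pair matches what the integration test below constructs.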
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/engine.py | Added method to automatically append result column names to ignore_columns for schema validation checks, and imported check_funcs module |
| src/databricks/labs/dqx/check_funcs.py | Added ignore_columns parameter to has_valid_schema function to support excluding columns from schema comparison |
| tests/integration/test_dataset_checks.py | Added integration test for ignore_columns parameter in has_valid_schema function |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation with examples and parameter description for ignore_columns |
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1002      +/-   ##
==========================================
- Coverage   90.60%   90.53%   -0.07%
==========================================
  Files          64       64
  Lines        6526     6543      +17
==========================================
+ Hits         5913     5924      +11
- Misses        613      619       +6
```

View full report in Codecov by Sentry.
refactor
```python
if check.check_func_kwargs.get("columns"):
    return check

if check.check_func_args and len(check.check_func_args) >= 3:
```
I would remove this condition to make the method more generic; we will have to use this function for anomaly detection as well, which requires a list of columns.

There are a couple of issues with using `check_func_args` here:
- It makes an assumption about the underlying implementation, which is hard to maintain.
- It assumes the function is executed within the context of `has_valid_schema`, so there is potential for misuse.
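For illustration, a hedged sketch of a less position-dependent test (`_columns_set_explicitly` is a hypothetical helper; it assumes `DQRule` exposes `check_func`, `check_func_args`, and `check_func_kwargs` as the diff above suggests, and that the check function declares a `columns` parameter):

```python
import inspect

# Sketch only: bind the stored arguments to the check function's real signature
# instead of hard-coding "3 or more positional arguments means columns were set".
def _columns_set_explicitly(check) -> bool:
    bound = inspect.signature(check.check_func).bind_partial(
        *check.check_func_args, **check.check_func_kwargs
    )
    return "columns" in bound.arguments
```

Binding against the signature keeps the engine agnostic about whether `columns` arrived positionally or by keyword.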
```python
)

expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(expected_schema, ignore_columns=["d"], strict=True)
```
Also provide a case where the column is given as a Spark column, to increase test coverage: `ignore_columns=[F.col("d")]`.
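A minimal sketch of that extra case, reusing the expected schema from the test above (the import path follows this repository's layout):

```python
from pyspark.sql import functions as F

from databricks.labs.dqx.check_funcs import has_valid_schema

# Pass the ignored column as a Spark Column rather than a string; this assumes
# ignore_columns accepts Column objects, which is exactly what the suggested
# test case would verify.
expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(
    expected_schema, ignore_columns=[F.col("d")], strict=True
)
```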
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
```python
# Default columns to all columns of the current DataFrame if not explicitly set
if check.check_func_kwargs.get("columns"):
    return check
```
Copilot AI (Jan 22, 2026)
The early-return conditions lack comments explaining their purpose. Add a comment for lines 366-367 explaining why checks with 3 or more positional arguments are returned unmodified, similar to the comment on line 362.
Suggested change:

```python
# Respect checks that pass columns (or equivalent) as the 3rd+ positional argument;
# do not override their explicitly provided column selection.
```
```python
actual_schema = df.select(*columns).schema if columns else df.schema
selected_column_names = column_names if column_names else df.columns
if ignore_column_names:
    selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
```
Copilot AI (Jan 22, 2026)
The list comprehension on line 2019 performs O(n*m) lookups, where n is the number of selected columns and m is the number of ignore columns. Convert `ignore_column_names` to a set before the list comprehension for O(n) performance: `ignore_set = set(ignore_column_names)`, then use `if col not in ignore_set`.
Suggested change:

```diff
-selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
+ignore_set = set(ignore_column_names)
+selected_column_names = [col for col in selected_column_names if col not in ignore_set]
```
good feedback
| """Check if all elements in the checks list are instances of DQRule.""" | ||
| return all(isinstance(check, DQRule) for check in checks) | ||
|
|
||
| def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule: |
Suggested change:

```diff
-def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule:
+def _preselect_original_columns(self, df: DataFrame, check: DQRule) -> DQRule:
```

I would rename it to make it more generic, as there will be other checks we want to use it for (e.g. anomaly detection).
```python
current_df = df

for check in checks:
    normalized_check = (
```
I think we should make this more generic, since more checks will require it in the future. This will also allow the creation of custom functions that operate on the full schema.

```python
@register_rule("dataset")
@full_schema_rule  # -> register in FULL_SCHEMA_CHECK_FUNC_REGISTRY
def has_valid_schema(
    ...

@register_rule("dataset")
@full_schema_rule
def has_no_anomalies(
    ...

# any custom check function
@register_rule("dataset")
@full_schema_rule
def custom_func(
    ...
```

```python
normalized_check = (
    self._preselect_schema_validation_columns(df, check)
    if check.check_func in FULL_SCHEMA_CHECK_FUNC_REGISTRY
    else check
)
```
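A minimal sketch of that decorator, assuming the names proposed above (`full_schema_rule` and `FULL_SCHEMA_CHECK_FUNC_REGISTRY` are proposals from this thread, not existing DQX API):

```python
from typing import Callable

# Sketch only: the registry holds check functions that operate on the full
# DataFrame schema, so the engine can preselect original (non-result) columns.
FULL_SCHEMA_CHECK_FUNC_REGISTRY: set[Callable] = set()

def full_schema_rule(func: Callable) -> Callable:
    """Register a check function as operating on the full DataFrame schema."""
    FULL_SCHEMA_CHECK_FUNC_REGISTRY.add(func)
    return func  # no wrapper needed; registration is the only effect
```

Registration as a side effect of decoration lets custom check functions opt in exactly the way the built-in ones would.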
Changes

This PR adds internal methods in `DQEngine` which handle metadata columns during dataset-level checks (e.g. `has_valid_schema`).

Linked issues

Resolves #989

Tests