
Conversation

@ghanse (Contributor) commented Jan 19, 2026

Changes

This PR adds internal methods to DQEngine that handle metadata columns during dataset-level checks (e.g. has_valid_schema).

Linked issues

Resolves #989

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

github-actions bot commented Jan 19, 2026

✅ 507/507 passed, 2 flaky, 41 skipped, 3h43m22s total

Flaky tests:

  • 🤪 test_observer_custom_metrics[apply_checks_by_metadata] (1.288s)
  • 🤪 test_e2e_workflow_serverless (10m35.634s)

Running from acceptance #3644

Copilot AI left a comment

Pull request overview

This PR adds support for handling metadata columns (result columns added by DQEngine) during schema validation by introducing an ignore_columns parameter to the has_valid_schema check function. The engine automatically adds result column names to the ignore list when performing schema validation, preventing false positives when result columns are included in the DataFrame schema.

Changes:

  • Added ignore_columns parameter to has_valid_schema function to allow excluding specific columns from schema validation
  • Added _get_checks_with_ignored_result_columns method in DQEngine to automatically ignore result columns during schema validation
  • Added integration test for the new ignore_columns parameter functionality
  • Updated documentation to reflect the new parameter
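
For context, a rough sketch of how the new parameter is exercised; the has_valid_schema import path matches the changed file, but the result column names ("_errors", "_warnings") are an assumption about DQX defaults, not confirmed by this PR:

from databricks.labs.dqx.check_funcs import has_valid_schema

# Expected schema of the original input DataFrame.
expected_schema = "a string, b int, c double"

# DQX widens the DataFrame with result columns while applying checks
# (assumed here to be "_errors" and "_warnings"). Passing them via
# ignore_columns keeps strict schema validation from flagging them.
condition, apply_method = has_valid_schema(
    expected_schema,
    ignore_columns=["_errors", "_warnings"],
    strict=True,
)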

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Files reviewed:

  • src/databricks/labs/dqx/engine.py: Added method to automatically append result column names to ignore_columns for schema validation checks, and imported the check_funcs module
  • src/databricks/labs/dqx/check_funcs.py: Added ignore_columns parameter to has_valid_schema function to support excluding columns from schema comparison
  • tests/integration/test_dataset_checks.py: Added integration test for ignore_columns parameter in has_valid_schema function
  • docs/dqx/docs/reference/quality_checks.mdx: Updated documentation with examples and parameter description for ignore_columns


codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.53%. Comparing base (ae7b0d3) to head (d2f88ca).

Files with missing lines:
  • src/databricks/labs/dqx/engine.py: 81.81% patch coverage, 2 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1002      +/-   ##
==========================================
- Coverage   90.60%   90.53%   -0.07%     
==========================================
  Files          64       64              
  Lines        6526     6543      +17     
==========================================
+ Hits         5913     5924      +11     
- Misses        613      619       +6     

☔ View full report in Codecov by Sentry.

if check.check_func_kwargs.get("columns"):
    return check

if check.check_func_args and len(check.check_func_args) >= 3:
@mwojtyczka (Contributor) commented Jan 22, 2026

I would remove this condition to make it more generic. We will have to use this function for anomaly detection as well, which requires a list of columns.

There are a couple of issues with using check_func_args here:

  • it makes an assumption about the underlying implementation, which is hard to maintain.
  • it assumes the function is executed within the context of has_valid_schema, so there is potential for misuse.
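
A rough sketch of the more generic shape this suggests, keyed off keyword arguments only; the DQRule fields come from the snippets in this thread, while the dataclass-style copy is an assumption:

import dataclasses

def _preselect_original_columns(self, df, check):
    # Respect an explicit column selection passed via kwargs; this makes no
    # assumption about the check function's positional signature, so it works
    # for has_valid_schema, anomaly detection, or any future full-schema check.
    if check.check_func_kwargs.get("columns"):
        return check

    # Otherwise default to all columns of the current DataFrame.
    # (Assumes DQRule is a dataclass, so dataclasses.replace applies.)
    return dataclasses.replace(
        check,
        check_func_kwargs={**check.check_func_kwargs, "columns": df.columns},
    )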

)

expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(expected_schema, ignore_columns=["d"], strict=True)
A contributor commented:

Please also provide a case where the column is passed as a Spark Column, to increase test coverage: ignore_columns=[F.col("d")]
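
A minimal sketch of that extra case, reusing the setup from the quoted test (assertions omitted):

import pyspark.sql.functions as F
from databricks.labs.dqx.check_funcs import has_valid_schema

# Same expected schema, but the ignored column is passed as a Spark Column
# rather than a string, exercising the Column-handling code path.
expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(
    expected_schema, ignore_columns=[F.col("d")], strict=True
)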

@mwojtyczka requested a review from Copilot on January 22, 2026 at 08:30
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



# Default columns to all columns of the current DataFrame if not explicitly set
if check.check_func_kwargs.get("columns"):
    return check

Copilot AI commented Jan 22, 2026

The early return conditions lack comments explaining their purpose. Add a comment for lines 366-367 explaining why checks with 3 or more positional arguments are returned unmodified, similar to the comment on line 362.

Suggested change
# Respect checks that pass columns (or equivalent) as the 3rd+ positional argument;
# do not override their explicitly provided column selection.

actual_schema = df.select(*columns).schema if columns else df.schema
selected_column_names = column_names if column_names else df.columns
if ignore_column_names:
    selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
Copilot AI commented Jan 22, 2026

The list comprehension on line 2019 performs O(n*m) lookups where n is the number of selected columns and m is the number of ignore columns. Convert ignore_column_names to a set before the list comprehension for O(n) performance: ignore_set = set(ignore_column_names) then use if col not in ignore_set.

Suggested change:
- selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
+ ignore_set = set(ignore_column_names)
+ selected_column_names = [col for col in selected_column_names if col not in ignore_set]

A contributor commented:

good feedback

"""Check if all elements in the checks list are instances of DQRule."""
return all(isinstance(check, DQRule) for check in checks)

def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule:
@mwojtyczka (Contributor) commented Jan 22, 2026

Suggested change:
- def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule:
+ def _preselect_original_columns(self, df: DataFrame, check: DQRule) -> DQRule:

I would rename to make it more generic as there will be other checks we want to use it for (e.g. anomaly detection)

current_df = df

for check in checks:
    normalized_check = (
@mwojtyczka (Contributor) commented Jan 22, 2026

I think we should make it more generic, since more checks will require this in the future. This will also allow the creation of custom functions that operate on the full schema.

@register_rule("dataset")
@full_schema_rule  # -> register in FULL_SCHEMA_CHECK_FUNC_REGISTRY
def has_valid_schema(
    ...

@register_rule("dataset")
@full_schema_rule
def has_no_anomalies(
    ...

# any custom check function
@register_rule("dataset")
@full_schema_rule
def custom_func(
    ...

normalized_check = (
    self._preselect_schema_validation_columns(df, check)
    if check.check_func in FULL_SCHEMA_CHECK_FUNC_REGISTRY
    else check
)
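
For illustration, a minimal sketch of what such a decorator could look like; the registry name comes from the snippet above, while its type and placement are assumptions:

# Hypothetical registry of check functions that require the full original schema.
FULL_SCHEMA_CHECK_FUNC_REGISTRY: set = set()

def full_schema_rule(func):
    """Register a check function as operating on the full original schema,
    so the engine can pre-select the original (non-result) columns first."""
    FULL_SCHEMA_CHECK_FUNC_REGISTRY.add(func)
    return func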



Development

Successfully merging this pull request may close these issues.

[BUG]: has_valid_schema not handling result columns properly

3 participants