Update handling of metadata columns during schema validation #1002
base: main
Conversation
✅ 507/507 passed, 2 flaky, 41 skipped, 3h43m22s total. Running from acceptance #3644.
Pull request overview
This PR adds support for handling metadata columns (result columns added by `DQEngine`) during schema validation by introducing an `ignore_columns` parameter to the `has_valid_schema` check function. The engine automatically adds result column names to the ignore list when performing schema validation, preventing false positives when result columns are included in the DataFrame schema.
Changes:
- Added `ignore_columns` parameter to the `has_valid_schema` function to allow excluding specific columns from schema validation (see the sketch after this list)
- Added `_get_checks_with_ignored_result_columns` method in `DQEngine` to automatically ignore result columns during schema validation
- Added integration test for the new `ignore_columns` parameter functionality
- Updated documentation to reflect the new parameter
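For orientation, a minimal usage sketch of the new parameter, modeled on the integration test quoted later on this page; the `spark` session and the DataFrame contents are illustrative assumptions:

```python
from databricks.labs.dqx.check_funcs import has_valid_schema

# "d" stands in for a result column added by DQEngine; data is illustrative.
df = spark.createDataFrame(
    [("x", 1, 1.0, "warn")],
    "a string, b int, c double, d string",
)

expected_schema = "a string, b int, c double"
# Excluding "d" keeps the extra result column from failing strict validation.
condition, apply_method = has_valid_schema(expected_schema, ignore_columns=["d"], strict=True)
```

The returned `condition` and `apply_method` pair matches what the integration test below constructs.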
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/engine.py | Added method to automatically append result column names to ignore_columns for schema validation checks, and imported check_funcs module |
| src/databricks/labs/dqx/check_funcs.py | Added ignore_columns parameter to has_valid_schema function to support excluding columns from schema comparison |
| tests/integration/test_dataset_checks.py | Added integration test for ignore_columns parameter in has_valid_schema function |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation with examples and parameter description for ignore_columns |
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1002      +/-   ##
==========================================
- Coverage   90.60%   90.53%   -0.07%
==========================================
  Files          64       64
  Lines        6526     6543      +17
==========================================
+ Hits         5913     5924      +11
- Misses        613      619       +6
```

View full report in Codecov by Sentry.
refactor
```python
if check.check_func_kwargs.get("columns"):
    return check

if check.check_func_args and len(check.check_func_args) >= 3:
```
I would remove this condition to make the method more generic; we will have to use this function for anomaly detection as well, which requires a list of columns.

There are a couple of issues with using `check_func_args` here:
- It makes an assumption about the underlying implementation, which is hard to maintain.
- It assumes the function is executed within the context of `has_valid_schema`, so there is potential for misuse.
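For illustration, a hedged sketch of a less position-dependent test (`_columns_set_explicitly` is a hypothetical helper; it assumes `DQRule` exposes `check_func`, `check_func_args`, and `check_func_kwargs` as the diff above suggests, and that the check function declares a `columns` parameter):

```python
import inspect

# Sketch only: bind the stored arguments to the check function's real signature
# instead of hard-coding "3 or more positional arguments means columns were set".
def _columns_set_explicitly(check) -> bool:
    bound = inspect.signature(check.check_func).bind_partial(
        *check.check_func_args, **check.check_func_kwargs
    )
    return "columns" in bound.arguments
```

Binding against the signature keeps the engine agnostic about whether `columns` arrived positionally or by keyword.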
```python
)

expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(expected_schema, ignore_columns=["d"], strict=True)
```
Also provide a case where the column is given as a Spark column, to increase test coverage: `ignore_columns=[F.col("d")]`.
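A minimal sketch of that extra case, reusing the expected schema from the test above (the import path follows this repository's layout):

```python
from pyspark.sql import functions as F

from databricks.labs.dqx.check_funcs import has_valid_schema

# Pass the ignored column as a Spark Column rather than a string; this assumes
# ignore_columns accepts Column objects, which is exactly what the suggested
# test case would verify.
expected_schema = "a string, b int, c double"
condition, apply_method = has_valid_schema(
    expected_schema, ignore_columns=[F.col("d")], strict=True
)
```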
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
```python
# Default columns to all columns of the current DataFrame if not explicitly set
if check.check_func_kwargs.get("columns"):
    return check
```
Copilot AI (Jan 22, 2026)
The early-return conditions lack comments explaining their purpose. Add a comment for lines 366-367 explaining why checks with 3 or more positional arguments are returned unmodified, similar to the comment on line 362.
Suggested change:

```python
# Respect checks that pass columns (or equivalent) as the 3rd+ positional argument;
# do not override their explicitly provided column selection.
```
```python
actual_schema = df.select(*columns).schema if columns else df.schema
selected_column_names = column_names if column_names else df.columns
if ignore_column_names:
    selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
```
Copilot AI (Jan 22, 2026)
The list comprehension on line 2019 performs O(n*m) lookups, where n is the number of selected columns and m is the number of ignore columns. Convert `ignore_column_names` to a set before the list comprehension for O(n) performance: `ignore_set = set(ignore_column_names)`, then use `if col not in ignore_set`.
Suggested change:

```diff
-selected_column_names = [col for col in selected_column_names if col not in ignore_column_names]
+ignore_set = set(ignore_column_names)
+selected_column_names = [col for col in selected_column_names if col not in ignore_set]
```
good feedback
| """Check if all elements in the checks list are instances of DQRule.""" | ||
| return all(isinstance(check, DQRule) for check in checks) | ||
|
|
||
| def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule: |
Suggested change:

```diff
-def _preselect_schema_validation_columns(self, df: DataFrame, check: DQRule) -> DQRule:
+def _preselect_original_columns(self, df: DataFrame, check: DQRule) -> DQRule:
```

I would rename it to make it more generic, as there will be other checks we want to use it for (e.g. anomaly detection).
```python
current_df = df

for check in checks:
    normalized_check = (
```
I think we should make this more generic, since more checks will require it in the future. This will also allow the creation of custom functions that operate on the full schema.

```python
@register_rule("dataset")
@full_schema_rule  # -> register in FULL_SCHEMA_CHECK_FUNC_REGISTRY
def has_valid_schema(
    ...

@register_rule("dataset")
@full_schema_rule
def has_no_anomalies(
    ...

# any custom check function
@register_rule("dataset")
@full_schema_rule
def custom_func(
    ...
```

```python
normalized_check = (
    self._preselect_schema_validation_columns(df, check)
    if check.check_func in FULL_SCHEMA_CHECK_FUNC_REGISTRY
    else check
)
```
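A minimal sketch of that decorator, assuming the names proposed above (`full_schema_rule` and `FULL_SCHEMA_CHECK_FUNC_REGISTRY` are proposals from this thread, not existing DQX API):

```python
from typing import Callable

# Sketch only: the registry holds check functions that operate on the full
# DataFrame schema, so the engine can preselect original (non-result) columns.
FULL_SCHEMA_CHECK_FUNC_REGISTRY: set[Callable] = set()

def full_schema_rule(func: Callable) -> Callable:
    """Register a check function as operating on the full DataFrame schema."""
    FULL_SCHEMA_CHECK_FUNC_REGISTRY.add(func)
    return func  # no wrapper needed; registration is the only effect
```

Registration as a side effect of decoration lets custom check functions opt in exactly the way the built-in ones would.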
Changes

This PR adds internal methods in `DQEngine` which handle metadata columns during dataset-level checks (e.g. `has_valid_schema`).

Linked issues

Resolves #989

Tests