[FEATURE]: dataset level checks next to row level checks #1150

@ptab0211

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

DQX currently supports dataset-level checks, for example aggregate checks over the whole DataFrame. However, when a dataset-level check fails, the resulting _errors or _warnings are attached to every row in the input DataFrame.

This is technically understandable because the whole dataset violates the rule, but it creates a lot of noise when using quarantine workflows. For example, if I define a dataset-level check such as “the percentage of rows with target_label = 1 must be above a threshold,” and the threshold is not met, every row is emitted with the same warning/error. That makes the quarantine table look like every individual row is bad, even though the failure is really a table-level metric failure.
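To make the noise concrete, here is a minimal plain-Python sketch of the behaviour described above. It does not use the DQX API: the sample rows, the threshold, and the function names are all made up for illustration, and only the `_warnings` field name comes from the behaviour described in this issue.

```python
# Plain-Python illustration only; none of these names are DQX APIs.
rows = [
    {"id": 1, "target_label": 1},
    {"id": 2, "target_label": 0},
    {"id": 3, "target_label": 0},
    {"id": 4, "target_label": 0},
]

MIN_POSITIVE_RATIO = 0.5  # hypothetical threshold


def positive_ratio(rows):
    """Share of rows with target_label == 1 (a table-level metric)."""
    return sum(r["target_label"] == 1 for r in rows) / len(rows)


def attach_to_every_row(rows, threshold):
    """Mimic the reported behaviour: one failed dataset-level check
    is copied into the _warnings field of every row."""
    ratio = positive_ratio(rows)
    message = None
    if ratio < threshold:
        message = f"positive ratio {ratio:.2f} is below {threshold:.2f}"
    return [{**r, "_warnings": [message] if message else []} for r in rows]


checked = attach_to_every_row(rows, MIN_POSITIVE_RATIO)
# A quarantine step that keeps rows with any warning now keeps all of them.
quarantined = [r for r in checked if r["_warnings"]]
print(len(quarantined), "of", len(rows), "rows quarantined")
```

A single metric failure ends up flagged on all four rows, with the same message repeated on each one.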

It would be useful to have first-class dataset-level result handling, where dataset-level checks can produce a single check result per run/check instead of attaching the result to every row.
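As a rough sketch of what I have in mind (again plain Python, not an existing DQX interface), a dataset-level check could return one record per check per run, while the rows pass through without a dataset-level warning. `DatasetCheckResult`, its fields, `check_positive_ratio`, and the `run-001` id are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCheckResult:
    """Hypothetical record: one per dataset-level check per run."""
    run_id: str
    check_name: str
    passed: bool
    observed: float
    threshold: float


def check_positive_ratio(rows, threshold, run_id):
    """Evaluate the table-level metric once and return a single result."""
    if not rows:
        raise ValueError("cannot compute a ratio over an empty dataset")
    observed = sum(r["target_label"] == 1 for r in rows) / len(rows)
    return DatasetCheckResult(
        run_id=run_id,
        check_name="min_positive_ratio",
        passed=observed >= threshold,
        observed=observed,
        threshold=threshold,
    )


# Four sample rows, only the first has target_label == 1.
rows = [{"id": i, "target_label": int(i == 1)} for i in range(1, 5)]

result = check_positive_ratio(rows, threshold=0.5, run_id="run-001")
dataset_results = [result]  # would be written to a separate results output
row_output = rows           # rows carry no dataset-level warning or error
```

With this shape, the quarantine table would only contain rows that fail row-level checks, and the table-level failure would show up once in its own output.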

Proposed Solution

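A dedicated output for dataset-level checks: each dataset-level check produces a single result per run, kept separate from the per-row _errors and _warnings.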

Additional Context

No response

Labels

enhancement (New feature or request)
