Invalid Count - Add verbosity options and clarification #14

@GeekSheikh

The current Invalid Count report metric is confusing. ValidNumerics and ValidStrings rules use "collect_set", while Bounds rules use aggregates and also report back a 1. These design decisions were originally made for performance on large datasets.

Use the detailLevel parameter of RuleSet().validate to let the user specify the report detail level. Higher levels mean longer run times but more detail, which is useful during development.
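One way the Int passed to validate could be mapped onto named report levels is sketched below. The three levels, their names, and the object DetailLevels are assumptions made for illustration; the issue itself only says that higher levels cost more run time and return more detail.

```scala
// Hypothetical detail levels; names and meanings are assumptions, not part of the library.
sealed trait DetailLevel
case object PassFailOnly extends DetailLevel   // cheapest: a single pass/fail flag per rule
case object InvalidCounts extends DetailLevel  // adds the number of failing rows per rule
case object InvalidValues extends DetailLevel  // most expensive: also collects the distinct failing values

object DetailLevels {
  // Map the Int accepted by validate(detailLevel) onto a named level.
  // Anything at or below 1 falls back to the cheapest level, matching the default of 1.
  def fromInt(level: Int): DetailLevel = level match {
    case l if l <= 1 => PassFailOnly
    case 2           => InvalidCounts
    case _           => InvalidValues
  }
}
```

Keeping the default at the cheapest level would preserve the current performance on large datasets, with the slower levels being opt-in.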

/**
 * @param detailLevel -- For Future -- Perhaps faster way to just return true/false without
 * processing everything and returning a report. For big data sets, perhaps run samples
 * looking for invalids? Not sure how much faster and/or what the break-even would be
 * @return Tuple of Dataframe report and final boolean of whether all rules were passed
 */
def validate(detailLevel: Int = 1): (DataFrame, Boolean) = {

val first = collect_set(rule.inputColumn).alias(rule.ruleName)
val results = Seq(invalid.cast(LongType).alias("Invalid_Count"), failed)
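For a higher detail level, a Bounds rule could report the number of failing rows instead of a 1. The following is a minimal sketch of that idea using a conditional aggregate. The object InvalidCountSketch, the helper invalidCount, and the column retail_price are hypothetical and not taken from the library.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum, when}
import org.apache.spark.sql.types.LongType

object InvalidCountSketch {
  // Hypothetical helper: count the rows whose value falls outside [lower, upper].
  // Null values are not counted as invalid, because the comparison evaluates to
  // null and falls through to otherwise(0).
  def invalidCount(column: String, lower: Double, upper: Double): Column = {
    val isInvalid = !col(column).between(lower, upper)
    sum(when(isInvalid, lit(1)).otherwise(lit(0)))
      .cast(LongType)
      .alias("Invalid_Count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("invalid-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two of these four rows (50.0 and -3.0) fall outside the bounds [0, 20].
    val df = Seq(5.0, 12.0, 50.0, -3.0).toDF("retail_price")
    df.agg(invalidCount("retail_price", 0.0, 20.0)).show()

    spark.stop()
  }
}
```

This still runs as a single aggregation pass, so it can be combined with other rule aggregates in the same agg call. The same pattern could apply to ValidNumerics and ValidStrings by swapping the between check for an isin check.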

Labels: enhancement
