Closed
Labels
enhancement (New feature or request)
Description
The current Invalid Count report metric is confusing. ValidNumerics and ValidStrings rules use "collect_set", while Bounds rules use aggregates and also report back a 1. These design decisions were originally made for performance on large datasets.
Use the detailLevel parameter of RuleSet().validate to let the user specify the report detail level. Higher levels mean longer run times but more detail, which is useful during development.
dataframe-rules-engine/src/main/scala/com/databricks/labs/validation/RuleSet.scala
Lines 132 to 137 in 72da2c7
   * @param detailLevel -- For Future -- Perhaps faster way to just return true/false without
   * processing everything and returning a report. For big data sets, perhaps run samples
   * looking for invalids? Not sure how much faster and/or what the break-even would be
   * @return Tuple of Dataframe report and final boolean of whether all rules were passed
   */
  def validate(detailLevel: Int = 1): (DataFrame, Boolean) = {
dataframe-rules-engine/src/main/scala/com/databricks/labs/validation/Validator.scala
Lines 148 to 149 in 72da2c7
val first = collect_set(rule.inputColumn).alias(rule.ruleName)
val results = Seq(invalid.cast(LongType).alias("Invalid_Count"), failed)
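One way the proposed detailLevel switch could behave is sketched below. This is an illustration only, not the engine's implementation: the `invalidCount` helper, the `DetailLevelSketch` object, the meaning assigned to levels 1 and 2, and the sample `retail_price` data are all assumptions. It assumes Spark is on the classpath and uses only standard Spark SQL functions (`when`, `sum`, `max`).

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, max, sum, when}
import org.apache.spark.sql.types.LongType

object DetailLevelSketch {

  // Hypothetical helper, not part of the rules engine.
  // Assumed semantics: detailLevel 1 reports a 0/1 flag (any invalid row?),
  // detailLevel 2 or higher reports the exact number of invalid rows.
  def invalidCount(isInvalid: Column, detailLevel: Int): Column = {
    // 1 for each invalid row, 0 otherwise (a null condition counts as valid here)
    val marker = when(isInvalid, 1L).otherwise(0L)
    val metric = if (detailLevel >= 2) sum(marker) else max(marker)
    metric.cast(LongType).alias("Invalid_Count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("detail-level-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: two of the four values fall outside [0, 30]
    val df = Seq(5, 12, 40, -3).toDF("retail_price")
    val outOfBounds = col("retail_price") < 0 || col("retail_price") > 30

    val flag  = df.agg(invalidCount(outOfBounds, 1)).head.getLong(0)
    val exact = df.agg(invalidCount(outOfBounds, 2)).head.getLong(0)
    println(s"detailLevel=1 -> $flag, detailLevel=2 -> $exact")

    spark.stop()
  }
}
```

With a switch like this, the Invalid_Count column would keep its current flag-style meaning at the default level, and a developer could opt in to true row counts while debugging a rule. Whether the flag form is actually cheaper than the exact count on large datasets would need to be measured.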