Invalid Count - Add verbosity options and clarification #14

The current `Invalid Count` report metric is confusing. ValidNumerics and ValidStrings rules use `collect_set`, while Bounds rules use aggregates, and both report back a 1 when the rule fails. These design decisions were made initially for performance on large datasets.
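To make the confusion concrete, here is a minimal sketch in plain Scala collections (no Spark) of why both approaches can only yield a 0 or 1. The object and function names are hypothetical and are not taken from the repository; `toSet` stands in for `collect_set`, and `max` stands in for an arbitrary aggregate.

```scala
// Plain-Scala stand-in for the behaviour described above; all names are hypothetical.
object InvalidCountSketch {
  // Mimics a collect_set-style check: only distinct values survive,
  // so the number of offending rows is no longer known.
  def invalidFlagFromSet(values: Seq[Int], valid: Set[Int]): Long = {
    val distinctValues = values.toSet
    if (distinctValues.subsetOf(valid)) 0L else 1L
  }

  // Mimics a Bounds rule on an aggregate: one aggregated number is
  // compared to the bounds, giving a single pass/fail result.
  def invalidFlagFromAgg(values: Seq[Double], lower: Double, upper: Double): Long = {
    val aggregated = values.max
    if (aggregated >= lower && aggregated <= upper) 0L else 1L
  }

  // What a reader might expect "Invalid Count" to mean: the number of failing rows.
  def invalidRowCount(values: Seq[Int], valid: Set[Int]): Long =
    values.count(v => !valid.contains(v)).toLong
}
```

With the input `Seq(1, 2, 9, 9, 9)` and valid set `Set(1, 2, 3)`, the set-based flag is 1 while the row count is 3, which is the gap this issue is about.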

Use the `detailLevel` parameter of `RuleSet().validate` to let the user specify the report detail level. Higher levels mean longer run times but more detail, which is useful during development.

https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/RuleSet.scala#L132-L137

https://github.com/databrickslabs/dataframe-rules-engine/blob/72da2c71b4b3a26a57c9ff3199650a2e02923730/src/main/scala/com/databricks/labs/validation/Validator.scala#L148-L149

	* @param detailLevel -- For Future -- Perhaps faster way to just return true/false without
	* processing everything and returning a report. For big data sets, perhaps run samples
	* looking for invalids? Not sure how much faster and/or what the break-even would be
	* @return Tuple of Dataframe report and final boolean of whether all rules were passed
	*/
	def validate(detailLevel: Int = 1): (DataFrame, Boolean) = {
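One way the `detailLevel` switch could behave is sketched below in plain Scala collections (no Spark). The level semantics are an assumption made for illustration: level 1 returns a cheap pass/fail flag, and level 2 or higher returns a true count of invalid rows. `DetailLevelSketch`, `RuleResult`, and `evaluate` are hypothetical names, not part of the library.

```scala
// Hypothetical sketch of detail levels; not the library's implementation.
object DetailLevelSketch {
  final case class RuleResult(ruleName: String, invalidCount: Long, failed: Boolean)

  def evaluate(
      ruleName: String,
      values: Seq[Int],
      isValid: Int => Boolean,
      detailLevel: Int = 1
  ): RuleResult = {
    require(detailLevel >= 1, "detailLevel must be >= 1")
    if (detailLevel == 1) {
      // Assumed level 1: stop at the first invalid value and report only a 0/1 flag.
      val failed = values.exists(v => !isValid(v))
      RuleResult(ruleName, if (failed) 1L else 0L, failed)
    } else {
      // Assumed level 2+: scan every value and report the real number of invalid rows.
      val invalidRows = values.count(v => !isValid(v)).toLong
      RuleResult(ruleName, invalidRows, invalidRows > 0)
    }
  }
}
```

For example, `DetailLevelSketch.evaluate("positive", Seq(1, -2, -3), _ > 0)` reports an invalid count of 1, while the same call with `detailLevel = 2` reports 2. The trade-off matches the proposal: the higher level does more work in exchange for a count that means what its name says.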

	val first = collect_set(rule.inputColumn).alias(rule.ruleName)
	val results = Seq(invalid.cast(LongType).alias("Invalid_Count"), failed)
