Commit ba99267

Update ValidatorTestSuite (#19)
* Update Validator tests with API changes.
* Add tests for implicit and explicit expression rules.
* Import outstanding Spark SQL functions.
* Add a test suite for the Rules class.
* Add tests for the RuleSet class.
* Add a test for complex expressions on aggregates.
* Fix an isGrouped bug when the groupBys array is empty, whether by default or explicitly set.
* Fix the overloaded add function that merges two RuleSets.
* Add ignoreCase and invertMatch to the ValidateStrings and ValidateNumerics rule types.
* Update documentation with the latest features in categorical Rules.

Co-authored-by: Daniel Tomes [GeekSheikh] <10840635+geeksheikh@users.noreply.github.com>
1 parent bb76213 · commit ba99267

11 files changed: +996 additions, −298 deletions

README.md

Lines changed: 14 additions & 1 deletion
````diff
@@ -174,7 +174,8 @@ someRuleSet.addMinMaxRules("Retail_Price_Validation", col("retail_price"), Bound
 ### Categorical Rules
 There are two types of categorical rules which are used to validate against a pre-defined list of valid
 values. As of 0.2 accepted categorical types are String, Double, Int, Long but any types outside of this can
-be input as an array() column of any type so long as it can be evaulated against the intput column
+be input as an array() column of any type so long as it can be evaluated against the input column.
+
 ```scala
 val catNumerics = Array(
 Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
@@ -187,6 +188,18 @@ Rule("Valid_Regions", col("region"), Lookups.validRegions)
 )
 ```
 
+An optional `ignoreCase` parameter can be specified when evaluating against a list of String values to ignore or apply
+case-sensitivity. By default, input columns will be evaluated against a list of Strings with case-sensitivity applied.
+```scala
+Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase=true)
+```
+
+Furthermore, the evaluation of categorical rules can be inverted by specifying `invertMatch=true` as a parameter.
+This can be handy when defining a Rule that an input column cannot match a list of invalid values. For example:
+```scala
+Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch=true)
+```
+
 ### Validation
 Now that you have some rules built up... it's time to build the ruleset and validate it. As mentioned above,
 the dataframe can be a simple df or a grouped df by passing column[s] to perform validation at the
````
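The two new parameters compose on String rules. A minimal sketch of a combined rule, assuming the `Rule` String overload documented above; the rule name and value list below are hypothetical examples, not part of this commit:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical rule: passes only when region is NOT "Test"/"QA" in any casing.
// ignoreCase lower-cases both the input and the valid list before comparison;
// invertMatch negates the membership test.
val noTestRegions = Rule(
  "No_Test_Regions",   // hypothetical rule name
  col("region"),
  Array("Test", "QA"), // hypothetical value list
  ignoreCase = true,
  invertMatch = true
)
```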

demo/Example.scala

Lines changed: 10 additions & 9 deletions
```diff
@@ -50,11 +50,12 @@ object Example extends App with SparkSessionWrapper {
 
   val catNumerics = Array(
     Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
-    Rule("Valid_Skus", col("sku"), Lookups.validSkus)
+    Rule("Valid_Skus", col("sku"), Lookups.validSkus),
+    Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch=true)
   )
 
   val catStrings = Array(
-    Rule("Valid_Regions", col("region"), Lookups.validRegions)
+    Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase=true)
   )
 
   //TODO - validate datetime
@@ -76,18 +77,18 @@ object Example extends App with SparkSessionWrapper {
     .withColumn("create_dt", 'create_ts.cast("date"))
 
   // Doing the validation
-  // The validate method will return the rules report dataframe which breaks down which rules passed and which
-  // rules failed and how/why. The second return value returns a boolean to determine whether or not all tests passed
-  // val (rulesReport, passed) = RuleSet(df, Array("store_id"))
-  val (rulesReport, passed) = RuleSet(df)
+  // The validate method will return two reports - a complete report and a summary report.
+  // The complete report is verbose and will add all rule validations to the right side of the original
+  // df passed into RuleSet, while the summary report will contain all of the rows that failed one or more
+  // Rule evaluations.
+  val validationResults = RuleSet(df)
     .add(specializedRules)
     .add(minMaxPriceRules)
     .add(catNumerics)
     .add(catStrings)
-    .validate(2)
+    .validate()
 
-  rulesReport.show(200, false)
-  // rulesReport.printSchema()
+  validationResults.completeReport.show(200, false)
 
 
 }
```
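Per the comment in this diff, `validate()` now returns an object carrying both reports. A usage sketch follows; only `completeReport` appears in this commit, so the `summaryReport` accessor below is an assumed name inferred from that comment:

```scala
val validationResults = RuleSet(df)
  .add(catNumerics)
  .add(catStrings)
  .validate()

// Confirmed by this diff: the original df with one rule-evaluation struct per rule appended.
validationResults.completeReport.show(200, false)

// Assumed accessor name (not shown in this commit): rows failing one or more rules.
validationResults.summaryReport.show(200, false)
```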

demo/Rules_Engine_Examples.dbc

192 Bytes
Binary file not shown.

demo/Rules_Engine_Examples.html

Lines changed: 25 additions & 24 deletions
Large diffs are not rendered by default.

src/main/scala/com/databricks/labs/validation/Rule.scala

Lines changed: 64 additions & 8 deletions
```diff
@@ -1,11 +1,8 @@
 package com.databricks.labs.validation
 
-import com.databricks.labs.validation.utils.Structures.{Bounds, ValidationException}
+import com.databricks.labs.validation.utils.Structures.Bounds
 import org.apache.spark.sql.Column
 import org.apache.spark.sql.functions.{array, lit}
-import org.apache.spark.sql.types.BooleanType
-
-import java.util.UUID
 
 /**
  * Definition of a rule
@@ -21,6 +18,8 @@ class Rule(
   private var _validNumerics: Column = array(lit(null).cast("double"))
   private var _validStrings: Column = array(lit(null).cast("string"))
   private var _implicitBoolean: Boolean = false
+  private var _ignoreCase: Boolean = false
+  private var _invertMatch: Boolean = false
   val inputColumnName: String = inputColumn.expr.toString().replace("'", "")
 
   override def toString: String = {
@@ -47,8 +46,8 @@ class Rule(
     this
   }
 
-  private def setValidStrings(value: Array[String]): this.type = {
-    _validStrings = lit(value)
+  private def setValidStrings(value: Array[String], ignoreCase: Boolean): this.type = {
+    _validStrings = if(ignoreCase) lit(value.map(_.toLowerCase)) else lit(value)
     inputColumn.expr.children.map(_.prettyName)
     this
   }
@@ -63,6 +62,16 @@ class Rule(
     this
   }
 
+  private def setIgnoreCase(value: Boolean): this.type = {
+    _ignoreCase = value
+    this
+  }
+
+  private def setInvertMatch(value: Boolean): this.type = {
+    _invertMatch = value
+    this
+  }
+
   def boundaries: Bounds = _boundaries
 
   def validNumerics: Column = _validNumerics
@@ -73,6 +82,10 @@ class Rule(
 
   def isImplicitBool: Boolean = _implicitBoolean
 
+  def ignoreCase: Boolean = _ignoreCase
+
+  def invertMatch: Boolean = _invertMatch
+
   def isAgg: Boolean = {
     inputColumn.expr.prettyName == "aggregateexpression" ||
       inputColumn.expr.children.map(_.prettyName).contains("aggregateexpression")
@@ -114,6 +127,18 @@ object Rule {
       .setValidExpr(validExpr)
   }
 
+  def apply(
+    ruleName: String,
+    column: Column,
+    validNumerics: Array[Double],
+    invertMatch: Boolean
+  ): Rule = {
+
+    new Rule(ruleName, column, RuleType.ValidateNumerics)
+      .setValidNumerics(validNumerics)
+      .setInvertMatch(invertMatch)
+  }
+
   def apply(
     ruleName: String,
     column: Column,
@@ -122,6 +147,19 @@ object Rule {
 
     new Rule(ruleName, column, RuleType.ValidateNumerics)
       .setValidNumerics(validNumerics)
+      .setInvertMatch(false)
+  }
+
+  def apply(
+    ruleName: String,
+    column: Column,
+    validNumerics: Array[Long],
+    invertMatch: Boolean
+  ): Rule = {
+
+    new Rule(ruleName, column, RuleType.ValidateNumerics)
+      .setValidNumerics(validNumerics.map(_.toString.toDouble))
+      .setInvertMatch(invertMatch)
   }
 
   def apply(
@@ -132,6 +170,19 @@ object Rule {
 
     new Rule(ruleName, column, RuleType.ValidateNumerics)
      .setValidNumerics(validNumerics.map(_.toString.toDouble))
+      .setInvertMatch(false)
+  }
+
+  def apply(
+    ruleName: String,
+    column: Column,
+    validNumerics: Array[Int],
+    invertMatch: Boolean
+  ): Rule = {
+
+    new Rule(ruleName, column, RuleType.ValidateNumerics)
+      .setValidNumerics(validNumerics.map(_.toString.toDouble))
+      .setInvertMatch(invertMatch)
   }
 
   def apply(
@@ -142,16 +193,21 @@ object Rule {
 
     new Rule(ruleName, column, RuleType.ValidateNumerics)
       .setValidNumerics(validNumerics.map(_.toString.toDouble))
+      .setInvertMatch(false)
   }
 
   def apply(
     ruleName: String,
     column: Column,
-    validStrings: Array[String]
+    validStrings: Array[String],
+    ignoreCase: Boolean = false,
+    invertMatch: Boolean = false
   ): Rule = {
 
     new Rule(ruleName, column, RuleType.ValidateStrings)
-      .setValidStrings(validStrings)
+      .setValidStrings(validStrings, ignoreCase)
+      .setIgnoreCase(ignoreCase)
+      .setInvertMatch(invertMatch)
   }
 
 }
```
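Taken together, the new `apply` overloads resolve as in this non-authoritative sketch (`Lookups` values come from `utils/Structures.scala` below and are assumed to be in scope):

```scala
import org.apache.spark.sql.functions.col

// Array[Int] overload without the flag: delegates to setInvertMatch(false).
val validSkuRule = Rule("Valid_Skus", col("sku"), Lookups.validSkus)

// Array[Int] overload with the explicit flag: the membership test is negated.
val invalidSkuRule = Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch = true)

// Array[String] overload: ignoreCase and invertMatch are default parameters (both false).
val regionRule = Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase = true)
```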

src/main/scala/com/databricks/labs/validation/RuleSet.scala

Lines changed: 10 additions & 9 deletions
```diff
@@ -35,7 +35,7 @@ class RuleSet extends SparkSessionWrapper {
 
   private def setGroupByCols(value: Seq[String]): this.type = {
     _groupBys = value
-    _isGrouped = true
+    _isGrouped = value.nonEmpty
     this
   }
 
@@ -110,15 +110,16 @@ class RuleSet extends SparkSessionWrapper {
   }
 
   /**
-   * Merge two rule sets by adding one rule set to another
-   *
-   * @param ruleSet RuleSet to be added
-   * @return RuleSet
-   */
+    * Merge two rule sets by adding one rule set to another
+    *
+    * @param ruleSet RuleSet to be added
+    * @return RuleSet
+    */
   def add(ruleSet: RuleSet): RuleSet = {
-    new RuleSet().setDF(ruleSet.getDf)
-      .setIsGrouped(ruleSet.isGrouped)
-      .add(ruleSet.getRules)
+    val addtnlGroupBys = ruleSet.getGroupBys diff this.getGroupBys
+    val mergedGroupBys = this.getGroupBys ++ addtnlGroupBys
+    this.add(ruleSet.getRules)
+      .setGroupByCols(mergedGroupBys)
   }
 
   /**
```
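A sketch of the fixed merge semantics, using the `RuleSet(df, Array(...))` constructor shown in the demo comments; the column names and rule arrays are illustrative:

```scala
val storeRules  = RuleSet(df, Array("store_id")).add(minMaxPriceRules)
val regionRules = RuleSet(df, Array("store_id", "region")).add(catStrings)

// add(ruleSet) now merges into `this` instead of building a fresh RuleSet:
// groupBys become Seq("store_id", "region") ("store_id" is deduplicated via `diff`),
// and isGrouped stays true because the merged groupBys are non-empty.
val merged = storeRules.add(regionRules)
```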

src/main/scala/com/databricks/labs/validation/Validator.scala

Lines changed: 5 additions & 2 deletions
```diff
@@ -29,16 +29,19 @@ class Validator(ruleSet: RuleSet, detailLvl: Int) extends SparkSessionWrapper {
           rule.inputColumn.cast("string").alias("actual")
         ).alias(rule.ruleName)
       case RuleType.ValidateNumerics =>
+        val ruleExpr = if(rule.invertMatch) not(array_contains(rule.validNumerics, rule.inputColumn)) else array_contains(rule.validNumerics, rule.inputColumn)
         struct(
           lit(rule.ruleName).alias("ruleName"),
-          array_contains(rule.validNumerics, rule.inputColumn).alias("passed"),
+          ruleExpr.alias("passed"),
           rule.validNumerics.cast("string").alias("permitted"),
           rule.inputColumn.cast("string").alias("actual")
         ).alias(rule.ruleName)
       case RuleType.ValidateStrings =>
+        val ruleValue = if(rule.ignoreCase) lower(rule.inputColumn) else rule.inputColumn
+        val ruleExpr = if(rule.invertMatch) not(array_contains(rule.validStrings, ruleValue)) else array_contains(rule.validStrings, ruleValue)
         struct(
           lit(rule.ruleName).alias("ruleName"),
-          array_contains(rule.validStrings, rule.inputColumn).alias("passed"),
+          ruleExpr.alias("passed"),
           rule.validStrings.cast("string").alias("permitted"),
           rule.inputColumn.cast("string").alias("actual")
         ).alias(rule.ruleName)
```
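The evaluation semantics above can be illustrated standalone with plain Spark SQL functions (the column name and value list are examples; note the valid-string list itself is lower-cased earlier, in `Rule.setValidStrings`):

```scala
import org.apache.spark.sql.functions.{array_contains, col, lit, lower, not}

// As built by setValidStrings(..., ignoreCase = true): the list is already lower-cased.
val validStrings = lit(Array("northeast", "southeast"))

// ignoreCase = true: lower-case the input column before the membership test.
val ruleValue = lower(col("region"))

// invertMatch = true: negate the membership test, so "passed" means NOT in the list.
val passed = not(array_contains(validStrings, ruleValue))
```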

src/main/scala/com/databricks/labs/validation/utils/Structures.scala

Lines changed: 3 additions & 1 deletion
```diff
@@ -13,7 +13,9 @@ object Lookups {
 
   final val validRegions = Array("Northeast", "Southeast", "Midwest", "Northwest", "Southcentral", "Southwest")
 
-  final val validSkus = Array(123456, 122987,123256, 173544, 163212, 365423, 168212)
+  final val validSkus = Array(123456, 122987, 123256, 173544, 163212, 365423, 168212)
+
+  final val invalidSkus = Array(9123456, 9122987, 9123256, 9173544, 9163212, 9365423, 9168212)
 
 }
```
