Issue #243 - Add CustomSql analyzer#244
Conversation
| self.assertEqual(self.CountDistinct("b"), [Row(value=1.0)]) | ||
|
|
||
| def test_CustomSql(self): | ||
| self.df.createOrReplaceTempView("input_table") |
There was a problem hiding this comment.
EDGE_CASE: The test test_CustomSql relies on self.df.createOrReplaceTempView("input_table") being called within the test method, but the helper method CustomSql (line 115) calls self.AnalysisRunner.onData(self.df) which registers the DataFrame for analysis but does NOT register the temp view. If test_CustomSql runs before any other test that might register the view, it works. However, test_fail_CustomSql and test_fail_CustomSql_incorrect_query do NOT call createOrReplaceTempView, so they depend on the temp view persisting from test_CustomSql. If test execution order changes (e.g., running test_fail_CustomSql_incorrect_query in isolation), the temp view won't exist and the test will fail for the wrong reason (missing table, not incorrect query).
Line 309:
self.df.createOrReplaceTempView("input_table")is only intest_CustomSql. Lines 320-321 (test_fail_CustomSql) and 324-325 (test_fail_CustomSql_incorrect_query) use"input_table"without registering the view. ThesetUpClassmethod (lines 39-47 in full source) does not register a temp view.
| @pytest.mark.xfail(reason="@unittest.expectedFailure") | ||
| def test_fail_CustomSql(self): | ||
| self.assertEqual(self.CustomSql("SELECT SUM(b) FROM input_table"), [Row(value=1.0)]) | ||
|
|
There was a problem hiding this comment.
MISSING_TEST: The test test_fail_CustomSql_incorrect_query uses SELECT SUM(b) without a FROM clause. This is expected to fail, but the xfail marker means the test passes if an exception is raised. However, there's no test that verifies the correct failure mode (e.g., that a specific exception type is raised or that the metric result is NaN/empty). A proper negative test should use pytest.raises or check the specific error behavior rather than relying on any exception being acceptable.
Line 324-325:
@pytest.mark.xfail(reason="@unittest.expectedFailure")followed byself.CustomSql("SELECT SUM(b)")- this will pass as long as ANY exception is raised, not specifically validating the expected behavior for an invalid SQL query.
| :return self | ||
| """ | ||
| return self._deequAnalyzers.CustomSql(self.expression, self.disambiguator) | ||
|
|
There was a problem hiding this comment.
EDGE_CASE: The CustomSql analyzer passes self.expression and self.disambiguator directly to the Scala CustomSql constructor. However, looking at the Deequ Scala source for CustomSql, the constructor signature is CustomSql(expression: String, disambiguator: Option[String]) in some versions. If the Scala API expects an Option[String] for disambiguator, passing a plain Python string would cause a Py4J type mismatch error at runtime.
Line 385:
return self._deequAnalyzers.CustomSql(self.expression, self.disambiguator)passesself.disambiguatoras a raw string. Other analyzers in this file wrap optional parameters withself._jvm.scala.Option.apply(...)(e.g., line 181 for ApproxCountDistinct:self._jvm.scala.Option.apply(self.where)). If the Scala CustomSql expectsOption[String], this will fail at runtime.
Issue #, if available:
#243
Description of changes:
Add a Python wrapper for the existing CustomSql analyzer from the Scala implementation
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.