Skip to content

Issue #243 - Add CustomSql analyzer#244

Open
wwitkowski wants to merge 4 commits into
awslabs:masterfrom
wwitkowski:feature/add-customsql-analyzer
Open

Issue #243 - Add CustomSql analyzer#244
wwitkowski wants to merge 4 commits into
awslabs:masterfrom
wwitkowski:feature/add-customsql-analyzer

Conversation

@wwitkowski
Copy link
Copy Markdown

Issue #, if available:
#243

Description of changes:
Add a Python wrapper for the existing CustomSql analyzer from the Scala implementation

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@wwitkowski wwitkowski marked this pull request as ready for review May 30, 2025 11:04
@wwitkowski wwitkowski changed the title Add CustomSql analyzer Issue #243 - Add CustomSql analyzer May 30, 2025
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: db2249a9) — may not be fully accurate. Reply if this doesn't help.

Comment thread tests/test_analyzers.py
self.assertEqual(self.CountDistinct("b"), [Row(value=1.0)])

def test_CustomSql(self):
self.df.createOrReplaceTempView("input_table")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDGE_CASE: The test test_CustomSql relies on self.df.createOrReplaceTempView("input_table") being called within the test method, but the helper method CustomSql (line 115) calls self.AnalysisRunner.onData(self.df) which registers the DataFrame for analysis but does NOT register the temp view. If test_CustomSql runs before any other test that might register the view, it works. However, test_fail_CustomSql and test_fail_CustomSql_incorrect_query do NOT call createOrReplaceTempView, so they depend on the temp view persisting from test_CustomSql. If test execution order changes (e.g., running test_fail_CustomSql_incorrect_query in isolation), the temp view won't exist and the test will fail for the wrong reason (missing table, not incorrect query).

Line 309: self.df.createOrReplaceTempView("input_table") is only in test_CustomSql. Lines 320-321 (test_fail_CustomSql) and 324-325 (test_fail_CustomSql_incorrect_query) use "input_table" without registering the view. The setUpClass method (lines 39-47 in full source) does not register a temp view.

Comment thread tests/test_analyzers.py
@pytest.mark.xfail(reason="@unittest.expectedFailure")
def test_fail_CustomSql(self):
self.assertEqual(self.CustomSql("SELECT SUM(b) FROM input_table"), [Row(value=1.0)])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MISSING_TEST: The test test_fail_CustomSql_incorrect_query uses SELECT SUM(b) without a FROM clause. This is expected to fail, but the xfail marker means the test passes if an exception is raised. However, there's no test that verifies the correct failure mode (e.g., that a specific exception type is raised or that the metric result is NaN/empty). A proper negative test should use pytest.raises or check the specific error behavior rather than relying on any exception being acceptable.

Line 324-325: @pytest.mark.xfail(reason="@unittest.expectedFailure") followed by self.CustomSql("SELECT SUM(b)") - this will pass as long as ANY exception is raised, not specifically validating the expected behavior for an invalid SQL query.

Comment thread pydeequ/analyzers.py
:return self
"""
return self._deequAnalyzers.CustomSql(self.expression, self.disambiguator)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDGE_CASE: The CustomSql analyzer passes self.expression and self.disambiguator directly to the Scala CustomSql constructor. However, looking at the Deequ Scala source for CustomSql, the constructor signature is CustomSql(expression: String, disambiguator: Option[String]) in some versions. If the Scala API expects an Option[String] for disambiguator, passing a plain Python string would cause a Py4J type mismatch error at runtime.

Line 385: return self._deequAnalyzers.CustomSql(self.expression, self.disambiguator) passes self.disambiguator as a raw string. Other analyzers in this file wrap optional parameters with self._jvm.scala.Option.apply(...) (e.g., line 181 for ApproxCountDistinct: self._jvm.scala.Option.apply(self.where)). If the Scala CustomSql expects Option[String], this will fail at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant