[SPARK-52226] [SQL] Strengthen unusual equality checks in three operators #50949
+4
−4
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR proposed to make equality checks stricter in BatchScanExec, ContinuousScanExec, and MicroBatchScanExec, by making two objects non-equal if they are of different concrete classes.
Why are the changes needed?
e97ab1d#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38 introduced
equals
functions to a few v2 data source operators, while none of the other operators hasequals
override.This means equivalence checks of BatchScanExec, ContinuousScanExec, and MicroBatchScanExec are much looser than all other operators. It doesn't seem to be intentional; it looks like an overlook to me - different operators should follow the same set of basic contracts if possible, if not, they shall not be too different from each other. Notably, the original author also left a TODO to "unify" them.
Now we live in a world where most operators have strictest equivalence checks, while a few operators have loose equivalence checks. What could go wrong? Well, since Spark is extensible, it is possible to inherit Spark's operators with modified runtime implementation while delivering same results. In fact, that's what https://github.com/apache/incubator-gluten project does, whereas (most) Spark operators are inherited by Gluten operators. Given the loose equivalence checks of Spark operators, we could end up declaring equivalence between a Spark operator and a Gluten operator.
If Spark starts with a clear contract that operators are "equal" as long as they deliver same results (like the abstract util classes in JDK), it would be probably fine. Now we live in a world where most operators don't do this except for the 3 operators I mentioned above. This is very easy to miss, and has caused unexpected behavior/bugs in downstream applications.
Does this PR introduce any user-facing change?
No, unless "user" means downstream applications that integrates with Spark (like Gluten).
How was this patch tested?
No. Unittests, if any, would be to test that two instances of subclasses of
BatchScanExec
(let's sayBatchScanExec
andMockBatchScanExec
) are not equivalent, but that is kinda obvious just from the code itself.Was this patch authored or co-authored using generative AI tooling?
No.