[SPARK-52226] [SQL] Strengthen unusual equality checks in three operators #50949

li-boxuan · 2025-05-20T05:40:59Z

What changes were proposed in this pull request?

This PR proposes to make equality checks stricter in BatchScanExec, ContinuousScanExec, and MicroBatchScanExec, by treating two objects as non-equal if they are of different concrete classes.

Why are the changes needed?

e97ab1d#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38 introduced equals overrides for a few v2 data source operators, while none of the other operators overrides equals.

This means the equivalence checks of BatchScanExec, ContinuousScanExec, and MicroBatchScanExec are much looser than those of all other operators. This doesn't seem to be intentional; it looks like an oversight to me. Different operators should follow the same set of basic contracts if possible, and if not, they should not differ too much from each other. Notably, the original author also left a TODO to "unify" them.

Now we live in a world where most operators have the strictest equivalence checks, while a few operators have loose ones. What could go wrong? Since Spark is extensible, it is possible to inherit from Spark's operators and modify the runtime implementation while delivering the same results. In fact, that is what the https://github.com/apache/incubator-gluten project does: (most) Spark operators are extended by Gluten operators. Given the loose equivalence checks of these Spark operators, we could end up declaring a Spark operator and a Gluten operator equivalent.

If Spark had started with a clear contract that operators are "equal" as long as they deliver the same results (like the abstract utility classes in the JDK), it would probably be fine. Instead, most operators don't do this, except for the 3 operators mentioned above. This is very easy to miss, and it has caused unexpected behavior and bugs in downstream applications.
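To illustrate the concern, here is a minimal sketch using hypothetical stand-in classes (LooseScan and NativeLooseScan are made up for this example and are not Spark or Gluten classes). It assumes an operator whose equals only compares fields, and a downstream subclass that keeps the same fields but would run differently:

```scala
object LooseEqualitySketch {
  // Hypothetical stand-in for an operator whose equals only compares fields
  // and never looks at the concrete class.
  class LooseScan(val batch: String) {
    override def equals(other: Any): Boolean = other match {
      case o: LooseScan => batch != null && batch == o.batch
      case _ => false
    }
    override def hashCode(): Int = java.util.Objects.hashCode(batch)
  }

  // Hypothetical downstream subclass: same fields, different runtime implementation.
  class NativeLooseScan(b: String) extends LooseScan(b)

  def main(args: Array[String]): Unit = {
    val vanilla = new LooseScan("t1")
    val offloaded = new NativeLooseScan("t1")

    // The two operators are declared equal although their classes differ.
    println(vanilla == offloaded) // prints true

    // Any lookup keyed by the operator can silently return the other one's entry.
    val cache = Map[LooseScan, String](vanilla -> "vanilla")
    println(cache.get(offloaded)) // prints Some(vanilla)
  }
}
```

Anything that relies on operator equality, such as a lookup keyed by the operator, could then treat one operator as a substitute for the other.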

Does this PR introduce any user-facing change?

No, unless "user" means downstream applications that integrate with Spark (like Gluten).

How was this patch tested?

Not tested. A unit test, if any, would check that an instance of BatchScanExec and an instance of a subclass (let's say MockBatchScanExec) are not equivalent, but that is fairly obvious from the code itself.
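For reference, a rough sketch of what such a check could look like, written against simplified stand-ins rather than the real operators. StrictScan and MockStrictScan are hypothetical, and their fields are placeholders that only mimic the shape of the equals method touched by this PR:

```scala
object StrictEqualitySketch {
  // Hypothetical, simplified stand-in for a non-final case class operator
  // with a hand-written equals that also compares the concrete class.
  case class StrictScan(batch: String, runtimeFilters: Seq[String]) {
    override def equals(other: Any): Boolean = other match {
      case o: StrictScan =>
        this.getClass == o.getClass && this.batch != null &&
          this.batch == o.batch && this.runtimeFilters == o.runtimeFilters
      case _ => false
    }
    override def hashCode(): Int = java.util.Objects.hash(batch, runtimeFilters)
  }

  // A regular class may extend a case class as long as the case class is not final.
  class MockStrictScan(b: String, f: Seq[String]) extends StrictScan(b, f)

  def main(args: Array[String]): Unit = {
    val base = StrictScan("t1", Seq("f1"))
    val mock = new MockStrictScan("t1", Seq("f1"))

    assert(base == StrictScan("t1", Seq("f1")))         // same class, same fields
    assert(base != mock)                                // different concrete class
    assert(mock != base)                                // and symmetric
    assert(mock == new MockStrictScan("t1", Seq("f1"))) // subclass instances still match
  }
}
```

Comparing getClass on both sides keeps the check symmetric, which a plain type match on the parent class alone would not guarantee once subclasses override equals.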

Was this patch authored or co-authored using generative AI tooling?

No.

li-boxuan · 2025-05-20T05:44:28Z

Hi @cloud-fan , since this PR modifies some code you introduced in e97ab1d#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38, would you mind taking a look? Thanks!

li-boxuan · 2025-05-20T17:25:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala

@@ -48,8 +48,8 @@ case class BatchScanExec(
  // TODO: unify the equal/hashCode implementation for all data source v2 query plans.
  override def equals(other: Any): Boolean = other match {
    case other: BatchScanExec =>
-      this.batch != null && this.batch == other.batch &&
-          this.runtimeFilters == other.runtimeFilters &&
+      this.getClass == other.getClass && this.batch != null &&


Even though this is a case class, it has no final modifier, so it can still be extended. It seems to me that we should either mark this class as final or explicitly check class equality here to prevent misuse.

github-actions bot added the SQL label May 20, 2025

Strengthen data source v2 operators' equality checks

35329bd

li-boxuan force-pushed the fix-v2-data-source-equals branch from e5afeda to 35329bd May 20, 2025 05:43

li-boxuan changed the title ~~[SPARK-52226] Strengthen unusual equality checks in three operators~~ [SPARK-52226] [SQL] Strengthen unusual equality checks in three operators May 20, 2025
