[SPARK-45926][SQL] Implementing equals and hashCode which takes into account pushed runtime filters , in InMemoryTable related scans #49153
base: master
Conversation
Previous comments: Do you happen to know why we didn't catch the following? Do you think you can provide test coverage for that? If these classes implement equals and hashCode taking the pushed runtime filters into account, we would see that TPCDS Q14b, which should ideally be reusing the exchange containing the Union, is not doing so, due to multiple bugs that surface in AQE.
@dongjoon-hyun I think the reason the exchange-reuse issue was not caught is a mix of several things: Spark is not testing with any concrete DataSourceV2 implementation (like Iceberg). Though I ought to point out that while running my test I also hit the issue of computeStats being called twice (which throws an error only in testing). I have not debugged that yet, and I am not sure the assertion that computeStats occurs only once is maintainable. Added a bug test in PR #49152.
@@ -491,6 +492,16 @@ abstract class InMemoryBaseTable(
      }
    }
  }

  override def equals(other: Any): Boolean = other match {
    case imbs: InMemoryBatchScan => this.readSchema == imbs.readSchema &&
Hi @ahshahid , I am not from Databricks, and unfortunately I am not familiar enough with the Spark code to review this PR, but I do encounter some problems with equals overrides in Spark in general, so I am dropping my point here to see if you have any ideas.
Any reason you don't add
case imbs: InMemoryBatchScan => this.getClass == imbs.getClass && ...
to the equals check? Would this have a negative implication for your use case?
The reason I want a stricter equals check is the project https://github.com/apache/incubator-gluten. In Gluten, which is a middle layer between Spark and native engines, (most) operators inherit from Spark operators. If we don't do a class-equivalence check here, a Spark operator and a Gluten (native) operator would be regarded as equal.
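A minimal, hypothetical sketch of that scenario (VanillaScan and NativeScan are illustrative stand-ins, not real Spark or Gluten classes): a pattern-match-only equals treats a subclass instance as equal to its parent, while a getClass guard would not.

```scala
// Hypothetical stand-in for a Spark operator with a pattern-match-based equals.
class VanillaScan(val readSchema: String) {
  override def equals(other: Any): Boolean = other match {
    // No getClass comparison: any subclass of VanillaScan matches here too.
    case that: VanillaScan => this.readSchema == that.readSchema
    case _ => false
  }
  override def hashCode(): Int = readSchema.hashCode
}

// Hypothetical stand-in for a Gluten-style native operator extending the Spark one.
class NativeScan(schema: String) extends VanillaScan(schema)

object EqualsDemo extends App {
  val vanilla = new VanillaScan("id INT")
  val native  = new NativeScan("id INT")
  println(vanilla == native)                   // true: the two operators are regarded as equal
  println(vanilla.getClass == native.getClass) // false: a getClass guard would tell them apart
}
```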
Hi @li-boxuan ,
In most cases the Spark classes are Scala case classes, and it is usually not recommended to override equals and hashCode of case classes unless there is a specific requirement (for example, you want to exclude certain member fields from the equality check, or you need special handling).
As for why there is no
case imbs: InMemoryBatchScan => this.getClass == imbs.getClass &&
it is because that is already accomplished.
The code snippet here implies that the "this" instance is an InMemoryBatchScan (because we are in its equals method), and the case imbs: InMemoryBatchScan ensures that "other" is also of type InMemoryBatchScan,
which accomplishes what you are hinting at.
Regarding the issue you are hitting: that would only be possible where the Spark operator classes are not case classes (and there are only a few exceptions), since case classes cannot be extended, right?
So if you can provide some more details on where the match is happening incorrectly, that may shed more light on the issue.
that would only be possible where the Spark operator classes are not case classes (and there are only a few exceptions), since case classes cannot be extended, right?
Aha, thanks. I am new to Scala and didn't notice the difference between case classes and normal classes. You are right, this is a case class, which presumably shouldn't be extended, even though the Scala compiler doesn't prohibit you from doing that. Gluten, as a downstream application, uses a plugin to explicitly prohibit that behavior, so we are good. Not an issue then. I did see a problem with some other classes, but it didn't happen with case classes (thanks to the plugin).
That being said, it is probably safe to add such a check, because, after all, Scala doesn't prohibit one from (wrongly) inheriting from case classes. But anyway, not an issue for my use case.
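A tiny illustrative sketch of that last point (the class names are hypothetical): Scala does allow a plain class to extend a case class, and the compiler-generated equals then treats parent and child instances as equal whenever their constructor fields match.

```scala
// Hypothetical example, not a Spark class.
case class Scan(readSchema: String)

// Legal (though discouraged): a non-case subclass of a case class.
class ExtendedScan(schema: String) extends Scan(schema)

object CaseClassEqualsDemo extends App {
  // true: the generated equals of Scan accepts the subclass instance.
  println(Scan("id INT") == new ExtendedScan("id INT"))
}
```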
What changes were proposed in this pull request?
Implement equals and hashCode in InMemoryBatchScan and InMemoryV2FilterBatchScan so that the pushed runtime filters are taken into account.
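As a rough illustration of the idea (not the exact code in this PR; ExampleBatchScan and its String/Seq field types are simplified stand-ins for the real test-scan classes and Spark types), the shape of such an equals/hashCode pair is:

```scala
// Illustrative sketch only: the field types are simplified stand-ins for the
// real schema, data, and pushed-filter types used by the InMemoryTable test scans.
class ExampleBatchScan(
    val readSchema: String,
    val tableData: Seq[String],
    var runtimeFilters: Seq[String]) {

  override def equals(other: Any): Boolean = other match {
    case that: ExampleBatchScan =>
      this.readSchema == that.readSchema &&
        this.tableData == that.tableData &&
        // Without this clause, two scans that received different pushed runtime
        // filters would still compare equal, which can hide exchange-reuse bugs.
        this.runtimeFilters == that.runtimeFilters
    case _ => false
  }

  override def hashCode(): Int =
    Seq(readSchema, tableData, runtimeFilters).hashCode()
}
```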
Why are the changes needed?
These test classes need the change so that the issue of exchange reuse not happening in AQE when runtime filters are pushed down to the scans can be reproduced. Currently the tests either do not actually validate that exchange reuse happens when dynamic pruning expressions are pushed to the scan level in partition-key-based joins, or they pass falsely because the equality check of the test scans does not take runtime filters into account.
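A very rough sketch of the kind of assertion such a test relies on, assuming a SparkSession named spark and a partitioned-join query string named query; the exact way to unwrap the AQE plan and locate reused exchanges varies across Spark versions, so treat this only as the general shape:

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.{AdaptiveSparkPlanExec, QueryStageExec}
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

// Collect ReusedExchangeExec nodes, descending into AQE query stages, because
// query stages are leaf nodes and a plain plan.collect would not look inside them.
def collectReusedExchanges(plan: SparkPlan): Seq[ReusedExchangeExec] = {
  val here = plan match {
    case r: ReusedExchangeExec => Seq(r)
    case _ => Seq.empty
  }
  val below = plan match {
    case a: AdaptiveSparkPlanExec => collectReusedExchanges(a.executedPlan)
    case s: QueryStageExec        => collectReusedExchanges(s.plan)
    case _                        => plan.children.flatMap(collectReusedExchanges)
  }
  here ++ below
}

val df = spark.sql(query)
df.collect() // materialize, so that AQE can finalize the physical plan
assert(collectReusedExchanges(df.queryExecution.executedPlan).nonEmpty,
  "expected the exchange to be reused when runtime filters are pushed to the scans")
```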
Does this PR introduce any user-facing change?
No
How was this patch tested?
This is a test-class change. It is possible that existing tests may fail until the dependent fixes are made in the source code.
Was this patch authored or co-authored using generative AI tooling?
No