Skip to content

[SPARK-45926][SQL] Implementing equals and hashCode which takes into account pushed runtime filters , in InMemoryTable related scans #49153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

ahshahid
Copy link

What changes were proposed in this pull request?

Implementing equals and hashCode in the InMemoryBatchScan and InMemoryV2FilterBatchScan so that the pushed runtime filters are taken into account.

Why are the changes needed?

These test classes need the change so that the issue of re-use of exchange not happening in AQE when runtime filters are pushed to scan levels, can be reproduced. Currently either the tests are not actually validating the reuse of exchange happening when dynamic pruning expressions are pushed to scan levels in partitioned key based joins, or they are false passing because the equality of the test scans are not taking into account runtime filters.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is test class change., It is possibe that the existing tests may fail till the dependent fixes are made in source code.

Was this patch authored or co-authored using generative AI tooling?

no

…pushed runtime filters , in InMemoryTable related scans
@github-actions github-actions bot added the SQL label Dec 11, 2024
@ahshahid
Copy link
Author

ahshahid commented Dec 11, 2024

Previous comments:

Do you happen to know why we didn't catch the following? Do you think you can provide a test coverage for that?

If these classes implement equals and hashCode taking into account the pushed runtime filters, we would see that TPCDS Q14b which should ideally be reusing the exchange containing Union , is not happening due to multiple bugs which surface in AQE.

@dongjoon-hyun I think the reason for not catching the issue of reuse of exchange is a mix of multiple things

Spark is not testing with any concrete DataSourceV2 implementation. ( like iceberg)
The simulation of DataSourceV2 impl using InMemoryTableScan is buggy because of equals / hashcode not taking into account pushed runtime filters, as a result any reuse of exchange bug would not be caught ( i.e mismatch of cached exchange plans would not be detected, giving a false assurance of re-use0
If I am not wrong, the tpcds tests are run using Hive as DataSource and not sure if it supports push down of runtime filters.
The bug in AQE only shows in TPCDS if table are partitioned and equi join involves partitioning column. I am not sure if right now various tpcds tests use partitioned table or not.
Yes I have been able to reproduce the issue using InMemoryTableScans as DataSourceV2 impl for tpcds tests. I will checkin a prototype test for reproducing the bug using q14b and if needed all other queries of tpcds can be run.

Though I ought to point out that while running my test I also hit the issue of computeStats being called twice ( which throws error only in testing). I have not debugged that... yet. And not sure if the assertion of computeStats occuring only once is maintainable..

added bug test in PR #49152

@@ -491,6 +492,16 @@ abstract class InMemoryBaseTable(
}
}
}

override def equals(other: Any): Boolean = other match {
case imbs: InMemoryBatchScan => this.readSchema == imbs.readSchema &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ahshahid , I am not from Databricks, and unfortunately I am not familar with Spark code enough to review this PR, but I do encounter some problems with equals overrides in Spark in general, so dropping my point here and see if you have any idea.

Any reason you don't add

case imbs: InMemoryBatchScan => this.getClass == imbs.getClass && ...

to the equals check? Would this have a negative implication to your use case?

The reason I wanted stricter equals check is because of the project https://github.com/apache/incubator-gluten. In Gluten, which is a middle layer between Spark and native engines, (most) operators inherit Spark operators. If we don't do class equivalence check here, a Spark operator and a Gluten (native) operator would be regarded as equal.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @li-boxuan ,
In most of the cases, the spark classes are case classes of scala and usually its not recommended to override equals and hashCode of case classes, unless there is a specific requirement ( like you want to exclude certain member fields from equality check or need to do special handling ).
As for the reason why there is no
case imbs: InMemoryBatchScan => this.getClass == imbs.getClass &&
is because that its already getting accomplished.
The code snippet here implies that the "this" instance is InMemoryBatchScan ( because we are in its equal's method), and the case imbs: InMemoryBatchScan ensures that "other" is also of type InMemoryBatchScan.
which accomplishes what you are hinting at.

Regarding the issue which you are hitting , that would be possible only in those cases where the Spark Operator classes are not case classes (& there are only some exceptions) and case classes cannot get extended , right?

So if you provide some more details as to where the match is happening incorrectly , may shed some more light on the issue.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be possible only in those cases where the Spark Operator classes are not case classes (& there are only some exceptions) and case classes cannot get extended , right?

Aha, thanks. I am new to scala and didn't notice the difference between case classes and normal classes. You are right, this is a case class, which presumably shouldn't be extended - even though Scala compiler doesn't prohibit you from doing that. Gluten as a downstream application uses a plugin to explicitly prohibit that behavior, so we are good. Not an issue then. I saw an issue with some other classes, but it didn't happen with case class (thanks to the plugin).

That being said, it's probably safe to add such a check, because after all, Scala doesn't prohibit one from (wrongly) inheriting case classes. But anyways, not an issue for my use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants