-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-24402] [SQL] Optimize In
expression when only one element in the collection or collection is empty
#21442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #91218 has finished for PR 21442 at commit
|
@@ -191,4 +206,21 @@ class OptimizeInSuite extends PlanTest { | |||
|
|||
comparePlans(optimized, correctAnswer) | |||
} | |||
|
|||
test("OptimizedIn test: In empty list gets transformed to " + | |||
"If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) when value is nullable") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you make the title simpler?
Thanks for the work. One question; do we have actual performance changes with/without this pr? |
// When v is not nullable, the following expression will be optimized | ||
// to FalseLiteral which is tested in OptimizeInSuite.scala | ||
If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) | ||
case In(v, Seq(elem @ Literal(_, _))) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be moved inside case expr @ In(v, list) if expr.inSetConvertible
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a bug when the Literal is a struct. See the test failure: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91218/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you suggested, I'll move into case expr @ In(v, list) if expr.inSetConvertible
.
Yes, I saw the test failure. The same code was passing the build in my another PR. 1332406 I will debug it tomorrow.
Thanks,
@maropu I didn't do a full performance benchmark. I believe the performance gain can be from predicate pushdown when only one element in the set. This can be a lot. I forgot which one, I remember in impala or drill, they allow multiple predicates to be pushed down, and I believe this can be a win in many cases. |
Test build #91243 has finished for PR 21442 at commit
|
case expr @ In(v, list) if expr.inSetConvertible => | ||
val newList = ExpressionSet(list).toSeq | ||
if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { | ||
if (newList.length == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When list.length == 1
, we don't need to create ExpressionSet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too minor, I'd like to keep the current code and not break the code flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
retest this please |
LGTM, pending jenkins |
Test build #91258 has finished for PR 21442 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { | ||
if (newList.length == 1) { | ||
EqualTo(v, newList.head) | ||
} else if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: size => length
because we use length
in the previous if
if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { | ||
if (newList.length == 1) { | ||
EqualTo(v, newList.head) | ||
} else if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { | ||
val hSet = newList.map(e => e.eval(EmptyRow)) | ||
InSet(v, HashSet() ++ hSet) | ||
} else if (newList.size < list.size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: In line 235 the comment
// newList.length == list.length
can be updated as
// newList.length == list.length && newList.length > 1
case expr @ In(v, list) if expr.inSetConvertible => | ||
val newList = ExpressionSet(list).toSeq | ||
if (newList.size > SQLConf.get.optimizerInSetConversionThreshold) { | ||
if (newList.length == 1) { | ||
EqualTo(v, newList.head) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail, since the schema mismatches when the data type is struct. The test cases were added a few days ago. #21425
retest this please |
@HyukjinKwon The code has a bug. Please read the discussion before you trigger the test. |
@gatorsmile, can I just see if it the build passes against the latest build? |
@HyukjinKwon All the involved reviewers will get a ping. This is annoying to see many pings within one hour, right? My suggestion is to read the comments before triggering the test |
This should get updated, right? I should give a ping here anyway. I wonder why triggering retesting so matters for some PRs. Probably, you are mostly talking about "ok to test" that I left for the old PRs which gave pings for guys. In this case, I think the root cause is that we started to block the Jenkins tests for some old PRs for the reason I don't know. This already gives a lot of pings though so far. If the author is willing to update its PR, then it's a blocker again to ask Jenkins build. |
Also, see the stale PRs and see what we have delayed to take a look. Please consider partly this is what we should have checked and reviewed earlier, and/or partly authors haven't updated their PRs. Both cases need pings, right? This is what we should do but we couldn't. Mostly these were inactive for more than a month. |
@HyukjinKwon Thank you for trying to trigger the tests but it will not help if you got many pings within one hour. To save the resource, we need to do more investigation before blindly triggering the test. |
I am not blindly triggering the test. I skimmed and only re-triggered tests some PRs while I am skimming stale PRs. For "ok to test", I haven't also blindly re-triggered. I did where any committer initially triggered or I see some values while skimming stale PRs. Getting pings is annoying of course but the root cause is we didn't check the PRs so far diligently, or authors were being inactive. |
If you want to help this, we should ping the reviewers/committers based on the priority of these PRs. Also, we should not trigger too many pings within a short period. Please reduce the size of the batch to 10s per day? We need to seriously consider which new features we really want to introduce in the upcoming release. We only have 15 days left before the code freeze of Spark 2.4. |
Of course, the time and priority matter but there are pending PRs queued up due to the time and priority matter so far. Shall we check the PRs and see if there are important ones for Spark 2.4 before the branch is cut out since the release is being close? Otherwise, most of them will probably be missed by the time and priority matter again. I don't think I have done ping and retriggering things for roughly about the last year in such batch. |
Also, the root cause of getting a lot of pings is that we somehow started to block the Jenkins build for some old PRs. Probably I missed some threads in dev mailing list but I still don't know who and why started this. In a way to be honest, I kind of felt getting annoyed by this too somehow probably by the similar reason you gave, even though I haven't expressed this so far for my reasons above. |
@HyukjinKwon Currently, in my opinion, the highest priority PRs include Parquet nested column pruning (https://github.com/apache/spark/pull/21320/files), new built-in avro, high-order functions, data correctness issue of Shuffle+Repartition on RDD, gang scheduling, the tasks of native K8S support, and so on. If you have bandwidth, please also help the reviews in these PRs. |
@dbtsai I think this PR can be unblocked if we only turn |
Test build #93076 has finished for PR 21442 at commit
|
@HyukjinKwon thanks for bringing this to my attention. @gatorsmile I thought the bug is found by this PR, and not in this PR. This PR is blocked until SPARK-24443 is addressed. I'll unblocck this PR by turning |
Test build #93130 has finished for PR 21442 at commit
|
Test build #93136 has finished for PR 21442 at commit
|
retest this please. |
LGTM. The test failure is not related to this PR. Thanks! Merged to master. |
Test build #93143 has finished for PR 21442 at commit
|
Seems like a related test failure? cc @dbtsai |
This hasn't get any test pass even once. The test is broken by this commit: Before:
After:
Reverting this since this will probably break the build and this PR hasn't got any successful build in any event. |
@dbtsai can you open a new PR? thanks! |
I opened a new PR at https://github.com/apache/spark/pull/21797/files Will work on the test issue there. Thanks. |
… dates ## What changes were proposed in this pull request? This PR optimizes `InSet` expressions for byte, short, integer, date types. It is a follow-up on PR #21442 from dbtsai. `In` expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. `InSet` is an optimized version of `In`, which is supposed to improve the performance if all values are literals and the number of elements is big enough. However, `InSet` actually worsens the performance in many cases due to various reasons. The main idea of this PR is to use Java `switch` statements to significantly improve the performance of `InSet` expressions for bytes, shorts, ints, dates. All `switch` statements are compiled into `tableswitch` and `lookupswitch` bytecode instructions. We will have O\(1\) time complexity if our case values are compact and `tableswitch` can be used. Otherwise, `lookupswitch` will give us O\(log n\). Locally, I tried Spark `OpenHashSet` and primitive collections from `fastutils` in order to solve the boxing issue in `InSet`. Both options significantly decreased the memory consumption and `fastutils` improved the time compared to `HashSet` from Scala. However, the switch-based approach was still more than two times faster even on 500+ non-compact elements. I also noticed that applying the switch-based approach on less than 10 elements gives a relatively minor improvement compared to the if-else approach. Therefore, I placed the switch-based logic into `InSet` and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will be still faster unless the number of elements is really big. Another option is to have a separate `InSwitch` expression. However, this would mean we need to modify other places (e.g., `DataSourceStrategy`). See [here](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10) and [here](https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch) for more information. This PR does not cover long values as Java `switch` statements cannot be used on them. However, we can have a follow-up PR with an approach similar to binary search. ## How was this patch tested? There are new tests that verify the logic of the proposed optimization. The performance was evaluated using existing benchmarks. This PR was also tested on an EC2 instance (OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64, Intel(R) Xeon(R) CPU E5-2686 v4 2.30GHz). ## Notes - [This link](http://hg.openjdk.java.net/jdk8/jdk8/langtools/file/30db5e0aaf83/src/share/classes/com/sun/tools/javac/jvm/Gen.java#l1153) contains source code that decides between `tableswitch` and `lookupswitch`. The logic was re-used in the benchmarks. See the `isLookupSwitch` method. Closes #23171 from aokolnychyi/spark-26205. Lead-authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Two new rules in the logical plan optimizers are added.
Collection
, thephysical plan will be optimized to
EqualTo
, so predicatepushdown can be used.
Collection
is empty, and the input is nullable, thelogical plan will be simplified to
TODO:
we should still allow predicate pushdown.
In
usingtableswitch
orlookupswitch
when the numbers of the categories are low, and they are
Int
,Long
.should do benchmark for using different set implementation for faster
query.
filter(if (condition) null else false)
can be optimized to false.How was this patch tested?
Couple new tests are added.