-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17641][SQL] Collect_list/Collect_set should not collect null values. #15208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @mengxr |
Test build #65806 has finished for PR 15208 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending jenkins
@@ -65,7 +65,10 @@ abstract class Collect extends ImperativeAggregate { | |||
} | |||
|
|||
override def update(b: MutableRow, input: InternalRow): Unit = { | |||
buffer += child.eval(input) | |||
val value = child.eval(input) | |||
if (value != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great to add a comment here that this mimics the hive semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Test build #65992 has finished for PR 15208 at commit
|
retest this please |
Test build #66000 has finished for PR 15208 at commit
|
Merging in master/2.0. |
…alues. ## What changes were proposed in this pull request? We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method. ## How was this patch tested? Added a regression test to `DataFrameAggregateSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15208 from hvanhovell/SPARK-17641. (cherry picked from commit 7d09232) Signed-off-by: Reynold Xin <rxin@databricks.com>
What changes were proposed in this pull request?
We added native versions of
collect_set
andcollect_list
in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to theCollect.update
method.How was this patch tested?
Added a regression test to
DataFrameAggregateSuite
.