Skip to content

[SPARK-17641][SQL] Collect_list/Collect_set should not collect null values. #15208

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

We added native versions of collect_set and collect_list in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the Collect.update method.

How was this patch tested?

Added a regression test to DataFrameAggregateSuite.

@hvanhovell
Copy link
Contributor Author

cc @mengxr

@SparkQA
Copy link

SparkQA commented Sep 23, 2016

Test build #65806 has finished for PR 15208 at commit 37c4539.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@sameeragarwal sameeragarwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending jenkins

@@ -65,7 +65,10 @@ abstract class Collect extends ImperativeAggregate {
}

override def update(b: MutableRow, input: InternalRow): Unit = {
buffer += child.eval(input)
val value = child.eval(input)
if (value != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to add a comment here that this mimics the hive semantics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SparkQA
Copy link

SparkQA commented Sep 27, 2016

Test build #65992 has finished for PR 15208 at commit 80b2166.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 27, 2016

Test build #66000 has finished for PR 15208 at commit 80b2166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Sep 28, 2016

Merging in master/2.0.

asfgit pushed a commit that referenced this pull request Sep 28, 2016
…alues.

## What changes were proposed in this pull request?
We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, this is different from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method.

## How was this patch tested?
Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15208 from hvanhovell/SPARK-17641.

(cherry picked from commit 7d09232)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 7d09232 Sep 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants