[SPARK-17528][SQL][followup] remove unnecessary data copy in object hash aggregate #18712

cloud-fan · 2017-07-22T12:23:23Z

What changes were proposed in this pull request?

In #18483, we fixed the data copy bug when saving into InternalRow and removed all workarounds for this bug in the aggregate code path. However, the object hash aggregate was missed; this PR fixes it.

This patch is also a requirement for #17419, which shows that the DataFrame version is slower than the RDD version because of this issue.
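
For readers unfamiliar with the hazard behind such workarounds, here is a minimal sketch in plain Python. It is an illustrative model only, not Spark code: `AggBuffer`, `run`, `copy_on_write`, and `caller_copies` are hypothetical names, and the sketch assumes a producer that reuses one mutable row object for every record. Once the write into the buffer copies the data itself, a second copy made by the caller is redundant.

```python
class AggBuffer:
    """Hypothetical aggregation buffer that keeps the last value it was given."""

    def __init__(self, copy_on_write):
        self.copy_on_write = copy_on_write
        self.slot = None

    def update(self, value):
        # With copy_on_write, the buffer owns its data after the write,
        # so callers do not need to copy the input beforehand.
        self.slot = list(value) if self.copy_on_write else value


def run(copy_on_write, caller_copies):
    shared = [0]  # one mutable "row" reused by the producer for every record
    buf = AggBuffer(copy_on_write)
    shared[0] = 1
    buf.update(list(shared) if caller_copies else shared)
    shared[0] = 2  # the producer overwrites the reused row for the next record
    return buf.slot[0]


unsafe = run(copy_on_write=False, caller_copies=False)     # stale reference sees 2
workaround = run(copy_on_write=False, caller_copies=True)  # caller-side copy keeps 1
fixed = run(copy_on_write=True, caller_copies=False)       # buffer copies once, keeps 1
redundant = run(copy_on_write=True, caller_copies=True)    # two copies, same result
```

In this model, `fixed` and `redundant` return the same value, so the caller-side copy in `redundant` only adds cost.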

How was this patch tested?

existing tests

cloud-fan · 2017-07-22T12:23:51Z

cc @liancheng @WeichenXu123

SparkQA · 2017-07-22T14:52:37Z

Test build #79866 has finished for PR 18712 at commit 887260a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123

Looks good! This will be helpful for #17419.
I will benchmark with the patch later.

liancheng · 2017-07-24T17:17:59Z

Nice, didn't know that the copy issue had already been fixed.

LGTM, merging to master.

## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Test cases added.

## Performance

Resolves several performance issues in apache#17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in apache#18712, thanks to liancheng and cloud-fan.

### Performance data

(Tested on my laptop, using 2 partitions; tries out = 20, warm up = 10.)

Test results are in records per millisecond (higher is better).

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu <WeichenXu123@outlook.com>

Closes apache#18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
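
To make the performance figures above easier to compare, the following sketch computes relative throughput from the reported numbers. The values are copied from the table; the helper `ratios` and the variable names are introduced here for illustration only.

```python
# Throughput in records/ms, copied from the performance table (higher is better).
vector_sizes = [1, 10, 100, 1000, 10000]
dataframe = [15149, 7441, 2118, 224, 21]
rdd_from_dataframe = [4992, 4440, 2328, 320, 33]
raw_rdd = [53931, 20683, 3966, 528, 53]


def ratios(numer, denom):
    """Element-wise ratio numer/denom, rounded to two decimals."""
    return [round(n / d, 2) for n, d in zip(numer, denom)]


# Below 1.0 means the DataFrame version is slower than the baseline.
df_vs_raw = ratios(dataframe, raw_rdd)
df_vs_rdd_from_df = ratios(dataframe, rdd_from_dataframe)

for size, a, b in zip(vector_sizes, df_vs_raw, df_vs_rdd_from_df):
    print(f"vector size {size}: DataFrame/raw RDD = {a}, "
          f"DataFrame/RDD from DataFrame = {b}")
```

On these numbers, the DataFrame version stays below raw RDD throughput at every vector size. It is faster than the RDD built from a DataFrame for vector sizes 1 and 10, and slower from size 100 upward.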

remove unnecessary data copy in object hash aggregate

887260a

WeichenXu123 reviewed Jul 22, 2017

asfgit closed this in 8666433 Jul 24, 2017

WeichenXu123 mentioned this pull request Aug 1, 2017

[SPARK-19634][ML] Multivariate summarizer - dataframes API #18798

Closed
