[SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset #11841

rxin · 2016-03-19T06:29:00Z

What changes were proposed in this pull request?

Previously, Dataset.groupBy returned a GroupedData, and Dataset.groupByKey returned a GroupedDataset. The names are very similar and unfortunately do not convey the real differences between the two.

Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contains |K| + |V| fields. This pull request therefore renames GroupedData to RelationalGroupedDataset and GroupedDataset to KeyValueGroupedDataset.
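To make the difference between the two styles concrete, here is a minimal sketch. It is an illustration rather than code from this patch: it assumes a Spark 2.x or later build on the classpath (it uses `SparkSession`, which is not part of this pull request), and the `Sale` case class, the `GroupingStyles` object, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, product: String, amount: Long)

object GroupingStyles {

  // Hypothetical sample data.
  def sample(spark: SparkSession): Dataset[Sale] = {
    import spark.implicits._
    Seq(
      Sale("east", "a", 10L),
      Sale("east", "b", 5L),
      Sale("west", "a", 7L)
    ).toDS()
  }

  // Relational style: groupBy yields a RelationalGroupedDataset. After
  // aggregation the result is flat: one column per grouping key, followed
  // by one column per aggregate.
  def relational(sales: Dataset[Sale]): DataFrame =
    sales
      .groupBy(col("region"), col("product"))
      .agg(sum(col("amount")).as("total"))

  // Key-value style: groupByKey yields a KeyValueGroupedDataset[K, V].
  // The result has exactly two fields, the key and the value, even though
  // the key here is itself a pair.
  def keyValue(
      spark: SparkSession,
      sales: Dataset[Sale]): Dataset[((String, String), Long)] = {
    import spark.implicits._
    sales
      .groupByKey(s => (s.region, s.product))
      .mapGroups { (key, rows) => (key, rows.map(_.amount).sum) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("grouping-styles")
      .getOrCreate()

    val sales = sample(spark)
    relational(sales).printSchema()      // flat: key columns plus the aggregate
    keyValue(spark, sales).printSchema() // two top-level fields: key and value

    spark.stop()
  }
}
```

With two grouping keys and one aggregate, the relational result has three top-level columns, while the key-value result still has two, with the composite key nested inside the first.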

This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.

How was this patch tested?

This is a rename to improve API understandability. It should be covered by the existing tests.

rxin · 2016-03-19T06:30:02Z

cc @liancheng and @sameeragarwal

SparkQA · 2016-03-19T06:41:05Z

Test build #53606 has finished for PR 11841 at commit 24eaf42.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-03-19T06:52:22Z

project/MimaExcludes.scala

@@ -315,6 +315,7 @@ object MimaExcludes {
        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame"),
        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame$"),
        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.LegacyFunctions"),
+        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.GroupedDataset"),


Don't we need to handle GroupedData here?

mima didn't complain.

liancheng · 2016-03-19T06:52:48Z

LGTM except for one MiMa check question.

SparkQA · 2016-03-19T08:26:21Z

Test build #53607 has finished for PR 11841 at commit 3620da9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-03-19T13:53:29Z

retest this please

SparkQA · 2016-03-19T16:04:01Z

Test build #53617 has finished for PR 11841 at commit 3620da9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-19T18:22:49Z

Thanks - merging in master.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#11841 from rxin/SPARK-13897.

[SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset

24eaf42

mima

3620da9

liancheng reviewed Mar 19, 2016

asfgit closed this in dcaa016 Mar 19, 2016
