[SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset #11841

Status: Closed (2 commits)

Contributor

@rxin rxin commented Mar 19, 2016

What changes were proposed in this pull request?

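Previously, Dataset.groupBy returned a GroupedData, and Dataset.groupByKey returned a GroupedDataset. The names are very similar and unfortunately do not convey the real differences between the two. This pull request renames them to RelationalGroupedDataset and KeyValueGroupedDataset, respectively.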

Assume we are grouping by some keys (K). groupByKey is a key-value style group-by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group-by, in which the schema of the returned dataset is flattened and contains |K| + |V| fields.
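The difference in result schema can be illustrated with a small sketch. It is not taken from this patch: it assumes a Spark 2.x build with a local SparkSession, and the `Sale` case class, the `GroupingStyles` object, and the column aliases `n` and `avg_amount` are hypothetical names chosen for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, count}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, product: String, amount: Double)

object GroupingStyles {
  // Relational style: two grouping keys plus two aggregates
  // come back as four flat columns (|K| + |V| = 2 + 2).
  def relational(sales: Dataset[Sale]): DataFrame =
    sales
      .groupBy("region", "product")
      .agg(count("amount").as("n"), avg("amount").as("avg_amount"))

  // Key-value style: the composite key is one field and the
  // computed value is the other, so the result has two fields.
  def keyValue(sales: Dataset[Sale]): Dataset[((String, String), Double)] = {
    import sales.sparkSession.implicits._ // encoders for the key and result tuples
    sales
      .groupByKey(s => (s.region, s.product))
      .mapGroups((key, rows) => (key, rows.map(_.amount).sum))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("grouping-styles")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(
      Sale("east", "a", 10.0),
      Sale("east", "a", 20.0),
      Sale("west", "b", 5.0)
    ).toDS()

    relational(sales).printSchema() // flat columns: keys, then aggregates
    keyValue(sales).printSchema()   // two fields: key struct and value
    spark.stop()
  }
}
```

In this sketch the relational result has the columns region, product, n, and avg_amount, while the key-value result has only the two tuple fields, with the (region, product) key nested inside the first one.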

This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.

How was this patch tested?

This is a rename to make the API easier to understand. It should be covered by the existing tests.

Contributor Author

rxin commented Mar 19, 2016

cc @liancheng and @sameeragarwal

SparkQA commented Mar 19, 2016

Test build #53606 has finished for PR 11841 at commit 24eaf42.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -315,6 +315,7 @@ object MimaExcludes {
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame$"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.LegacyFunctions"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.GroupedDataset"),
Contributor

Don't we need to handle GroupedData here?

Contributor Author

MiMa didn't complain.

@liancheng
Contributor

LGTM except for one MiMa check question.

SparkQA commented Mar 19, 2016

Test build #53607 has finished for PR 11841 at commit 3620da9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

retest this please

SparkQA commented Mar 19, 2016

Test build #53617 has finished for PR 11841 at commit 3620da9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

rxin commented Mar 19, 2016

Thanks - merging in master.

@asfgit asfgit closed this in dcaa016 Mar 19, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Reynold Xin <rxin@databricks.com>

Closes apache#11841 from rxin/SPARK-13897.