[SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset #11841
cc @liancheng and @sameeragarwal
Test build #53606 has finished for PR 11841 at commit
@@ -315,6 +315,7 @@ object MimaExcludes {
 ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame"),
 ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame$"),
 ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.LegacyFunctions"),
+ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.GroupedDataset"),
Don't we need to handle GroupedData here?
mima didn't complain.
LGTM except for one MiMA check question.
Test build #53607 has finished for PR 11841 at commit
retest this please
Test build #53617 has finished for PR 11841 at commit
Thanks - merging in master.
Author: Reynold Xin <rxin@databricks.com>
Closes apache#11841 from rxin/SPARK-13897.
## What changes were proposed in this pull request?
Previously, Dataset.groupBy returned a GroupedData, and Dataset.groupByKey returned a GroupedDataset. The two names are very similar and unfortunately do not convey the real difference between the two.
Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contains |K| + |V| fields. The new names reflect this distinction: groupBy now returns a RelationalGroupedDataset, and groupByKey returns a KeyValueGroupedDataset.
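To make the two schema shapes concrete, here is a minimal sketch that groups the same data both ways. It assumes a Spark 2.x or later build (where `SparkSession` is available) running in local mode; the `Sale` case class, its fields, and the `GroupingStyles` object are hypothetical names used only for illustration and are not part of this pull request.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, KeyValueGroupedDataset, RelationalGroupedDataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, product: String, amount: Long)

object GroupingStyles {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session is acceptable for the demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-styles")
      .getOrCreate()
    import spark.implicits._

    val sales: Dataset[Sale] = Seq(
      Sale("EU", "a", 10L),
      Sale("EU", "a", 5L),
      Sale("US", "b", 7L)
    ).toDS()

    // Relational style: the two grouping columns stay as separate top-level
    // columns next to the aggregate, so the result has |K| + |V| = 2 + 1 columns
    // (region, product, sum(amount)).
    val relational: RelationalGroupedDataset = sales.groupBy($"region", $"product")
    val flat: DataFrame = relational.sum("amount")
    flat.printSchema()

    // Key-value style: the key is a single value (here a tuple), and the result
    // is a dataset of (key, value) pairs, so it has exactly two top-level fields.
    val keyed: KeyValueGroupedDataset[(String, String), Sale] =
      sales.groupByKey(s => (s.region, s.product))
    val pairs: Dataset[((String, String), Long)] =
      keyed.mapGroups((key, rows) => (key, rows.map(_.amount).sum))
    pairs.printSchema()

    spark.stop()
  }
}
```

Comparing the two `printSchema()` outputs shows the difference the rename is meant to convey: the relational result keeps each key column at the top level, while the key-value result nests the whole key in its first field.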
This pull request also removes the experimental tag from RelationalGroupedDataset. It has been part of the DataFrame API since 1.3, and we have enough confidence now to stabilize it.
## How was this patch tested?
This is a rename to improve API understandability. It should be covered by all existing tests.