Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: change grouping expressions in AggregateRel to references #706

Merged

Conversation

ingomueller-net
Copy link
Contributor

@ingomueller-net ingomueller-net commented Sep 13, 2024

BREAKING CHANGE: This PR changes the definition of grouping sets in AggregateRel to consist of references into a list of grouping expressions instead of consisting of expressions directly.

With the previous definition, consumers had to deduplicate the expressions in the grouping sets in order to execute the query or even derive the output schema (which is problematic, as explained below). With this change, the responsibility of deduplicating expressions is now on the producer. Concretely, consumers are now expected to be simpler: The list of grouping expressions immediately provides the information needed to derive the output schema and the list of grouping sets explicitly and unambiguously provides the equality of grouping expressions. Producers now have to specify the grouping sets explicitly. If their internal representation of grouping sets consists of full grouping expressions (rather than references), then they must deduplicate these expressions according to their internal notion of expression equality in order to produce grouping sets consisting of references to these deduplicated expressions.

If the previous format is desired, it can be obtained from the new format by (1) deduplicating the grouping expressions (according to the previously applicable definition of expression equality), (2) re-establishing the duplicates using the emit clause, and (3) "dereferencing" the references in the grouping sets, i.e., by replacing each reference in the grouping sets with the expression it refers to.

The previous version was problematic because it required the consumers to deduplicate the expressions from the grouping sets. This, in turn, requires to parse and understand 100% of these expression even in cases where that understanding is otherwise optional, which is in opposition to the general philosophy of allowing for simple-minded consumers. The new version avoids that problem and, thus, allows consumers to be simpler. The issues are discussed in even more detail in #700.

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@ingomueller-net ingomueller-net changed the title [WIP] feat: change grouping expressions in AggregateRel to references. [WIP] feat: change grouping expressions in AggregateRel to references Sep 13, 2024
@ingomueller-net ingomueller-net force-pushed the grouping-expression-references branch from 64e9e71 to edc8487 Compare September 13, 2024 08:59
@ingomueller-net ingomueller-net changed the title [WIP] feat: change grouping expressions in AggregateRel to references feat: change grouping expressions in AggregateRel to references [WIP] Sep 13, 2024
@ingomueller-net
Copy link
Contributor Author

This is still work in progress. The text part of the specification still needs to be adapted. But I guess the protobuf definition allows to agree on what our desired end goal is?

@ingomueller-net ingomueller-net force-pushed the grouping-expression-references branch from ea3f49d to 8f58598 Compare September 13, 2024 13:59
@jacques-n
Copy link
Contributor

I'm supportive of this change although it makes my stomach a little sick. Thoughts @westonpace @EpsilonPrime @vbarua ?

EpsilonPrime
EpsilonPrime previously approved these changes Sep 17, 2024
Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need a description of why and how to use the grouping expressions on the website but the approach is soundly backwards compatible --fix newer consumers will do the right thing for older plans. Producers may duplicate the old grouping expressions with the references to work with older consumers too.

@jacques-n
Copy link
Contributor

I'm +1 on the change but I think the release note/commit message needs to better clarify what needs to separate what was done versus the breaking change part. In the breaking change part we should be describing what specifically needs to be done for migration and what is actually breaking.

@ingomueller-net
Copy link
Contributor Author

Thanks, all, for being supporting and providing feedback! I have added a new commit that propagates the changes to the textual specification and changed/extended the commit message. Please let me know if those changes reflect what you had in mind.

BREAKING CHANGE: This PR changes the definition of grouping sets in
`AggregateRel` to consist of references into a list of grouping
expressions instead of consisting of expressions directly. As discussed
in more detail in substrait-io#700, the previous version is problematic because it
requires consumers to deduplicate these expressions, which, in turn,
requires to parse and understand 100% of these expression even in cases
where that understanding is otherwise optional. The new version avoids
that problem and, thus, allows consumers to be simpler.

Signed-off-by: Ingo Müller <ingomueller@google.com>
This is in line with other repeated fields and was suggested by @tokoko.

Signed-off-by: Ingo Müller <ingomueller@google.com>
Adapt the textual specification of `AgregateRel` in the specification,
which appears on the web site. In particular:

* Introduce the list of grouping expressions and define grouping sets as
  references into that list.
* Specify that each grouping expression must be referred to by at least
  one grouping set.
* Remove the mentioning of expression equality, which is now not needed
  anymore.

The commit also makes two non-semantic changes in the same paragraphs
that aim to clarify the text:

* Rephrase the explanation on when two records are folded into the same
  group to be more clear.
* Mention explicitly that one empty grouping set is equivalent to not
  having any grouping set.

Signed-off-by: Ingo Müller <ingomueller@google.com>
@ingomueller-net ingomueller-net force-pushed the grouping-expression-references branch from 9ada3c8 to b782814 Compare September 23, 2024 08:56
@ingomueller-net ingomueller-net changed the title feat: change grouping expressions in AggregateRel to references [WIP] feat: change grouping expressions in AggregateRel to references Sep 25, 2024
vbarua
vbarua previously approved these changes Sep 26, 2024
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. While this does add a little complexity, I think it removes a lot of ambiguity and avoids the need to specify expression equality which IMO is much harder.

@@ -345,13 +345,13 @@ The aggregate operation groups input data on one or more sets of grouping keys,
| Inputs | 1 |
| Outputs | 1 |
| Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed. |
| Direct Output Order | The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). |
| Direct Output Order | The list of grouping expressions in declaration order followed by the list of measures in declaration order, followed by an `i32` describing the associated particular grouping set the value is derived from (if applicable). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An added benefit of this is that it makes it much easier to compute the output order. Declaration order is very clear ✨

westonpace
westonpace previously approved these changes Sep 26, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might actually be easier to understand than the original formulation (though maybe it's just because I know more now)

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
@vbarua vbarua requested a review from EpsilonPrime September 26, 2024 20:31
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@ingomueller-net ingomueller-net dismissed stale reviews from westonpace and vbarua via 9e0e320 September 27, 2024 07:51
@ingomueller-net
Copy link
Contributor Author

(I accepted @westonpace's suggestion to fix a typo -- now I need a new approval to be able to merge 😉)

@jacques-n
Copy link
Contributor

Given that we had at least 2 +1's before fixing the minor typo plus my vote I think we have 3 +1 votes and am going to merge. Will revert if someone feels I misinterpreted everyone's intent.

@jacques-n jacques-n merged commit 65a7d38 into substrait-io:main Sep 27, 2024
13 checks passed
@jacques-n
Copy link
Contributor

Thanks for your patience here @ingomueller-net , really appreciate the contribution and updates.

@ingomueller-net
Copy link
Contributor Author

Awesome, great to see this being wrapped up. Thank you all for the good discussions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants