Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are RelRoots the "results" of a Plan? #612

Open
ingomueller-net opened this issue Mar 21, 2024 · 8 comments
Open

Are RelRoots the "results" of a Plan? #612

ingomueller-net opened this issue Mar 21, 2024 · 8 comments

Comments

@ingomueller-net
Copy link
Contributor

I am wondering what the "results" of a plan are, i.e., which of the PlanRels of the relations field are expected to be "returned". More concretely, which of the PlanRels could be optimized away or inlined into the other relations and which of them can't (because they are "results" and thus have to be present)? My guess is that that's the purpose of RelRoot (apart from specifying names) but I don't see it specified clearly.

No matter what the answer is, I think it'd be a good idea to add this to the specification, both in the proto files and the documentation/website (which currently doesn't mention PlanRel at all, AFAIK).

@EpsilonPrime
Copy link
Member

For most plans there is only one PlanRel which contains a root. That represents an entire plan.

The reason you can have multiple PlanRels I believe is for communication purposes (say a backend trying to provide a portion or multiple portions of a plan to another).

@ingomueller-net
Copy link
Contributor Author

For most plans there is only one PlanRel which contains a root. That represents an entire plan.

The reason you can have multiple PlanRels I believe is for communication purposes (say a backend trying to provide a portion or multiple portions of a plan to another).

This sounds like all PlanRels are the results of a plan, i.e., the client provides several portions and expects all of them to be executed and retrieved.

What about WITH clauses/non-recursive CTEs, then? They can currently be expressed with Rel/PlanRel in the relation and referred to with ReferenceRel. I'd argue that, if expressed in that form, those should definitely not be considered part of the result; they'd otherwise represent redundant work, potentially orders of magnitude more expensive to compute/retrieve than the actually desired result. To avoid that, the client would have to inline the CTEs itself, which I think puts unnecessary work on the client.

I think it would be a good idea to distinguish between relations returned by the plan and relations that are just temporary.

@westonpace
Copy link
Member

westonpace commented Mar 22, 2024

I think that it is safe to inline Rel into a RelRoot or to discard a Rel that is not referenced by some RelRoot (not even transitively). We can expand the documentation to clarify this.

I don't know if it is safe to declare that there is only one RelRoot and might shy away from calling it "the results of the plan". I believe the intention was for Substrait to be flexible and support case where there are multiple independent plans as @EpsilonPrime was referring. As another example, we once discussed the possibility of sending several "plan alternatives" and letting a consumer/producer pick one. In this case the roots are not the results of the plan. Only one of them is.

I think it'd be a good idea to add this to the specification, both in the proto files and the documentation/website (which currently doesn't mention PlanRel at all, AFAIK).

I think we're open to suggestions here. I think part of the reason this has been left out of the website is that the protobuf is essentially just a "container of plans" and the interpretation of the container layout is left up to the user. In other words, the following "example applications" would all be using substrait correctly, even though they have different interpretations about what the plans in the file are:

  • A consumer that expects exactly one RelRoot, executes it, and returns results
  • A transpiler which expect, as input, exactly one RelRoot, and a list of dialects (transmitted separately from the plan) and returns a buffer with multiple RelRoots (one per dialect), in the same order as the list of dialects.
  • A consumer that receives 1+ RelRoots, picks the first one it is capable of executing, and returns results.
  • A distributed scheduler that expects 1+ RelRoots, each one representing a stage of computation, and schedules that plan (e.g. the first RelRoot is "SCAN -> PARTIAL_AGG -> EXCHANGE" (meant to be run 1-per-partition) and the next RelRoot is "CAPTURE -> FINAL_AGG -> PROJECT" (meant to run on the "final" worker)

As a result, it is difficult for Substrait to define what exactly a "RelRoot" is. However, I do think we could provide some basic description of a "plan" concept with the idea that there are "root plans" and "not root plans" so that the first paragraph in this message is formally defined and intermediate libraries are free to optimize / discard "not root plans".

@ingomueller-net
Copy link
Contributor Author

I think that it is safe to inline Rel into a RelRoot or to discard a Rel that is not referenced by some RelRoot (not even transitively). We can expand the documentation to clarify this.

...

As a result, it is difficult for Substrait to define what exactly a "RelRoot" is. However, I do think we could provide some basic description of a "plan" concept with the idea that there are "root plans" and "not root plans" so that the first paragraph in this message is formally defined and intermediate libraries are free to optimize / discard "not root plans".

Thanks a lot for the detailed reply! The various use cases are quite interesting, actually!

So I think that there is an agreement that there are some Rels that must be preserved to keep the semantic of the plan, namely the (potential) "result"/"root" ones, and some that could be added or removed as long as the overall semantic of the other ones isn't changed, for example, those coming from WITH clauses if they are inlined.

I think that this is compatible with all examples above and independent of what exactly "root" Rels are. I guess that that latter question is part of the interface between the client and server and not of the plan format. (I imagine a computeAll would compute all "root" Rels as opposed to a computeOne, which would only compute one of them.)

It also sounds like RootRel is meant to be exactly those "root"/potential result Rels. Is that right? If yes, I small addition to the documentation would indeed be good, I think.

@ingomueller-net
Copy link
Contributor Author

Borderline off-topic: I am thinking of an analogy to symbol visibility in C/C++: symbols can be public, in which case the compiler must put the symbol into the resulting binary, or hidden, in which case it can inline them or remove them, if they aren't used by any other symbol. In other words, only the public ones are part of the interface while the hidden ones are not.

Similarly, the RelRoots seem to be "public" and the remaining Rels "hidden".

@ingomueller-net
Copy link
Contributor Author

Let me add two data points to the discussion:

  • DuckDB only accepts plans whose first relation is a RelRoot (and doesn't even properly handle the case where the first relation is simply a Rel, see here).
  • Acero takes any of the two (see here).

I think that this shows that the two groups of developers have not understood the specification in the same way.

@westonpace
Copy link
Member

I think that this shows that the two groups of developers have not understood the specification in the same way.

Acero was a pretty early adopter of Substrait. A common problem we encountered was receiving plans that were technically invalid but also generally understandable. Since Acero is not a validator and we didn't want to be blocked by producer bugs we went ahead and made the consumer fairly permissive where it seemed reasonably straightforward what the intention was. Another example of this is that most producers do not set the URI of functions (or set it to "/") which Acero tolerates as well.

When Acero produces plans it always generates a single RelRoot.

@westonpace
Copy link
Member

I have made an attempt at clarifying the behavior in #616

jacques-n pushed a commit that referenced this issue Aug 1, 2024
#612 and #613 highlighted a number of ambiguities in plan
interpretation. This should address those ambiguities. 

Co-authored-by: Ingo Müller <github.com@ingomueller.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants