Add support for recursive CTEs #7581
I am considering whether the `NamedRelation` and `RecursiveQuery` could be implemented as two `TableSource`s, one being `CTESelfRefTable` and the other being `CTERecursiveTable`, and then use TableScan to read them.
Use `CTESelfRefTable` within the recursive term and `CTERecursiveTable` in the outer query.
But this idea is in its early stages and may be wrong.
@jonahgao, could you provide the rationale for your suggested strategy? I'm interested in understanding why it might be more effective than the current implementation. Performance is critical to our use case. And the implementation for recursion is very sensitive to performance considerations, as the setup for execution and stream management isn't amortized over all input record batches. Instead, it's incurred with each iteration. For instance, we've observed a substantial performance boost—up to 30 times faster—by eliminating certain intermediate nodes, like coalesce, from our plan (as evidenced in this PR). I've drafted another PR that appears to again double the speed of execution merely by omitting metric collection in recursive sub-graphs.
One rationale might be to make the implementation simpler -- if we could implement the recursive relation as a table provider, it would likely allow the changes to be more localized / smaller (e.g. maybe we could reuse `MemTable::load` to update the batches on each iteration)
Basically I understand the need to have `LogicalPlan::RecursiveQuery` but I don't (yet) understand the need to have the `NamedRelation`
`NamedRelation` is primarily a way to mirror batches back to the `RecursiveQuery` via its physical counterpart, `ContinuanceExec`
@matthewgapp Another rationale might be to support pushing down filters to the working table, which may be useful if we support spilling the working table to disk in the future. I think performance should not be affected, since the execution of the physical plans is almost the same as it is now.
I implemented a demo on this branch and in this commit. GitHub does not allow forking a repository twice, so I directly pushed it to another repository for convenience.
In this demo, I attempted to replace the `NamedRelation` with a `TableProvider`, namely `CteWorkTable`. The benefit of this is that it can avoid maintaining a new logical plan.
Another change is that I used a structure called `WorkTable` to connect the `RecursiveQueryExec` and the `WorkTableExec` (it was previously `ContinuanceExec`). The advantage of this is that it avoids maintaining some external context information, such as `relation_handlers` in `TaskContext`, and the `ctx` in `create_initial_plan`.
The `WorkTable` is a shared table, it will be scanned by the `WorkTableExec` during the execution of the recursive term, and after the execution is completed, the results will be written back to it.
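The shared-table flow described above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from the demo branch: `Batch`, `WorkTableScan`, `run_recursive_query`, and the `take`/`write` methods are hypothetical stand-ins for record batches, `WorkTableExec`, `RecursiveQueryExec`, and whatever interface the real `WorkTable` exposes.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a record batch: a plain vector of rows.
type Batch = Vec<i64>;

/// Shared work table holding the batches produced by the previous iteration.
#[derive(Default)]
struct WorkTable {
    batches: Mutex<Option<Vec<Batch>>>,
}

impl WorkTable {
    /// Hand out the stored batches, leaving the table empty.
    fn take(&self) -> Vec<Batch> {
        self.batches.lock().unwrap().take().unwrap_or_default()
    }

    /// Write the results of an iteration back into the table.
    fn write(&self, batches: Vec<Batch>) {
        *self.batches.lock().unwrap() = Some(batches);
    }
}

/// Leaf of the recursive term (assumed analogue of `WorkTableExec`):
/// it only reads whatever the shared table currently holds.
struct WorkTableScan {
    table: Arc<WorkTable>,
}

impl WorkTableScan {
    fn execute(&self) -> Vec<Batch> {
        self.table.take()
    }
}

/// Driver (assumed analogue of `RecursiveQueryExec`): seeds the table with
/// the static term, then re-runs the recursive term until it yields no rows.
fn run_recursive_query(
    static_term: Vec<Batch>,
    recursive_term: impl Fn(&WorkTableScan) -> Vec<Batch>,
) -> Vec<i64> {
    let table = Arc::new(WorkTable::default());
    let scan = WorkTableScan { table: Arc::clone(&table) };

    // Rows of the static term are part of the final output.
    let mut output: Vec<i64> = static_term.iter().flatten().copied().collect();
    table.write(static_term);

    loop {
        // The recursive term scans the work table through `scan`.
        let produced: Vec<Batch> = recursive_term(&scan)
            .into_iter()
            .filter(|batch| !batch.is_empty())
            .collect();
        if produced.is_empty() {
            break; // no new rows: the recursion has terminated
        }
        output.extend(produced.iter().flatten().copied());
        // Results are written back for the next iteration to scan.
        table.write(produced);
    }
    output
}
```

With this shape, a query like `WITH RECURSIVE nums(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM nums WHERE n < 5)` would correspond to a static term of `vec![vec![1]]` and a recursive term that filters and increments the scanned rows. No external context is needed because the driver and the scan share only the `Arc<WorkTable>`.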
Wow, tyty! I was in the process of implementing the shared table and my implementation turned out very similar to yours although I ended up working around the crate dependency graph constraints a bit differently by introducing a couple new traits. But I did end up exposing a method on the context to generate a table. I like your approach better.
I tested out your poc and performance remains about the same between my previous implementation and your new worktable approach! (which makes sense).
I'm going to go ahead and work based on your POC toward the list of PRs that Andrew wants to get this landed.
Thank you for your work and for the upcoming contributions, @matthewgapp!