Description
Is your feature request related to a problem or challenge?
(note this is mostly from @peter-toth 's description on #8891 (comment))
As @peter-toth noted, the current TreeNode
APIs are somewhat inconsistent which has the following challenges:
- adds a barrier to entry for newcomers to the DataFusion codebase
- The APIs sometimes can not (or do no) implement pruning / early tree traversal termination and therefore do more tree walking / transforming than necessary
- Increases code maintenance costs
Specific Challenges of the Current APIs
Some specific challenges with the current APIs. These can cause a newcommer (like me) some confusion:
-
The
apply
/visit
functions are capable to prune (skip) subtrees or return immediately (stop) but thetransform
functions are not. To make it more confusionrewrite
is also capable to to prune, but instead of using a common way to control recursion,visit
andrewrite
use different enums with conflicting semantics.
See this PR (ConsolidateTreeNode
transform and rewrite APIs #8891) for details onVisitRecursion
vsRewriteRecursion
. -
There is this
Transformed
enum whose purpose is to explicitely define if any change was made to a node. This enum is used intransform
clousres but for some reason it is missing fromrewrite
. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes.
See details in RemoveTransformed
enum #8835. I believe the current state simply doesn't make sense and just causes confusion. -
rewrite
also seems to be inconsistent withtransform_up
andtransform_down
.rewrite
probably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures fromtransform_up
andtransform_down
to define:- what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
- and how the recursion should continue .
See this PR (Consolidate
TreeNode
transform and rewrite APIs #8891) for the details. IMO the currentTreeNodeRewriter.pre_visit()
seems like a mess and its logic is hard to grasp.
Specific missing APIs
- Pruning capability of
transform
functions:
Pruning would be very important as many of the transformations wouldn't require traversing on the whole tree and could improve performance a lot. - Payload propagation:
I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In Transform with payload #8664 with the help oftransform_with_payload
I not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).
Describe the solution you'd like
The proposal is a general API like this:
/// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited) while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PC` type payload to pass to `f_up`,
/// - a `Vec<FD>` type payload to propagate down to its children
/// (one `FD` element is propagated down to each child),
/// - a `TreeNodeRecursion` enum element to control recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `FU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
self,
f_down: &mut FD,
payload_down: PD,
f_up: &mut FU,
) -> Result<(Self, PU)>
where
FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,
Obviously the closure return types can be extracted to a type alias or struc
s like in the case of f_down
could be:
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}
All the other functions can then be turned into a specialization of the above:
apply
(visit_down
): Onlyf_down
closure that doesn't return a modified node and payload (only retrurnsTreeNodeRecursion
)visit
/TreeNodeVisitor
: Bothf_down
andf_up
closures needed. The closures can be incorporated into aTreeNodeVisitor
object but they don't return a modified node and payloadtransform_down
: Onlyf_down
closure that doesn't return payloadtransform_up
: Onlyf_up
closure that doesn't return payloadtransform
: Bothf_down
andf_up
closures needed, but they don't return payloadsrewrite
/TreeNodeRewriter
: Bothf_down
andf_up
are incorporated into aTreeNodeRewriter
object, but they don't return a payloadtransform_down_with_payload
: Onlyf_down
closuretransform_up_with_payload
: Onlyf_up
closure
Describe alternatives you've considered
As noted above, there are a variety of PRs that test out various ideas in one way or another and the major challenge I think is figuring out what changes to make, in what order, and how much code churn to apply)
Additional context
Related tickets / PRs:
- Consolidate
TreeNode
transform and rewrite APIs #8891 - TreeNode refactor code deduplication: Part 3 #8817
- Transform with payload #8664
- Remove
Transformed
enum #8835 - Simplify Expr::map_children #9876
- Consistent LogicalPlan subquery handling in TreeNode::apply and TreeNode::visit #9913
- Simplify TreeNode recursions #9965
- Consolidate
LogicalPlan
tree node walking / rewriting code in one place #9994 - Deprecate TreeNode
transform_xx_mut
methods #10097