Skip to content

Epic: Unified TreeNode rewrite API #8913

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

(note this is mostly from @peter-toth 's description on #8891 (comment))

As @peter-toth noted, the current TreeNode APIs are somewhat inconsistent which has the following challenges:

  1. adds a barrier to entry for newcomers to the DataFusion codebase
  2. The APIs sometimes can not (or do no) implement pruning / early tree traversal termination and therefore do more tree walking / transforming than necessary
  3. Increases code maintenance costs

Specific Challenges of the Current APIs

Some specific challenges with the current APIs. These can cause a newcommer (like me) some confusion:

  • The apply / visit functions are capable to prune (skip) subtrees or return immediately (stop) but the transform functions are not. To make it more confusion rewrite is also capable to to prune, but instead of using a common way to control recursion, visit and rewrite use different enums with conflicting semantics.
    See this PR (Consolidate TreeNode transform and rewrite APIs #8891) for details on VisitRecursion vs RewriteRecursion.

  • There is this Transformed enum whose purpose is to explicitely define if any change was made to a node. This enum is used in transform clousres but for some reason it is missing from rewrite. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes.
    See details in Remove Transformed enum #8835. I believe the current state simply doesn't make sense and just causes confusion.

  • rewrite also seems to be inconsistent with transform_up and transform_down. rewrite probably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures from transform_up and transform_down to define:

    • what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
    • and how the recursion should continue .

    See this PR (Consolidate TreeNode transform and rewrite APIs #8891) for the details. IMO the current TreeNodeRewriter.pre_visit() seems like a mess and its logic is hard to grasp.

Specific missing APIs

  • Pruning capability of transform functions:
    Pruning would be very important as many of the transformations wouldn't require traversing on the whole tree and could improve performance a lot.
  • Payload propagation:
    I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In Transform with payload #8664 with the help of transform_with_payload I not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).

Describe the solution you'd like

The proposal is a general API like this:

    /// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
    /// node while traversing the tree top-down (pre-order, before the node's children are
    /// visited) while `f_up` is applied on a node while traversing the tree bottom-up
    /// (post-order, after the the node's children are visited).
    ///
    /// The `f_down` closure takes
    /// - a `PD` type payload from its parent
    /// and returns a tuple made of:
    /// - a possibly modified node,
    /// - a `PC` type payload to pass to `f_up`,
    /// - a `Vec<FD>` type payload to propagate down to its children
    ///    (one `FD` element is propagated down to each child),
    /// - a `TreeNodeRecursion` enum element to control recursion flow.
    ///
    /// The `f_up` closure takes
    /// - a `PC` type payload from `f_down` and
    /// - a `Vec<PU>` type payload collected from its children
    /// and returns a tuple made of:
    /// - a possibly modified node,
    /// - a `FU` type payload to propagate up to its parent,
    /// - a `TreeNodeRecursion` enum element to control recursion flow.
    fn transform_with_payload<FD, PD, PC, FU, PU>(
        self,
        f_down: &mut FD,
        payload_down: PD,
        f_up: &mut FU,
    ) -> Result<(Self, PU)>
    where
        FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
        FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,

Obviously the closure return types can be extracted to a type alias or strucs like in the case of f_down could be:

pub struct TransformDownResult<N, PD, PC> {
    pub transformed_node: N,
    pub payload_to_children: Vec<PD>,
    pub payload_to_f_up: PC,
    pub recursion_control: TreeNodeRecursion,
}

All the other functions can then be turned into a specialization of the above:

  • apply (visit_down): Only f_down closure that doesn't return a modified node and payload (only retrurns TreeNodeRecursion)
  • visit / TreeNodeVisitor: Both f_down and f_up closures needed. The closures can be incorporated into a TreeNodeVisitor object but they don't return a modified node and payload
  • transform_down: Only f_down closure that doesn't return payload
  • transform_up: Only f_up closure that doesn't return payload
  • transform: Both f_down and f_up closures needed, but they don't return payloads
  • rewrite / TreeNodeRewriter: Both f_down and f_up are incorporated into a TreeNodeRewriter object, but they don't return a payload
  • transform_down_with_payload: Only f_down closure
  • transform_up_with_payload: Only f_up closure

Describe alternatives you've considered

As noted above, there are a variety of PRs that test out various ideas in one way or another and the major challenge I think is figuring out what changes to make, in what order, and how much code churn to apply)

Additional context

Related tickets / PRs:

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions