Switch to iterative DynNode and ConcreteTreeNode processing #13177

Conversation

@blaginin (Contributor) commented Oct 29, 2024

Part of #9373

Rationale for this change

Currently, on large trees, visiting and rewriting can cause a stack overflow because the node-processing algorithms are recursive.

What changes are included in this PR?

This PR implements iterative DFS-like tree processing for DynTreeNode. In the next PR, I will also update ConcreteTreeNode to use these functions.
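In a nutshell, the iterative approach replaces the implicit call stack with an explicit heap-allocated one, so tree depth no longer consumes OS stack space. A minimal sketch of an iterative pre-order visit (illustrative only; `Node` and `visit_iterative` are made-up names, not the actual code in this PR):

```rust
use std::sync::Arc;

// Hypothetical tree node with shared (Arc) children, similar in spirit to DynTreeNode.
struct Node {
    value: i64,
    children: Vec<Arc<Node>>,
}

// Iterative pre-order visit: an explicit Vec-based stack replaces recursion,
// so a deep tree grows the heap, not the call stack.
fn visit_iterative(root: Arc<Node>, f: &mut impl FnMut(&Node)) {
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        f(node.as_ref());
        // Push children in reverse so they are visited left-to-right.
        for child in node.children.iter().rev() {
            stack.push(Arc::clone(child));
        }
    }
}

fn main() {
    let leaf = |v| Arc::new(Node { value: v, children: vec![] });
    let root = Arc::new(Node { value: 1, children: vec![leaf(2), leaf(3)] });
    let mut sum = 0;
    visit_iterative(root, &mut |n| sum += n.value);
    assert_eq!(sum, 6);
}
```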

Are these changes tested?

Yes, I've added a new test case to the existing tests and included a large-tree test that fails with the recursive approach but passes with the iterative one.

The diff for the tests looks a bit messy, but it's really just moving repetitive parts into a macro.

Are there any user-facing changes?

Yes, a minor update to with_new_arc_children.

@github-actions bot added the physical-expr (Physical Expressions) and common (Related to common crate) labels Oct 29, 2024
@blaginin blaginin changed the title Switch to recursive node DynNode visit and rewriting Switch to recursive node DynNode visiting and rewriting Oct 29, 2024
@blaginin blaginin changed the title Switch to recursive node DynNode visiting and rewriting Switch to iterative node DynNode visiting and rewriting Oct 29, 2024
@blaginin blaginin changed the title Switch to iterative node DynNode visiting and rewriting Switch to iterative DynNode visiting and rewriting Oct 29, 2024
@blaginin blaginin force-pushed the blaginin/switch-to-recursive-tree-iteration branch from b6fa0a7 to e88f47f Compare October 29, 2024 21:49
@blaginin blaginin force-pushed the blaginin/switch-to-recursive-tree-iteration branch from e88f47f to 83d194e Compare October 29, 2024 21:54
@blaginin blaginin force-pushed the blaginin/switch-to-recursive-tree-iteration branch from e6600ea to e475464 Compare October 30, 2024 20:44
@blaginin blaginin marked this pull request as ready for review October 30, 2024 20:45
@blaginin (Contributor, Author)

@peter-toth Hey! I think you've worked on quite a few methods modified by this PR, in case you'd like to take a look 😀

@peter-toth (Contributor) commented Oct 30, 2024

@peter-toth Hey! I think you've worked on quite a few methods modified by this PR, in case you'd like to take a look 😀

Hi @blaginin, thanks for pinging me. I like the iterative approach, but I'm a bit concerned that this API can be much slower than the recursive one. Do you think you could run some benchmarks?
If this implementation turns out to be slow then maybe we could keep the recursive way and use a library like stacker or recursive to adjust the stack size.
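For reference, the stacker route keeps the algorithm recursive and simply grows the stack on demand when a call detects it is close to running out; the recursive crate wraps the same pattern behind a `#[recursive]` attribute. A minimal sketch (hypothetical `Node`/`count`; the 64 KiB red zone and 1 MiB growth size are illustrative values only):

```rust
struct Node {
    children: Vec<Node>,
}

// The count stays recursive, but each call checks the remaining stack and,
// if less than the red zone is left, continues on a freshly allocated segment.
fn count(node: &Node) -> usize {
    stacker::maybe_grow(64 * 1024, 1024 * 1024, || {
        1 + node.children.iter().map(count).sum::<usize>()
    })
}
```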

cc @alamb, @ozankabak, @berkaysynnada

@berkaysynnada (Contributor) commented Oct 31, 2024

Thank you @blaginin. First, I want to express that I share @peter-toth’s concern. Most of the rules are written with the assumption that the traversal is ~~BFS~~ DFS, and a change in this behavior is likely to cause a slowdown in planning. Perhaps we could allow the user to select the traversal type.

I also recommend making the test changes in a follow-up PR to facilitate easier review.

@ozankabak (Contributor)

I agree with both suggestions (benchmarks to see the impact and avoiding test changes in the first PR)

@peter-toth (Contributor) commented Oct 31, 2024

Hmm, do we have rules that expect BFS traversal? I think the current APIs only support pre-order/post-order (DFS like) traversals.
My perf. concern is more about the extra heap allocations that iterative implementations like this one need.

Also, I don't see yet how this will work on tree nodes that own their children (e.g. Expr), where cloning a node's children is costly and taking the children is not straightforward, while the current map_children() approach works quite well.
If the iterative implementation is not feasible on those trees then we should probably prefer a solution that works well on all kinds of trees.
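As a concrete (made-up) illustration of that point: on an owned tree, the recursive, map_children-style rewrite consumes each node and rebuilds it in place, so no subtree is ever cloned, which is exactly what is awkward to reproduce with an explicit heap stack (the types and the constant_fold rule below are hypothetical, not DataFusion's Expr API):

```rust
// A tree that owns its children directly, like Expr does.
enum OwnedNode {
    Leaf(i64),
    Add(Box<OwnedNode>, Box<OwnedNode>),
}

impl OwnedNode {
    // map_children-style helper: rebuild the node from transformed children,
    // consuming `self`, so nothing needs to be cloned.
    fn map_children(self, f: fn(OwnedNode) -> OwnedNode) -> OwnedNode {
        match self {
            OwnedNode::Leaf(v) => OwnedNode::Leaf(v),
            OwnedNode::Add(l, r) => OwnedNode::Add(Box::new(f(*l)), Box::new(f(*r))),
        }
    }
}

// A bottom-up rewrite rule written recursively: rewrite the children first,
// then try to fold the node itself.
fn constant_fold(expr: OwnedNode) -> OwnedNode {
    let expr = expr.map_children(constant_fold);
    match expr {
        OwnedNode::Add(l, r) => match (*l, *r) {
            (OwnedNode::Leaf(a), OwnedNode::Leaf(b)) => OwnedNode::Leaf(a + b),
            (l, r) => OwnedNode::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```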

@berkaysynnada (Contributor)

Hmm, do we have rules that expect BFS traversal? I think the current APIs only support pre-order/post-order (DFS like) traversals. My perf. concern is more about the extra heap allocations that iterative implementations like this one need.

I made a typo, I meant DFS (:

@berkaysynnada (Contributor)

If the iterative implementation is not feasible on those trees then we should probably prefer a solution that works well on all kinds of trees.

I agree with you. Those clones could be huge. @blaginin, do you have time to implement a similar version for ConcreteTreeNodes and to take some benchmarks?

@peter-toth (Contributor) commented Oct 31, 2024

With ConcreteTreeNodes you can call take_children() to separate a node and its children to avoid cloning, but with Exprs you can't (or it is very tricky/ugly).

@alamb (Contributor) commented Oct 31, 2024

I will run the planning benchmarks on this PR

Commits pushed: reuse vec instead of making a new one; use mut ref to the last element instead of taking it on every iteration.
@alamb (Contributor) commented Oct 31, 2024

Here is what I got:

++ critcmp main blaginin_switch-to-recursive-tree-iteration
group                                         blaginin_switch-to-recursive-tree-iteration    main
-----                                         -------------------------------------------    ----
logical_aggregate_with_join                   1.00  1486.6±17.07µs        ? ?/sec            1.00  1482.2±13.86µs        ? ?/sec
logical_select_all_from_1000                  1.00      5.2±0.02ms        ? ?/sec            1.00      5.2±0.02ms        ? ?/sec
logical_select_one_from_700                   1.00  1181.2±14.48µs        ? ?/sec            1.01  1195.3±15.11µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00  1149.0±15.85µs        ? ?/sec            1.00  1151.4±15.81µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00  1138.9±75.30µs        ? ?/sec            1.00  1136.1±38.52µs        ? ?/sec
physical_intersection                         1.00      2.4±0.02ms        ? ?/sec            1.00      2.4±0.02ms        ? ?/sec
physical_join_consider_sort                   1.01      3.3±0.02ms        ? ?/sec            1.00      3.3±0.02ms        ? ?/sec
physical_join_distinct                        1.00  1123.1±14.51µs        ? ?/sec            1.01  1131.0±14.86µs        ? ?/sec
physical_many_self_joins                      1.00     17.2±0.14ms        ? ?/sec            1.00     17.1±0.10ms        ? ?/sec
physical_plan_clickbench_all                  1.01    229.1±2.04ms        ? ?/sec            1.00    226.5±2.10ms        ? ?/sec
physical_plan_clickbench_q1                   1.00      3.3±0.04ms        ? ?/sec            1.02      3.3±0.07ms        ? ?/sec
physical_plan_clickbench_q10                  1.01      4.4±0.05ms        ? ?/sec            1.00      4.3±0.06ms        ? ?/sec
physical_plan_clickbench_q11                  1.01      4.5±0.05ms        ? ?/sec            1.00      4.5±0.08ms        ? ?/sec
physical_plan_clickbench_q12                  1.01      4.7±0.07ms        ? ?/sec            1.00      4.6±0.07ms        ? ?/sec
physical_plan_clickbench_q13                  1.01      4.3±0.06ms        ? ?/sec            1.00      4.2±0.06ms        ? ?/sec
physical_plan_clickbench_q14                  1.01      4.5±0.05ms        ? ?/sec            1.00      4.5±0.07ms        ? ?/sec
physical_plan_clickbench_q15                  1.01      4.4±0.08ms        ? ?/sec            1.00      4.3±0.07ms        ? ?/sec
physical_plan_clickbench_q16                  1.00      3.8±0.05ms        ? ?/sec            1.00      3.8±0.06ms        ? ?/sec
physical_plan_clickbench_q17                  1.00      3.9±0.04ms        ? ?/sec            1.01      3.9±0.06ms        ? ?/sec
physical_plan_clickbench_q18                  1.00      3.6±0.05ms        ? ?/sec            1.01      3.6±0.06ms        ? ?/sec
physical_plan_clickbench_q19                  1.01      4.6±0.06ms        ? ?/sec            1.00      4.5±0.05ms        ? ?/sec
physical_plan_clickbench_q2                   1.00      3.6±0.09ms        ? ?/sec            1.01      3.7±0.05ms        ? ?/sec
physical_plan_clickbench_q20                  1.00      3.3±0.04ms        ? ?/sec            1.00      3.3±0.04ms        ? ?/sec
physical_plan_clickbench_q21                  1.00      3.6±0.06ms        ? ?/sec            1.02      3.7±0.10ms        ? ?/sec
physical_plan_clickbench_q22                  1.02      4.6±0.09ms        ? ?/sec            1.00      4.5±0.04ms        ? ?/sec
physical_plan_clickbench_q23                  1.01      5.0±0.08ms        ? ?/sec            1.00      5.0±0.06ms        ? ?/sec
physical_plan_clickbench_q24                  1.00      5.7±0.05ms        ? ?/sec            1.00      5.7±0.08ms        ? ?/sec
physical_plan_clickbench_q25                  1.02      4.0±0.05ms        ? ?/sec            1.00      3.9±0.04ms        ? ?/sec
physical_plan_clickbench_q26                  1.00      3.6±0.04ms        ? ?/sec            1.00      3.6±0.05ms        ? ?/sec
physical_plan_clickbench_q27                  1.02      4.1±0.05ms        ? ?/sec            1.00      4.0±0.05ms        ? ?/sec
physical_plan_clickbench_q28                  1.01      4.8±0.07ms        ? ?/sec            1.00      4.7±0.06ms        ? ?/sec
physical_plan_clickbench_q29                  1.01      5.8±0.07ms        ? ?/sec            1.00      5.7±0.11ms        ? ?/sec
physical_plan_clickbench_q3                   1.00      3.5±0.04ms        ? ?/sec            1.03      3.6±0.05ms        ? ?/sec
physical_plan_clickbench_q30                  1.00     16.7±0.16ms        ? ?/sec            1.00     16.7±0.17ms        ? ?/sec
physical_plan_clickbench_q31                  1.00      4.8±0.07ms        ? ?/sec            1.01      4.9±0.17ms        ? ?/sec
physical_plan_clickbench_q32                  1.01      4.8±0.09ms        ? ?/sec            1.00      4.8±0.06ms        ? ?/sec
physical_plan_clickbench_q33                  1.01      4.4±0.06ms        ? ?/sec            1.00      4.3±0.06ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.9±0.05ms        ? ?/sec            1.00      3.9±0.05ms        ? ?/sec
physical_plan_clickbench_q35                  1.00      4.0±0.04ms        ? ?/sec            1.00      4.0±0.06ms        ? ?/sec
physical_plan_clickbench_q36                  1.01      5.2±0.07ms        ? ?/sec            1.00      5.2±0.07ms        ? ?/sec
physical_plan_clickbench_q37                  1.01      5.3±0.07ms        ? ?/sec            1.00      5.2±0.06ms        ? ?/sec
physical_plan_clickbench_q38                  1.01      5.2±0.08ms        ? ?/sec            1.00      5.2±0.09ms        ? ?/sec
physical_plan_clickbench_q39                  1.01      4.8±0.07ms        ? ?/sec            1.00      4.8±0.06ms        ? ?/sec
physical_plan_clickbench_q4                   1.00      3.3±0.04ms        ? ?/sec            1.03      3.4±0.05ms        ? ?/sec
physical_plan_clickbench_q40                  1.02      5.4±0.08ms        ? ?/sec            1.00      5.3±0.06ms        ? ?/sec
physical_plan_clickbench_q41                  1.01      5.1±0.07ms        ? ?/sec            1.00      5.1±0.08ms        ? ?/sec
physical_plan_clickbench_q42                  1.01      4.9±0.07ms        ? ?/sec            1.00      4.9±0.06ms        ? ?/sec
physical_plan_clickbench_q43                  1.02      5.1±0.06ms        ? ?/sec            1.00      5.0±0.12ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      3.5±0.04ms        ? ?/sec            1.00      3.5±0.04ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      3.5±0.04ms        ? ?/sec            1.00      3.5±0.05ms        ? ?/sec
physical_plan_clickbench_q46                  1.00      4.1±0.06ms        ? ?/sec            1.00      4.1±0.07ms        ? ?/sec
physical_plan_clickbench_q47                  1.01      5.5±0.11ms        ? ?/sec            1.00      5.5±0.06ms        ? ?/sec
physical_plan_clickbench_q48                  1.00      5.4±0.08ms        ? ?/sec            1.00      5.3±0.07ms        ? ?/sec
physical_plan_clickbench_q49                  1.01      5.7±0.09ms        ? ?/sec            1.00      5.6±0.10ms        ? ?/sec
physical_plan_clickbench_q5                   1.00      3.5±0.04ms        ? ?/sec            1.02      3.6±0.05ms        ? ?/sec
physical_plan_clickbench_q6                   1.00      3.5±0.05ms        ? ?/sec            1.00      3.6±0.07ms        ? ?/sec
physical_plan_clickbench_q7                   1.01      4.0±0.05ms        ? ?/sec            1.00      4.0±0.05ms        ? ?/sec
physical_plan_clickbench_q8                   1.02      3.9±0.13ms        ? ?/sec            1.00      3.8±0.06ms        ? ?/sec
physical_plan_clickbench_q9                   1.01      4.2±0.06ms        ? ?/sec            1.00      4.1±0.07ms        ? ?/sec
physical_plan_tpcds_all                       1.03  1388.8±10.90ms        ? ?/sec            1.00   1350.0±2.64ms        ? ?/sec
physical_plan_tpch_all                        1.03     89.9±0.57ms        ? ?/sec            1.00     87.5±0.61ms        ? ?/sec
physical_plan_tpch_q1                         1.02      3.2±0.02ms        ? ?/sec            1.00      3.2±0.03ms        ? ?/sec
physical_plan_tpch_q10                        1.04      4.4±0.03ms        ? ?/sec            1.00      4.3±0.02ms        ? ?/sec
physical_plan_tpch_q11                        1.03      4.0±0.03ms        ? ?/sec            1.00      3.9±0.03ms        ? ?/sec
physical_plan_tpch_q12                        1.02      3.1±0.02ms        ? ?/sec            1.00      3.0±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.03      2.5±0.02ms        ? ?/sec            1.00      2.4±0.01ms        ? ?/sec
physical_plan_tpch_q14                        1.02      2.8±0.02ms        ? ?/sec            1.00      2.7±0.03ms        ? ?/sec
physical_plan_tpch_q16                        1.03      4.0±0.03ms        ? ?/sec            1.00      3.9±0.02ms        ? ?/sec
physical_plan_tpch_q17                        1.03      3.7±0.03ms        ? ?/sec            1.00      3.6±0.02ms        ? ?/sec
physical_plan_tpch_q18                        1.03      4.1±0.03ms        ? ?/sec            1.00      4.0±0.02ms        ? ?/sec
physical_plan_tpch_q19                        1.01      5.7±0.04ms        ? ?/sec            1.00      5.6±0.03ms        ? ?/sec
physical_plan_tpch_q2                         1.03      7.5±0.04ms        ? ?/sec            1.00      7.2±0.05ms        ? ?/sec
physical_plan_tpch_q20                        1.02      4.7±0.03ms        ? ?/sec            1.00      4.6±0.04ms        ? ?/sec
physical_plan_tpch_q21                        1.02      6.1±0.03ms        ? ?/sec            1.00      5.9±0.07ms        ? ?/sec
physical_plan_tpch_q22                        1.02      3.6±0.02ms        ? ?/sec            1.00      3.5±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.03      3.3±0.04ms        ? ?/sec            1.00      3.2±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.02      2.6±0.02ms        ? ?/sec            1.00      2.6±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.03      4.4±0.03ms        ? ?/sec            1.00      4.3±0.05ms        ? ?/sec
physical_plan_tpch_q6                         1.02  1832.7±63.56µs        ? ?/sec            1.00  1800.8±13.81µs        ? ?/sec
physical_plan_tpch_q7                         1.03      5.8±0.03ms        ? ?/sec            1.00      5.6±0.08ms        ? ?/sec
physical_plan_tpch_q8                         1.03      6.8±0.06ms        ? ?/sec            1.00      6.6±0.04ms        ? ?/sec
physical_plan_tpch_q9                         1.04      5.4±0.03ms        ? ?/sec            1.00      5.2±0.03ms        ? ?/sec
physical_select_aggregates_from_200           1.01     29.4±0.18ms        ? ?/sec            1.00     29.1±0.18ms        ? ?/sec
physical_select_all_from_1000                 1.00     39.9±0.17ms        ? ?/sec            1.00     40.0±0.21ms        ? ?/sec
physical_select_one_from_700                  1.00      3.3±0.02ms        ? ?/sec            1.00      3.3±0.02ms        ? ?/sec
physical_theta_join_consider_sort             1.01      3.7±0.02ms        ? ?/sec            1.00      3.6±0.02ms        ? ?/sec
physical_unnest_to_join                       1.00      3.3±0.02ms        ? ?/sec            1.00      3.3±0.02ms        ? ?/sec

Possibly a tiny slowdown (like 1%) but also hard to tell given it is so small

@berkaysynnada (Contributor)

Possibly a tiny slowdown (like 1%) but also hard to tell given it is so small

It would possibly be more significant when we apply it to concrete types and Exprs.

@blaginin blaginin changed the title Switch to iterative DynNode visiting and rewriting Switch to iterative DynNode and ConcreteTreeNode processing Nov 1, 2024
@blaginin (Contributor, Author) commented Nov 1, 2024

Thanks for the comments! 🙏

As for the performance, I removed a few extra Vec creations and pushes. I have also added the implementation for ConcreteNode as @berkaysynnada asked. Benchmarks for me haven't really changed.


As for the tests, I'm not sure I got the idea. This PR reimplements some methods in the interfaces, while the existing tests only call the main implementation. If I remove the test changes, then the current PR won't be tested, which I am not sure is desirable?


Also, I don't see yet how this will work on tree nodes that own their children (e.g., Expr)

I think this is a very valid point, @peter-toth! Expr and also LogicalPlan are both hard to switch to ConcreteNode. The solution will definitely be clumsier than just using a library. But I think the reason we might still want to try is this:

If your recursive algorithm is very performance-sensitive I would suggest rewriting it to an iterative version regardless.

Apart from the obvious implementation solutions, I was thinking about capturing unpacked objects in closures, but maybe we can think of something better? This is definitely a very big piece of work, with a lot of discussion needed and potentially quite a few breaking changes...

However, I am not sure we should have this discussion as part of this PR. My idea here was to fix the cases that can easily be fixed. So now, even without breaking changes, we can fix quite a few of the existing structs, as well as future ones that implement the supported interface. And even if we decide to introduce tools like recursive later, I’d argue that we’d still aim to use them only when an iterative approach isn’t feasible, as the docs suggest.

@peter-toth (Contributor) commented Nov 2, 2024

Thank you @blaginin for improving the PR and pointing me to those conversations. Unfortunately I missed those and wasn't aware that there had already been an attempt to add stack-growth libraries.

I followed the commits yesterday and saw how each of them made the implementation better and better. I could only suggest minor changes, e.g. TransformingState::ProcessingChildren probably doesn't need 2 separate vecs, just an index to track which child nodes have already been transformed.
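For illustration, a single-vec version of that state could look roughly like this (hypothetical names and fields; the PR's actual enum may differ). The index marks the boundary between already-transformed and still-pending children, avoiding a second allocation and the per-child moves between two vecs:

```rust
// children[..next] have already been transformed; children[next..] are pending.
enum TransformingState<N> {
    NotStarted(N),
    ProcessingChildren {
        node: N,
        children: Vec<N>,
        next: usize,
    },
}
```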

But, on the other hand, I also looked into stacker and recursive. stacker seems like a very mature crate that has been used in many projects, while recursive is just a convenience layer over stacker that makes it even simpler to use.

Honestly, I don't share the concerns about these libraries. stacker is owned by the rust-lang team and even the Rust compiler itself uses it: https://github.com/rust-lang/rust/blob/master/compiler/rustc_data_structures/src/stack.rs#L14-L22. recursive seems to have been created by the Polars folks, and it looks like almost only they use it.

By "running some benchmarks" I was referring to running the old vs. new APIs on large trees with many nodes. IMO the planning benchmarks tell a lot about everyday queries, but they don't contain any extreme cases with very large trees.

I forked your branch yesterday: https://github.com/peter-toth/datafusion/commits/recursive-vs-iterative/ (not the latest state, but after the visit improvement) and in 3 commits separated the old (recursive) and new (iterative) visit API implementations + added recursive to the old + added a small benchmark to just quickly test the old vs new visit API on tall and wide Arc<DynTestNode> trees:

% cargo bench --bench tree_node

visit old tall tree     time:   [377.79 µs 380.87 µs 385.39 µs]
                        change: [-3.2937% -2.2943% -1.1798%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low mild
  5 (5.00%) high severe

visit new tall tree     time:   [349.62 µs 351.73 µs 353.89 µs]
                        change: [-3.9519% -3.2991% -2.6432%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  5 (5.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
  6 (6.00%) high severe

Benchmarking visit old wide tree: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.0s, or reduce sample count to 10.
visit old wide tree     time:   [875.28 ms 879.02 ms 882.86 ms]
                        change: [-1.2651% -0.7020% -0.0989%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking visit new wide tree: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 93.1s, or reduce sample count to 10.
visit new wide tree     time:   [913.97 ms 915.78 ms 917.58 ms]
                        change: [-0.0531% +0.3441% +0.6875%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

It looks like the 2 implementations are comparable. Maybe the recursive approach + stacker is a bit faster when it comes to large trees (879 vs 915 ms). (But as we discussed, most likely Expr is where the iterative approach would suffer...)

But overall, I feel I'm still leaning towards stacker for a few reasons:

  • On recursive data structures, recursive algorithms are natural; I find it much easier to follow the old API implementation than the new one.
  • I have no concerns with stacker / recursive, and they seem to provide a general way to deal with deeply nested recursive calls. These crates can also be useful in areas other than TreeNodes.
  • The apply / visit / transform / rewrite APIs are the most common ones for processing trees, but let's not forget about rules that organize their own tree recursion via apply_children / map_children in a recursive manner. I do like that we provide these lower-level APIs for unique use cases while they can keep their natural recursive implementations. This wouldn't be possible with this PR; the rules themselves would need to be rewritten in an iterative way.
  • As we discussed, there are trees like Expr where the iterative approach would be much clumsier. Sure, we could deconstruct a node and store its reconstruction closure on the stack, then reconstruct it once we have transformed its children, but I feel the current map_children is so much cleaner in this case.
  • There could be extreme cases with TreeNodes as well. I'm thinking of the _with_subqueries() methods of LogicalPlan, in which we switch between 2 kinds of trees (LogicalPlan / Expr) as many times as needed. Stack overflow can happen because multiple different trees are nested into each other; this can be handled trivially using stacker, but with an iterative approach you would need to store multiple kinds of TreeNodes in your heap-based stack, which is again not that straightforward (see the sketch below).
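Regarding the sketch mentioned in the last point: an iterative traversal over two mutually nested tree kinds needs a single worklist whose frames can hold either kind, roughly like the following (heavily simplified made-up types, not DataFusion's LogicalPlan / Expr):

```rust
// Two tree types that can nest into each other, loosely mirroring how a
// LogicalPlan can contain Expr subqueries and an Expr can contain plans.
enum Plan {
    Scan,
    Filter { predicate: Expr, inputs: Vec<Plan> },
}

enum Expr {
    Literal(i64),
    Subquery(Vec<Plan>),
}

// The explicit stack has to hold frames of both kinds.
enum Frame<'a> {
    Plan(&'a Plan),
    Expr(&'a Expr),
}

// Iterative visit over the mixed tree: count nodes of each kind.
fn count_nodes(root: &Plan) -> (usize, usize) {
    let (mut plans, mut exprs) = (0, 0);
    let mut stack = vec![Frame::Plan(root)];
    while let Some(frame) = stack.pop() {
        match frame {
            Frame::Plan(p) => {
                plans += 1;
                if let Plan::Filter { predicate, inputs } = p {
                    stack.push(Frame::Expr(predicate));
                    for input in inputs {
                        stack.push(Frame::Plan(input));
                    }
                }
            }
            Frame::Expr(e) => {
                exprs += 1;
                if let Expr::Subquery(subplans) = e {
                    for plan in subplans {
                        stack.push(Frame::Plan(plan));
                    }
                }
            }
        }
    }
    (plans, exprs)
}
```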

@blaginin (Contributor, Author) commented Nov 2, 2024

It looks like the 2 implementations are comparable. Maybe the recursive approach + stacker is a bit faster when it comes to large trees (879 vs 915 ms).

That's true! I also checked your benchmark on different tree heights
(attached plot of the benchmark results across different tree heights)

Regarding your reasoning on stacker, I think we both agree that a library implementation would be less awkward, but the "cost of support" argument still makes sense to me. @alamb, I think it should be your decision since you made the original point about recursion – I'm happy to submit a PR with stacker so we can further test it if needed.

@blaginin (Contributor, Author) commented Nov 2, 2024

And one more question on a slightly related topic. Regardless of the option we choose here, we’ll still have the same "triangles" on the profiling charts, where we don’t optimize the next item until we finish optimizing the previous one.

(attached screenshot of the profiling chart)

Could the actual issue be not the stack size but the execution order in general? Surely some optimisation rules could be applied to all children and subqueries in parallel, potentially giving a significant performance boost on compound queries?
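Purely to illustrate the idea (hypothetical Plan/optimize; whether any real rule could safely be parallelised is exactly the open question), a rule could in principle fan out over its children with a data-parallel library such as rayon:

```rust
use rayon::prelude::*;

struct Plan {
    children: Vec<Plan>,
}

// Hypothetical bottom-up rule: children are rewritten in parallel before the
// parent itself is processed.
fn optimize(plan: Plan) -> Plan {
    let children: Vec<Plan> = plan.children.into_par_iter().map(optimize).collect();
    // ... rewrite the parent using the already-optimized `children` ...
    Plan { children }
}
```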

Sorry if you’ve already discussed this anywhere, but I couldn’t find anything on GitHub. I get that not all optimisation rules can run in parallel, but if this challenge was taken into account, it feels like the whole context of this discussion would look pretty different 🤔

@peter-toth (Contributor)

Are you referring to visiting / transforming a node's children in parallel on multiple threads? Unfortunately, I'm not sure either whether this idea was discussed earlier.

I remember thinking about it when we consolidated the tree traversal semantics and formed the current APIs. But as far as I can recall, I didn't find many TreeNode API use cases that would have benefited from parallel processing, and those that did would have required significant changes. But I might be wrong, and others may have more insight into existing rules / transformations in DataFusion. (Honestly, I'm more familiar with Spark, and there are no parallel tree-processing APIs there.) Anyway, if we wanted to offer such APIs, I would not change the current ones but add completely new ones.

@berkaysynnada (Contributor)

What are our plans for this PR? Do we all agree to apply these changes? If so, I would like to do a final detailed review.

@peter-toth (Contributor)

Well, I'm still on the side of keeping the recursive way and using stacker / recursive (like @blaginin implemented in this commit: blaginin@9eaca84), if we can all rely on that crate. My points are in this comment: #13177 (comment).

@blaginin (Contributor, Author) commented Nov 5, 2024

Hey @berkaysynnada 👋 Yes, Peter is right - I think we should decide on the approach before starting the actual review. I tagged Andrew above because he was the one who raised the original point about recursion. When he has time to review, I'll either work on this PR or start a new one with stacker (or do something completely different 😅)

@alamb (Contributor) commented Nov 6, 2024

I'll try and take a closer look later today

@alamb (Contributor) commented Nov 7, 2024

Well, I'm still on the side of keeping the recursive way and using stacker / recursive (like @blaginin implemented in this commit: blaginin@9eaca84), if we can all rely on that crate. My points are in this comment: #13177 (comment).

I am very worried about using stacker or some other similar crate -- while it sounds like it solves all the problems easily I have the following concerns:

  1. It is not widely used (only 53 dependencies: https://crates.io/crates/stacker/reverse_dependencies)
  2. It was dormant for over 2 years and only recently has started getting updates: https://crates.io/crates/stacker/versions
  3. I worry it will cause some sort of hard to diagnose issue that appears only under load / memory pressure / etc

@alamb (Contributor) left a comment

First of all, thank you very much @blaginin and @berkaysynnada and @peter-toth -- this is a fairly epic PR.

My feeling after reading this PR is that a very large amount of the complexity arises from ExecutionPlans not having ownership of their children (aka they are Arc<dyn ExecutionPlan> rather than Box<dyn ExecutionPlan>).

If we could actually modify the ExecutionPlans (take children, add new children), I think this code would likely be much simpler as well as more performant. We could also potentially offer in-place mutation APIs that would be even faster.

What do people think about trying to push through a change to avoid Arc<dyn ExecutionPlan>?
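To make that concrete, a rough sketch of the kind of owned-children API being alluded to (purely illustrative, not an actual DataFusion trait; shown with a recursive driver for brevity, but an iterative driver could use the same two methods):

```rust
// Hypothetical owned-children plan node: children can be taken out and put
// back without cloning, which is what makes rewrites cheap.
trait OwnedPlan: Sized {
    /// Detach the children, leaving the node childless.
    fn take_children(self) -> (Self, Vec<Self>);
    /// Reattach the (possibly rewritten) children.
    fn with_new_children(self, children: Vec<Self>) -> Self;
}

// Bottom-up rewrite: take the children, rewrite them, reattach, then rewrite
// the parent. No Arc unwrapping or cloning is involved anywhere.
fn transform_up<P: OwnedPlan>(plan: P, f: &mut dyn FnMut(P) -> P) -> P {
    let (node, children) = plan.take_children();
    let children: Vec<P> = children
        .into_iter()
        .map(|child| transform_up(child, &mut *f))
        .collect();
    f(node.with_new_children(children))
}
```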

Has anyone verified that this change fixes the issue described in #9373?


/// Detaches children from the parent node (if possible).
/// Unlike [`ConcreteTreeNode`], the children might not actually be removed from the parent.
fn take_children(self) -> (Self, Vec<Self>);
Contributor

Could these methods be moved into TreeNode itself? It feels like it just adds complication to implement a separate set of APIs that are related but not quite the same.

@blaginin (Contributor, Author)

I removed RecursiveNode in 403230a by moving its logic into ConcreteTreeNode (though I’m not sure if this could be done directly in TreeNode).

I also linked DynTreeNode and ConcreteTreeNode. This change might be debatable, so I’m happy to discuss further if needed :)

@blaginin (Contributor, Author) commented Nov 7, 2024

Has anyone verified that this change fixes the issue described in #9373?

This PR doesn't resolve that specific issue with Logical Plans, although it does fix a problem with processing large Physical Plans. I think more work is needed to fix the original issue (including the child ownership you mentioned), but this PR is already quite big.

@peter-toth (Contributor)

I am very worried about using stacker or some other similar crate -- while it sounds like it solves all the problems easily I have the following concerns:

  1. It is not widely used (only 53 dependencies: https://crates.io/crates/stacker/reverse_dependencies)
  2. It was dormant for over 2 years and only recently has started getting updates: https://crates.io/crates/stacker/versions
  3. I worry it will cause some sort of hard to diagnose issue that appears only under load / memory pressure / etc

The fact that the Rust compiler uses stacker doesn't alleviate your concerns? I mean, it might have been dormant for some time, but as long as the compiler needs it, it seems safe to use.

My feeling after reading this PR is that a very large amount of the complexity arises from ExecutionPlans not having ownership of their children (aka they are Arc<dyn ExecutionPlan> rather than Box<dyn ExecutionPlan>).

If we could actually modify the ExecutionPlans (take children, add new children), I think this code would likely be much simpler as well as more performant. We could also potentially offer in-place mutation APIs that would be even faster.

What do people think about trying to push through a change to avoid Arc<dyn ExecutionPlan>?

I think that's a bit of a different topic: both the current recursive and the proposed iterative approaches could benefit if taking a node's children and reattaching them were possible. I don't think this is the main differentiator between the two.

I think what we need to decide is if we can rely on stacker:

  • If we can, then we can keep the IMHO cleaner recursive implementation with minimal changes.
  • If we cannot, then we need to switch to the iterative approach. But fixing all possible stack overflows will be a long journey with iterative implementations. The difficulties will arise with the last 3 points of my comment: changing the custom recursive rules to avoid using map_children, implementing take_children / with_new_children for Expr, and finally mixed tree traversal (e.g. LogicalPlan's ..._with_subqueries() methods).

@peter-toth (Contributor) commented Nov 8, 2024

Has anyone verified that this change fixes the issue described in #9373?

Actually, this is a good question. I've just run this PR locally on the attached example and I got a stack overflow. The root cause this time is not the TreeNode APIs but the SQL-to-logical-plan conversion:

* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f607280)
    frame #0: 0x0000000100b246d4 datafusion-cli`datafusion_sql::set_expr::_$LT$impl$u20$datafusion_sql..planner..SqlToRel$LT$S$GT$$GT$::set_expr_to_plan::h807844852e1e805e(self=0x0000000000000000, set_expr=<unavailable>, planner_context=0x0000000000000000) at set_expr.rs:24
   21  	use sqlparser::ast::{SetExpr, SetOperator, SetQuantifier};
   22
   23  	impl<'a, S: ContextProvider> SqlToRel<'a, S> {
-> 24  	    pub(super) fn set_expr_to_plan(
   25  	        &self,
   26  	        set_expr: SetExpr,
   27  	        planner_context: &mut PlannerContext,
Target 0: (datafusion-cli) stopped.
(lldb) thread backtrace
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f607280)
  * frame #0: 0x0000000100b246d4 datafusion-cli`datafusion_sql::set_expr::_$LT$impl$u20$datafusion_sql..planner..SqlToRel$LT$S$GT$$GT$::set_expr_to_plan::h807844852e1e805e(self=0x0000000000000000, set_expr=<unavailable>, planner_context=0x0000000000000000) at set_expr.rs:24
    frame #1: 0x0000000100b24a98 datafusion-cli`datafusion_sql::set_expr::_$LT$impl$u20$datafusion_sql..planner..SqlToRel$LT$S$GT$$GT$::set_expr_to_plan::h807844852e1e805e(self=0x000000016fd8eef8, set_expr=SetExpr @ 0x000000016f60fc68, planner_context=0x000000016fd5ec10) at set_expr.rs:38:33
    frame #2: 0x0000000100b24a98 datafusion-cli`datafusion_sql::set_expr::_$LT$impl$u20$datafusion_sql..planner..SqlToRel$LT$S$GT$$GT$::set_expr_to_plan::h807844852e1e805e(self=0x000000016fd8eef8, set_expr=SetExpr @ 0x000000016f613a28, planner_context=0x000000016fd5ec10) at set_expr.rs:38:33
    frame #3: 0x0000000100b24a98 datafusion-cli`datafusion_sql::set_expr::_$LT$impl$u20$datafusion_sql..planner..SqlToRel$LT$S$GT$$GT$::set_expr_to_plan::h807844852e1e805e(self=0x000000016fd8eef8, set_expr=SetExpr @ 0x000000016f6177e8, planner_context=0x000000016fd5ec10) at set_expr.rs:38:33
...

As far as I can see, the issue was reported with DataFusion 36.0.0, and this problematic recursive set_expr_to_plan might have appeared since then, so it hides the originally reported TreeNode API issue. Or my debug build might cause me to run into set_expr_to_plan before the TreeNode APIs kick in.

@blaginin (Contributor, Author) commented Nov 8, 2024

Hey, I can rewrite set_expr_to_plan to be iterative in a separate PR (as this one is quite big already), but I think it still won't resolve the fundamental issue of recursive tree traversal (so the discussion above is still relevant).

@peter-toth (Contributor)

Meanwhile I opened #13310 to test how the change would look using stacker and recursive.
I was able to successfully run the blowout.sql example of #9373 with that PR.

@alamb (Contributor) commented Nov 8, 2024

I worry it will cause some sort of hard to diagnose issue that appears only under load / memory pressure / etc
The fact that the Rust compiler uses stacker doesn't alleviate your concerns? I mean, it might have been dormant for some time, but as long as the compiler needs it, it seems safe to use.

I did not realize it was used by the rust compiler.

I just did some research and verified it is indeed used that way:

https://github.com/rust-lang/rust/blob/209799f3b910c64c8bd5001c0a8a55e03e7c2614/compiler/rustc_data_structures/src/stack.rs#L21

https://github.com/search?q=repo%3Arust-lang/rust%20ensure_sufficient_stack&type=code

Based on that new information, I think stacker seems like a reasonable way to go.

@blaginin (Contributor, Author) commented Nov 8, 2024

Thanks @alamb! I think we should then merge #13310 to resolve the issue

@blaginin blaginin closed this Nov 8, 2024
@alamb (Contributor) commented Nov 8, 2024

Sorry for the runaround @blaginin -- we appreciate all the work you have done on this issue

@blaginin (Contributor, Author) commented Nov 8, 2024

It was a really cool discussion here, and I really appreciate all the reviews! Very happy the issue will be solved in a good way!! 😀
