Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Boxes with Arc in the Expr enum. #9577

Open
mustafasrepo opened this issue Mar 12, 2024 · 17 comments
Open

Replace Boxes with Arc in the Expr enum. #9577

mustafasrepo opened this issue Mar 12, 2024 · 17 comments
Labels
enhancement New feature or request

Comments

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Mar 12, 2024

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

According to following stackoverflow discussion. Boxs can deep copy when called with .clone() method (according to .clone() implementation of the underlying type.).

For Box<Expr> this is the case. I think this usage might be the reason of some deep stack usages seen during the planning.

See related issues: #9375, #8837.

I think, replacing Box<Expr> usages with Arc<Expr> under the enum Expr would improve performance. I am not familiar with the implications of these two approaches in other places. I wonder what community thinks about this change. Would it be better, unnecessary, etc?

Describe alternatives you've considered

No response

Additional context

No response

@mustafasrepo mustafasrepo added the enhancement New feature or request label Mar 12, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 12, 2024

Is it possible to remove Box to lower the cost? It seems we need to deal with recursive types

Summary for my note:

  1. whether Box<T> does shallow copy or deep copy depends on T, not always deep copy.
  2. Rc<T> / Arc<T> has overhead so whether they are faster than Box<T> may need benchmarks to ensure that.

@mustafasrepo
Copy link
Contributor Author

Is it possible to remove Box to lower the cost? It seems we need to deal with recursive types

Summary for my note:

  1. whether Box<T> does shallow copy or deep copy depends on T, not always deep copy.
  2. Rc<T> / Arc<T> has overhead so whether they are faster than Box<T> may need benchmarks to ensure that.

Exactly. I also think that there are pros and cons of these approaches. For recursive types, I think deep clone problem is more evident.

@comphead
Copy link
Contributor

It would be an interesting experiment,
@jayzhan211 please elaborate how Box does shallow copy? I checked
https://github.com/rust-lang/rust/blob/3cbb93223f33024db464a4df27a13c7cce870173/library/alloc/src/boxed.rs#L1306

It has its own .clone for &str and [] but I didnt see how the same type can be switched between shallow and deep copy.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 14, 2024

It would be an interesting experiment, @jayzhan211 please elaborate how Box does shallow copy? I checked https://github.com/rust-lang/rust/blob/3cbb93223f33024db464a4df27a13c7cce870173/library/alloc/src/boxed.rs#L1306

It has its own .clone for &str and [] but I didnt see how the same type can be switched between shallow and deep copy.

I don't know how to demonstrate if it is shallow copy in rust playground
The summary is inferred that Box::clone is doing T::clone, see write_clone_into_raw, it does self.clone() which is T.clone()

#[cfg(not(no_global_oom_handling))]
/// Specialize clones into pre-allocated, uninitialized memory.
/// Used by `Box::clone` and `Rc`/`Arc::make_mut`.
pub(crate) trait WriteCloneIntoRaw: Sized {
    unsafe fn write_clone_into_raw(&self, target: *mut Self);
}

#[cfg(not(no_global_oom_handling))]
impl<T: Clone> WriteCloneIntoRaw for T {
    #[inline]
    default unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // Having allocated *first* may allow the optimizer to create
        // the cloned value in-place, skipping the local and move.
        unsafe { target.write(self.clone()) };
    }
}

#[cfg(not(no_global_oom_handling))]
impl<T: Copy> WriteCloneIntoRaw for T {
    #[inline]
    unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // We can always copy in-place, without ever involving a local value.
        unsafe { target.copy_from_nonoverlapping(self, 1) };
    }
}

Based on https://stackoverflow.com/questions/31012923/what-is-the-difference-between-copy-and-clone, Doc for clone and many random comments. Clone do either shallow copy or deep copy.

Differs from Copy in that Copy is implicit and an inexpensive bit-wise copy, while Clone is always explicit and may or may not be expensive

I think types like &T, Rc, Arc are shallow copy (clone the pointer only not the underlying data)

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

I think, replacing Box usages with Arc under the enum Expr would improve performance.

I basically agree with this assessment but I have an alternate proposal for how to improve performance

  1. I agree that the current clone() of Exprs (and Box::clone does a deep copy)
  2. I believe that the cloneing of Exprs takes a significant amount of planning time in DataFusion (I was looking at cargo bench --bench sql_planner the other day and substantial amounts of time are spent cloning
  3. I think the "ideal" rust pattern for rewriting Exprs is what we have in our ExprRewriters -- take an owned Expr and then destructure / rewrite it as needed
  4. There are many places in our code where Expr::clone() is called Unecessirly
  5. If we changed Expr to use Arc instead of Box I think we would get a planning speedup, but it would be less performant than the ideal pattern (as rewritten nodes would likely rewrite copying, much like LogicalPlan), and it would be a large API change for downstream consumers

Here is an example of what I think is a good pattern (there are no copies except when needed)

https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L638-L648

Here is an example of where Expr cloning is being used unnecessarily

https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L45-L49

Thus my suggestion is to go through the planner and remove the calls to Expr::clone() as much as possible (likely letting cargo bench --bench sql_planner be our guide.

This would avoid any changes required for downstream consumers

@mustafasrepo
Copy link
Contributor Author

Thanks @alamb for your answer. I also think that removing existing unnecessary .clones, replacing them with owned variants is a better approach with keeping the type as is. If you have some preliminary findings such as "Rule x has a lot of .clone with lots of overhead." We can prioritize those sections.

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

I did some profiling locally by running the following

cargo bench --bench sql_planner -- physical_plan_tpch_all

My analysis is that almost 40% of the planning time is spent in SimplifyExprs and CommonSubexprEliminate

I suspect there are many ways we could reduce clones in those passes

Screenshot 2024-03-14 at 11 07 57 AM

@jayzhan211
Copy link
Contributor

I'm interested in optimizing these (avoid clones), btw What is this profiling tool?

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

I'm interested in optimizing these (avoid clones),

❤️

btw What is this profiling tool?

I used Instruments (CPU profiler) that comes as part of XCode on Mac OSX

I have also used hotspot for Linux https://github.com/KDAB/hotspot which has similar capabilities

Maybe I should make a video about "how to profile / interpret stack traces to optimize DataFusion" 🤔

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

BTW I think #9140 would be a good first start as the inlist simplifier both uses clones as well as does (yet another) tree walk

Maybe we could port it over into the main ExprSimplifer loop in a few PRs 🤔

@comphead
Copy link
Contributor

Maybe I should make a video about "how to profile / interpret stack traces to optimize DataFusion" 🤔

Would be great: video or screens, whatever works and we can attach it to DF docs

@alamb
Copy link
Contributor

alamb commented Mar 16, 2024

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@alamb
Copy link
Contributor

alamb commented Mar 16, 2024

I thought about this challenge last night and wrote up my thoughts here: #9637

@alamb
Copy link
Contributor

alamb commented Mar 20, 2024

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

@comphead
Copy link
Contributor

comphead commented Mar 20, 2024

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

I'd say its great, thanks @alamb, the font not always clear though, but the video gives the understanding what should be happening.

Today/tomorrow I'm planning to add a profiling doc for MacOS only, how to do a profiling and build flamegraphs and also include this Youtube link. Unix and Window related contributors can add their part later

@comphead
Copy link
Contributor

#9711

@jayzhan211
Copy link
Contributor

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

@alamb Thanks for your video, I think it is really helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants