Optimize date_trunc function by avoiding allocations#18360
Optimize date_trunc function by avoiding allocations#18360alamb wants to merge 3 commits intoapache:mainfrom
date_trunc function by avoiding allocations#18360Conversation
The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
| (Nanosecond, "minute") => Some(Int64Array::new_scalar(60_000_000_000)), | ||
| (Nanosecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000_000)), | ||
| (Nanosecond, "day") => Some(Int64Array::new_scalar(86_400_000_000_000)), | ||
| let unit: Option<i64> = match (tu, granularity) { |
There was a problem hiding this comment.
The key idea here is to make this code faster by reusing the allocation and operating in place rather than allocating new arrays
date_trunc moredate_trunc function by avoiding allocations
| i.checked_div(unit) | ||
| .ok_or_else(|| exec_datafusion_err!("division overflow")) | ||
| })?; | ||
| let array = try_unary_mut_or_clone(array, |i| { |
There was a problem hiding this comment.
technically speaking, only the first try_unary_mut_or_clone is needed
on the second transformation, we're guaranteed to be the pointer into the array, and the so taking the or_clone path would be an error
There was a problem hiding this comment.
This is true, though I don't know how to represent this in code.
Maybe I could make a second function try_unary_mut_or_error that throws a runtime error 🤔
| fn try_unary_mut_or_clone<F>( | ||
| array: PrimitiveArray<Int64Type>, | ||
| op: F, | ||
| ) -> Result<PrimitiveArray<Int64Type>> | ||
| where | ||
| F: Fn(i64) -> Result<i64>, |
There was a problem hiding this comment.
not really date_trunc specific. can this be made more flexible with a more generous use of generics?
perhaps it could even be in arrow-rs. it makes try_unary_mut significantly more approachable
There was a problem hiding this comment.
yes, I agree -- the try_unary_mut is quite awkward to use. I will see if I can port some of these changes upstream / see what they look like
There was a problem hiding this comment.
- I filed Hard to use
PrimitiveArray::unary_mut,PrimitiveArray:try_unary_mut, etc arrow-rs#8808 to track this idea
|
🤖 |
|
🤖: Benchmark completed Details
|
|
We went with a dfferent approach , and I have filed the follow on ticket |
Which issue does this PR close?
DATE_TRUNCexpression regression for values before epoch #18334Rationale for this change
@mhilton's change to
date_truncto make it correct also potentially slows it downLet's try and recover the performance (and then some) so DataFusion 51 can be both more correct and faster
What changes are included in this PR?
Use
try_unary_opmethods to avoid allocating intermediate arrays to improve performanceAre these changes tested?
functionally by CI
I will run benchmarks on this PR as well
Are there any user-facing changes?