Format Date32 to string given timestamp specifiers #15361
Conversation
Force-pushed 52e1d29 to 1e57ed1
Thanks @friendlymatthew -- something seems strange to me about this PR in that it is catching errors and reformatting. It seems like the allowable format options should be known based on the input data type 🤔
```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result<ColumnarValue> {
     let result = formatter.value(idx).try_to_string();
     match result {
         Ok(value) => results.push(Some(value)),
-        Err(e) => return exec_err!("{}", e),
+        Err(e) => {
+            if data_type == &Date32 {
```
I don't understand why this is handling errors -- shouldn't we just dispatch on type when building the format options?
I don't understand why this is handling errors -- shouldn't we just dispatch on type when building the format options?
Sorry. I assume you mean detect whether the format options contain time specifiers ahead of time and, if so, cast the `Date32` into a `Date64`.

Or are you saying avoid the cast altogether? A `FormatOptions::with_datetime_format()` won't work with `Date32`, as Arrow doesn't consider the `datetime_format` field when formatting a `Date32`:
```rust
#[test]
fn unaffected_format() {
    let date_format = "%Y-%m %H";
    let cast_options = CastOptions {
        safe: false,
        format_options: FormatOptions::default().with_datetime_format(Some(date_format)),
    };
    let array = Date32Array::from(vec![10000, 17890]);
    let res = cast_with_options(&array, &DataType::Utf8, &cast_options);
    dbg!(&res); // StringArray(["1997-05-19", "2018-12-25"])
}
```
Initially, I figured that if the format attempt fails with a `Date32`, it'll retry as a `Date64`. This is similar to how Postgres tries to parse an interval string under the normal standard, and if that errs, it tries to parse the string again using the ISO 8601 standard.
Regardless, I'll be happy to refactor this.
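The parse-then-retry pattern described here maps naturally onto `Result::or_else`. Below is a toy, hypothetical sketch of the idea in plain Rust (a strict decimal parse that falls back to a hex parse on error); it is not DataFusion code:

```rust
// Toy illustration of "try the primary parse, fall back on error":
// first attempt a decimal parse; if that errs, retry the input as hex.
// Hypothetical example, not a DataFusion API.
fn parse_lenient(s: &str) -> Result<i64, std::num::ParseIntError> {
    s.parse::<i64>()
        .or_else(|_| i64::from_str_radix(s.trim_start_matches("0x"), 16))
}
```

Applied to `to_char`, the first attempt would format the `Date32` as-is, and the fallback would cast to `Date64` and format again.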
Right now I find it very hard to understand all the nested error handling.
Perhaps you could pull the retry logic into a function that handles the retry and is well documented (I think your description above makes a lot of sense, but it is not at all clear from the code in my mind)
I took another look and reached a different conclusion from above. I fear 1e57ed1 is performing way too many casts in the retry logic, and it is better to implicitly treat `Date32` as timestamps ahead of time.

`to_char` invokes either `to_char_scalar` or `to_char_array` based on the number of formats provided in the arguments (i.e. N dates with 1 format string or N dates with N format strings). If any of the dates err when formatting, we'd cast all N dates as `Date64`s and redo the body of work.
This would be fine with `to_char_scalar`: a format string with a time specifier would err for any `Date32`, so at worst it'd err, cast, and retry after the first format attempt. But in `to_char_array`, where we're iterating through N format strings, at worst we'd need to recast the entire `Date32` array N times*.
Since we can't avoid the `Date32` to `Date64` cast to support this feature, it's much simpler to check ahead of time and cast the input `Date32` array into a `Date64` array. Plus, all date specifiers are valid in timestamp formatting, so existing features work as expected.
I pushed ec2373a as a proof of concept. But I'd be happy to rework it as you see fit.
*The entire `Date32` array, because for every format string we create a new `ArrayFormatter` by supplying the entire array and specifying the index for the formatter to format:
datafusion/datafusion/functions/src/datetime/to_char.rs, lines 274 to 277 in fdb4e84:

```rust
// this isn't ideal but this can't use ValueFormatter as it isn't independent
// from ArrayFormatter
let formatter = ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?;
let result = formatter.value(idx).try_to_string();
```
But in to_char_array where we're iterating through N format strings, at worst we'd need to recast the entire Date32 array N times.
We could make the `Date32` array `mut` and keep it outside of the loop. Then, when we encounter the first failed format attempt, we'd cast the date array to a `Date64` array and treat the subsequent dates as `Date64`s. This way, we limit the cast to only once per batch.
But this adds a lot of unnecessary code complexity.
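For what it's worth, the once-per-batch cast could be sketched with `Option::get_or_insert_with`, so the widening happens lazily on the first failed attempt and is reused afterwards. This is a plain-Rust sketch with hypothetical names (`format_all`, `needs_time`); the real code would operate on arrow-rs arrays and the actual cast kernel:

```rust
// Date32-like input: i32 days since epoch. Date64-like output: i64 ms since epoch.
const MS_PER_DAY: i64 = 86_400_000;

// `needs_time[i]` stands in for "the i-th format attempt failed on Date32".
fn format_all(days: &[i32], needs_time: &[bool]) -> Vec<i64> {
    let mut wide: Option<Vec<i64>> = None; // the cast result, computed at most once
    let mut out = Vec::with_capacity(days.len());
    for (i, &d) in days.iter().enumerate() {
        if needs_time[i] {
            // first failure triggers the one-time cast; later failures reuse it
            let w = wide
                .get_or_insert_with(|| days.iter().map(|&d| d as i64 * MS_PER_DAY).collect());
            out.push(w[i]);
        } else {
            out.push(d as i64); // date-only path, untouched
        }
    }
    out
}
```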
I thought about this a bit and was mentally toying with the idea of testing the format string(s) for time formats vs. just always casting, and generally I don't think it would be worth it at all. Casting Date32 to Date64 should be incredibly cheap, I think.
*The entire `Date32` array, because for every format string we create a new `ArrayFormatter` by supplying the entire array and specifying the index for the formatter to format (datafusion/datafusion/functions/src/datetime/to_char.rs, lines 274 to 277 in fdb4e84):

```rust
// this isn't ideal but this can't use ValueFormatter as it isn't independent
// from ArrayFormatter
let formatter = ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?;
let result = formatter.value(idx).try_to_string();
```
Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.
Casting Date32 to Date64 should be incredibly cheap I think
I agree; converting involves a cast between primitives and a multiplication.
Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.
I'd be happy to work on this next, if you'd like.
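Concretely, the "cast between primitives and a mult" is the day-to-millisecond conversion: per the Arrow spec, `Date32` stores days since the Unix epoch as `i32`, while `Date64` stores milliseconds as `i64`. A minimal sketch:

```rust
// Date32 = i32 days since 1970-01-01; Date64 = i64 milliseconds since epoch.
// The conversion is a widening cast plus one multiplication per value.
fn date32_to_date64(days: i32) -> i64 {
    days as i64 * 86_400_000
}
```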
Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.
I'd be happy to work on this next, if you'd like.
It would clean things up, but I'm not sure if it would be a measurable performance improvement or not. I'm leaning toward it not being all that noticeable, but I'm often wrong :)
Force-pushed 88d304e to ec2373a
```diff
@@ -220,6 +221,13 @@ fn _to_char_scalar(
     }
 }
 
+// eagerly cast Date32 values to Date64 to support date formatting with time-related specifiers
+// without error.
+if data_type == &Date32 {
```
As @Omega359 says, this will now penalize performance for all existing Date32 columns. Is there any way we can check if the format string contains any time-related specifiers before doing this conversion?
It is possible to check for time formats in the format string; however, I'm not sure that cost would be any less than just doing the conversion up front. I think we need a benchmark to have quantified data vs. assumptions.
Some data from a quick little benchmark I wrote:
test_date32_to_date64_cast_array_1000
time: [311.06 ns 314.48 ns 318.16 ns]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
test_parse_performance_with_time_specifiers_array_1000
time: [343.79 ns 351.64 ns 359.98 ns]
Found 14 outliers among 100 measurements (14.00%)
10 (10.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
test_parse_performance_without_time_specifiers_array_1000
time: [196.59 µs 201.06 µs 206.45 µs]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
test_date32_to_date64_cast_array_1000: just casts from date32 to date64
test_parse_performance_with_time_specifiers_array_1000: parses the formats looking for time specifiers and when found does the cast from date32 to date64
test_parse_performance_without_time_specifiers_array_1000: parses the formats looking for time specifiers (doesn't find anything), no cast
Note that results on my machine will vary ±10% between runs. The check for time specifiers in the format is simple and could be improved, but I think it is enough to show general performance. Note that the list of time specifiers is not complete:
```rust
const TIME_SPECIFIERS: &[&str] = &[
    "%H", "%M", "%S", "%c", "%f", "%k", "%I", "%l", "%p", "%R", "%T", "%x", "%r", "%Z",
];
fn has_time_specifiers(str: &str) -> bool {
    for specifier in TIME_SPECIFIERS {
        if str.contains(specifier) {
            return true;
        }
    }
    false
}
```
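For reference, the same scan can be expressed with iterator combinators; the behavior is identical (the specifier list is repeated here so the snippet is self-contained):

```rust
const TIME_SPECIFIERS: &[&str] = &[
    "%H", "%M", "%S", "%c", "%f", "%k", "%I", "%l", "%p", "%R", "%T", "%x", "%r", "%Z",
];

// true if the format string contains any of the known time-related specifiers
fn has_time_specifiers(s: &str) -> bool {
    TIME_SPECIFIERS.iter().any(|spec| s.contains(spec))
}
```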
A couple of takeaways from this: casting is not as cheap as I thought; however, parsing seems to be even more expensive, though perhaps with some really good optimizations it could be reduced.

I don't see a good way to make this feature have no cost with Date32 && no time specifiers in the format.
I went ahead and did a quick PoC using format parsing; you can see it at https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats (main...Omega359:arrow-datafusion:to_char_date32_with_time_formats). I haven't done a comparison with the benchmark results from main yet.
I was planning on getting to this during the weekend, but wow, this is super awesome @Omega359! Let me know how you want to proceed; I'd be happy to help out with benchmarking.
@friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for date32 data.

Sounds great. I'll have some free time tomorrow to work on this.

@Omega359 would you be open to committing to this branch? This way, I could base the benchmarking work on top of your work. If we decide on a different approach than the experimented one, we could just roll back to a prior commit.

Your branch, you mean? I can try and push a PR to your branch once I figure out how :)

I just sent an invite to collaborate on my fork. From there you should be able to directly push a commit to this branch!

I messed up and pushed directly to your branch vs. a PR, sorry about that @friendlymatthew

Not at all, thank you, and happy to collaborate on this together!
Some benchmark results when comparing main with this branch: (TL;DR) this branch will make the existing date-only formatting slower.

I compared two benchmark functions, which are concerned with formatting dates with date-only specifiers.

From main:

to_char_array_date_only_patterns_1000
time: [136.39 µs 136.68 µs 137.03 µs]

to_char_scalar_date_only_pattern_1000
time: [92.084 µs 95.999 µs 100.12 µs]

And when compared to this branch:

to_char_array_date_only_patterns_1000
time: [173.46 µs 174.21 µs 175.10 µs]
change: [+25.159% +25.881% +26.651%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild

to_char_scalar_date_only_pattern_1000
time: [105.04 µs 109.46 µs 113.80 µs]
change: [+6.6648% +11.410% +16.429%] (p = 0.00 < 0.05)
Performance has regressed.

To reproduce, you can run the two benchmark functions by:

cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000"

Since the benchmark function names clash between main and this branch, you can consider this branch as main. (This branch only contains the benchmark refactor logic.)

cc/ @Omega359
Slowing down existing functionality for current users for some new functionality doesn't sound like a great tradeoff to me -- if users want this behavior, they can always override the to_char implementation. If we can retain the same performance for the existing to_char but add the new feature, I think that is much more reasonable.
It's only slower when date32 is the input type. It's going to be hard to have this functionality for formats with timestamp specifiers but not have some sort of impact on those with just date specifiers. The only way I can think of to handle that is via catching the error when the parsing fails and retrying with date64. To put some context on this:
It may be beneficial to leave the existing code paths undisturbed and decide whether to retry on err. It avoids the overhead of custom format parsing. For example, in `to_char_scalar`:

```rust
let formatter = ArrayFormatter::try_new(array.as_ref(), &format_options)?;
let formatted: Result<Vec<Option<String>>, ArrowError> = (0..array.len())
    .map(|i| {
        if array.is_null(i) {
            Ok(None)
        } else {
            formatter.value(i).try_to_string().map(Some)
        }
    })
    .collect();
// collect() will stop iterating upon the first err, and since we're dealing
// with one format string, it'll halt after the first failed format attempt

if let Ok(formatted) = formatted {
    if is_scalar_expression {
        Ok(ColumnarValue::Scalar(ScalarValue::Utf8(
            formatted.first().unwrap().clone(),
        )))
    } else {
        Ok(ColumnarValue::Array(
            Arc::new(StringArray::from(formatted)) as ArrayRef
        ))
    }
} else {
    // if the format attempt is with a Date32, it's possible the attempt failed
    // because the format string contained time specifiers, so we'll retry as Date64s
    if data_type == &Date32 {
        return to_char_scalar(expression.clone().cast_to(&Date64, None)?, format);
    }
    exec_err!("{}", formatted.unwrap_err())
}
```
FWIW, it's worth implementing the retry logic and benchmarking the performance. I'm happy to do the work. The decision between eager casting and selective retry likely comes down to whether we're willing to slightly slow down existing behavior in exchange for better performance in the new case. But I'm just spitballing.
Force-pushed a855bb0 to c69180b
Force-pushed 76919c4 to 01b1bc0
Note: I think I found a better solution. I'll update this report when ready.

f906af5 is a proof of concept that involves selective retry on err. There are benchmark results below!

Quick overview

The selective retry approach leaves the existing code paths untouched and only casts and retries the format attempts that fail. Another method I considered was casting the input ahead of time, but I don't really like this, because I'd like to cast from Date32 to Date64 only when strictly necessary.

Benchmarks

Selective retry vs. main

The selective retry doesn't disturb the existing code paths. When benchmarking with main, we see little to no variance.

# The selective retry code was benched first. This is the benchmark result when running on main.
to_char_array_date_only_patterns_1000
time: [136.87 µs 137.13 µs 137.49 µs]
change: [-0.9784% -0.7511% -0.4721%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
2 (2.00%) high mild
7 (7.00%) high severe
to_char_scalar_date_only_pattern_1000
time: [101.26 µs 104.97 µs 108.40 µs]
change: [-1.5505% +2.7529% +7.4144%] (p = 0.21 > 0.05)
No change in performance detected.
Eager cast vs. selective retry

Since selective retry doesn't disturb the existing paths, those cases also run much faster when compared to the eager cast approach. Plus, the selective retry approach sees a ~3% improvement in to_char_scalar_1000.

The tradeoff is that the new datetime formatting features suffer: they are slower than the eager cast approach for the contrived case that @Omega359 mentioned here (1000 format attempts containing time specifiers). But this isn't the worst thing in the world.

# The eager cast code was benched first.
# This is the benchmark result when running the selective retry code.

to_char_array_date_only_patterns_1000
time: [137.99 µs 138.20 µs 138.44 µs]
change: [-21.564% -20.959% -20.538%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

to_char_array_datetime_patterns_1000
time: [503.48 µs 506.55 µs 509.87 µs]
change: [+110.95% +112.69% +114.12%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild

to_char_array_mixed_patterns_1000
time: [323.05 µs 325.66 µs 328.11 µs]
change: [+65.744% +66.705% +67.701%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

to_char_scalar_date_only_pattern_1000
time: [96.153 µs 100.02 µs 103.88 µs]
change: [-12.058% -8.2654% -4.3595%] (p = 0.00 < 0.05)
Performance has improved.

to_char_scalar_datetime_pattern_1000
time: [177.38 µs 186.04 µs 194.64 µs]
change: [-11.383% -7.3528% -3.2635%] (p = 0.00 < 0.05)
Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
14 (14.00%) low mild
6 (6.00%) high mild

to_char_scalar_1000 time: [312.03 ns 316.56 ns 320.91 ns]
change: [-4.3628% -3.1215% -1.7750%] (p = 0.00 < 0.05)
Performance has improved.

If you want to run these benchmarks yourself, you can run:

# use this to compare the alternate approaches (eager cast vs. selective retry)
cargo bench --package datafusion-functions "to_char"

# use this to compare approaches to main
cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000"
Thanks for the additional work on this @friendlymatthew! I think this approach is solid - the overhead for the casting is limited to only the cases where the format string includes time-type specifiers, and otherwise the performance is unchanged.
Note: I'm sorry about the super long write-ups. I'm not trying to bikeshed. I was thinking about f906af5 and realized we could do more.

The selective retry approach will extract and retry every failed format attempt individually. We can use the cast even more sparingly by keeping track of consecutive failed format attempts. Suppose you had 1000 format attempts, and a contiguous block of those format attempts erred because they contained datetime specifiers:

|----------|xxxxxxxxxxxxx|-------|
   valid      invalid      valid

b379fe1 keeps track of consecutive failed format attempts, so each contiguous invalid block is handled with a single cast and retry rather than one per failed attempt. This approach also helps with readability and saves LOC.

Benchmarks

It leaves the existing code paths undisturbed like usual. While performance has slightly regressed, I really like how this approach reads.

# I ran the initial selective retry approach first.
# This run includes the new logic mentioned above.
to_char_array_date_only_patterns_1000
time: [136.92 µs 137.15 µs 137.39 µs]
change: [-3.1659% -2.3990% -1.6897%] (p = 0.00 < 0.05)
Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
7 (7.00%) high mild
12 (12.00%) high severe
to_char_array_datetime_patterns_1000
time: [543.24 µs 543.84 µs 544.47 µs]
change: [+4.4460% +6.0837% +7.4847%] (p = 0.00 < 0.05)
Performance has regressed.
to_char_array_mixed_patterns_1000
time: [339.68 µs 342.60 µs 345.66 µs]
change: [-5.3101% -2.9305% -0.8615%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
to_char_scalar_date_only_pattern_1000
time: [95.107 µs 98.972 µs 102.90 µs]
change: [-5.9031% -1.2243% +3.4337%] (p = 0.60 > 0.05)
No change in performance detected.
to_char_scalar_datetime_pattern_1000
time: [171.09 µs 179.74 µs 188.39 µs]
change: [-4.0895% +1.3179% +7.1836%] (p = 0.64 > 0.05)
No change in performance detected.
to_char_scalar_1000 time: [315.63 ns 319.69 ns 323.50 ns]
change: [-7.9240% -6.5685% -5.1341%] (p = 0.00 < 0.05)
Performance has improved.
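The consecutive-failure bookkeeping above can be sketched as grouping adjacent failed indices into contiguous ranges, each of which would then be retried with a single cast. This is a plain-Rust sketch with hypothetical names (the real code tracks failed format attempts, not plain bools):

```rust
// Group consecutive `true` (failed) indices into half-open ranges.
// Each range would be retried with one Date64 cast rather than one per index.
fn failed_runs(failed: &[bool]) -> Vec<std::ops::Range<usize>> {
    let mut runs = Vec::new();
    let mut start = None;
    for (i, &f) in failed.iter().enumerate() {
        match (f, start) {
            (true, None) => start = Some(i), // a run of failures begins
            (false, Some(s)) => {
                runs.push(s..i); // the run ends just before i
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        runs.push(s..failed.len()); // trailing run reaches the end
    }
    runs
}
```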
To put my notes here: I prefer the version before the addition of the consecutive-failures check - simpler and faster.
`to_char(date, timestamp format)` #14536

Rationale for this change

DataFusion currently errs when attempting to format a date using time-related specifiers. However, Postgres supports this feature, as it implicitly treats the date as a timestamp.

Rather than eagerly casting every `Date32` to a `Date64` when calling `to_char`, this commit attempts to first format a `Date32` with the supplied format string. If the formatting fails, we try to reformat as a `Date64`. This way, only format strings with time-related specifiers endure the intermediary cast.

All changes are tested, specifically for two different call sites: `_to_char_scalar` and `_to_char_array`.