-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make 'date_trunc' returns the same type as its input #6654
Make 'date_trunc' returns the same type as its input #6654
Conversation
4cd5917
to
6bed9a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marking this one as a draft -- I think if it is merged to main the tests will fail |
6bed9a4
to
dbad81c
Compare
ca48905
to
e9e8817
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Weijun-H ! Looking much closer
@@ -884,8 +885,13 @@ impl BuiltinScalarFunction { | |||
], | |||
self.volatility(), | |||
), | |||
BuiltinScalarFunction::DateTrunc => Signature::exact( | |||
vec![Utf8, Timestamp(Nanosecond, None)], | |||
BuiltinScalarFunction::DateTrunc => Signature::one_of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this clause seems unreachable given the change above to match BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
I would personally suggest keeping this change and removing the BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
change as it make the code more explicy
_date_trunc(TimeUnit::Second, v, granularity.as_str(), f, tz_opt)? | ||
} | ||
ColumnarValue::Array(array) => { | ||
let array_type = array.data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must be missing something here -- it seems like granularity
is not used for the array variants
So like date_trunc(x, 'minutes')
will return the same value as date_trunc(x, 'seconds')
which doesn't seem right
Maybe that indicates missing test coverage 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use f
function pointer, which uses granularity
as parameter
ColumnarValue::Scalar(ScalarValue::TimestampSecond(v, tz_opt)) => { | ||
_date_trunc(TimeUnit::Second, v, granularity.as_str(), f, tz_opt)? | ||
} | ||
ColumnarValue::Array(array) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Array, should it return the values based on the input type or just return TimestampNanoSecond
consistently 🤔?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Array, should it return the values based on the input type
It should do the same as is done for the Scalar version.
My understanding is that this PR changes the Scalar version to return a type based on the input types (not TimestampNanoSecond) so the array version should as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me -- and thought @waitingkuo or @viirya ?
let array = array | ||
.iter() | ||
.map(|x| { | ||
f(Some(x.unwrap() * 1_000_000_000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this PR doesn't introduce the problem, but it appears that date_trunc
panic's on unwrap()
🤔
❯ select date_trunc('hour', '2021-02-01T12:01:02');
+------------------------------------------------------+
| date_trunc(Utf8("hour"),Utf8("2021-02-01T12:01:02")) |
+------------------------------------------------------+
| 2021-02-01T12:00:00 |
+------------------------------------------------------+
1 row in set. Query took 0.074 seconds.
❯ select date_trunc('hour', null);
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/alamb/Software/arrow-datafusion/datafusion/physical-expr/src/datetime_expressions.rs:314:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I will file a ticket: #6701
891c36f
to
8981e08
Compare
I will take another look today or tomorrow. |
TimeUnit::Millisecond => { | ||
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000), | ||
"second" => Some(nano.unwrap() / 1_000_000_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"second" => Some(nano.unwrap() / 1_000_000_000), | |
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000), |
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000_000), | ||
"millisecond" => Some(nano.unwrap() / 1_000_000), | ||
"microsecond" => Some(nano.unwrap()), | ||
_ => Some(nano.unwrap() / 1_000_000_000), | ||
}; | ||
ScalarValue::TimestampSecond(value, tz_opt.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let value = match granularity { | |
"minute" => Some(nano.unwrap() / 1_000_000_000), | |
"second" => Some(nano.unwrap() / 1_000_000_000), | |
"millisecond" => Some(nano.unwrap() / 1_000_000), | |
"microsecond" => Some(nano.unwrap()), | |
_ => Some(nano.unwrap() / 1_000_000_000), | |
}; | |
ScalarValue::TimestampSecond(value, tz_opt.clone()) | |
let value = Some(nano.unwrap() / 1_000_000_000); | |
ScalarValue::TimestampSecond(value, tz_opt.clone()) |
As output unit is second, smaller granularity can be truncated simply.
"millisecond" => Some(nano.unwrap() / 1_000_000), | ||
"microsecond" => Some(nano.unwrap() / 1_000), | ||
_ => Some(nano.unwrap() / 1_000_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. For granularity smaller than millisecond can be truncated simply.
"millisecond" => Some(nano.unwrap() / 1_000_000), | |
"microsecond" => Some(nano.unwrap() / 1_000), | |
_ => Some(nano.unwrap() / 1_000_000), | |
_ => Some(nano.unwrap() / 1_000_000), |
TimeUnit::Microsecond => { | ||
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000 * 1_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"second" => Some(nano.unwrap() / 1_000_000 * 1_000), | |
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000), |
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000 * 1_000), | ||
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000), | |
"millisecond" => Some(nano.unwrap() / 1_000_000 * 1_000), |
_ => { | ||
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000), | |
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000), |
let value = match granularity { | ||
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000), | ||
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000), | |
"millisecond" => Some(nano.unwrap() / 1_000_000 * 1_000_000), |
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000), | ||
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000), | ||
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000), | ||
"microsecond" => Some(nano.unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"microsecond" => Some(nano.unwrap()), | |
"microsecond" => Some(nano.unwrap() / 1_000 * 1_000), |
I'm wondering maybe I misunderstand this function? Because the current change looks not correct mostly to me but all tests passed. 🤔 |
Seem tests does not cover all cases 🤔 |
8981e08
to
18dc784
Compare
c8878f3
to
43015c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Weijun-H -- this is looking so close
The only remaining question I have is about date_trunc_test
-- otherwise this looks good to go to me
@@ -827,7 +912,7 @@ mod tests { | |||
( | |||
"2020-09-08T13:42:29.190855Z", | |||
"second", | |||
"2020-09-08T13:42:29.000000Z", | |||
"2020-09-08T13:42:29.190855Z", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change doesn't seem correct to me -- the original value was correct (truncated to microsecond). Do you know why it was changed? As written it still has microsecond precision I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the truncation of timestamps to seconds occurred prematurely in date_trunc_single
, resulting in inaccurate results. Therefore, I have made the necessary adjustments and deferred the logic to the _date_trunc
function. By implementing this change, the granularity of the timestamps is preserved, and truncation is performed at a later stage to ensure accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps then we can update the test to call _date_trunc
let result = date_trunc_single(granularity, left).unwrap();
to
let result = _date_trunc(granularity, left).unwrap();
As written is seems to have the wrong answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified date_trunc_single
, so it will truncate the timestamp to second when the granularity is second. Now the test make sense.
if value.is_none() { | ||
return Ok(None); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works fine but I think you can simplify the logic with something like the following (and then avoid all the unwraps)
if value.is_none() { | |
return Ok(None); | |
} | |
let Some(value) = value else { | |
return Ok(None); | |
} |
daab764
to
d68ae79
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if granularity == "millisecond" || granularity == "microsecond" { | ||
return Ok(value); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #6654 (comment) is the answer -- perhaps I can remove this to make it easier to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this very confusing -- I think it is because for those granularities, the conversion to chrono::DateTime and back is unecessary (as only sec/min/hour/etc are non uniform in width) -- I will try and make it clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is it? #6654 (comment) notes about the truncation of timestamps to seconds. But my question is why pulling "millisecond" and "microsecond" earlier which basically avoids with_nanosecond(0)
only, compared with previous code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the utilization of with_nanosecond(0)
, the timestamp experiences a reduction in granularity at the nanosecond level. But for "millisecond" and "microsecond", we need to keep this granularity and truncate it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean with_nanosecond
also cleans up millisecond and microsecond in the timestamp? Otherwise nanosecond is smaller granularity than millisecond and microsecond. Why reduction at nanosecond level will affect millisecond and microsecond?
E.g, timestamp is 1.001000001 (1s + 1 ms + 1 ns). After with_nanosecond(0)
, it will be 1.000000000 instead of 1.001000000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean with_nanosecond also cleans up millisecond and microsecond in the timestamp? Otherwise nanosecond is smaller granularity than millisecond and microsecond. Why reduction at nanosecond level will affect millisecond and microsecond?
Yes, I think that is correct.
I played around with it and we can simplify the logic like this which I think may make it easier to understand (the test still pass). If you like this better (I do) I can make it a PR (follow on to #6783)
diff --git a/datafusion/physical-expr/src/datetime_expressions.rs b/datafusion/physical-expr/src/datetime_expressions.rs
index 3a36e8a489..89ec508024 100644
--- a/datafusion/physical-expr/src/datetime_expressions.rs
+++ b/datafusion/physical-expr/src/datetime_expressions.rs
@@ -215,42 +215,49 @@ fn quarter_month(date: &NaiveDateTime) -> u32 {
}
fn date_trunc_single(granularity: &str, value: i64) -> Result<i64> {
- if granularity == "millisecond" || granularity == "microsecond" {
- return Ok(value);
- }
+ let value = timestamp_ns_to_datetime(value).ok_or_else(|| {
+ DataFusionError::Execution(format!("Timestamp {value} out of range"))
+ })?;
+
+ let value = Some(value);
- let value = timestamp_ns_to_datetime(value)
- .ok_or_else(|| {
- DataFusionError::Execution(format!("Timestamp {value} out of range"))
- })?
- .with_nanosecond(0);
let value = match granularity {
- "second" => value,
- "minute" => value.and_then(|d| d.with_second(0)),
+ "millisecond" => value,
+ "microsecond" => value,
+ "second" => value.and_then(|d| d.with_nanosecond(0)),
+ "minute" => value
+ .and_then(|d| d.with_nanosecond(0))
+ .and_then(|d| d.with_second(0)),
"hour" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0)),
"day" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0))
.and_then(|d| d.with_hour(0)),
"week" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0))
.and_then(|d| d.with_hour(0))
.map(|d| d - Duration::seconds(60 * 60 * 24 * d.weekday() as i64)),
"month" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0))
.and_then(|d| d.with_hour(0))
.and_then(|d| d.with_day0(0)),
"quarter" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0))
.and_then(|d| d.with_hour(0))
.and_then(|d| d.with_day0(0))
.and_then(|d| d.with_month(quarter_month(&d))),
"year" => value
+ .and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0))
.and_then(|d| d.with_minute(0))
.and_then(|d| d.with_hour(0))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think it looks clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let result = match tu { | ||
TimeUnit::Second => match granularity { | ||
"minute" => Some(nano / 1_000_000_000 / 60 * 60), | ||
_ => Some(nano / 1_000_000_000), | ||
}, | ||
TimeUnit::Millisecond => match granularity { | ||
"minute" => Some(nano / 1_000_000 / 1_000 / 60 * 1_000 * 60), | ||
"second" => Some(nano / 1_000_000 / 1_000 * 1_000), | ||
_ => Some(nano / 1_000_000), | ||
}, | ||
TimeUnit::Microsecond => match granularity { | ||
"minute" => Some(nano / 1_000 / 1_000_000 / 60 * 60 * 1_000_000), | ||
"second" => Some(nano / 1_000 / 1_000_000 * 1_000_000), | ||
"millisecond" => Some(nano / 1_000 / 1_000 * 1_000), | ||
_ => Some(nano / 1_000), | ||
}, | ||
_ => match granularity { | ||
"minute" => Some(nano / 1_000_000_000 / 60 * 1_000_000_000 * 60), | ||
"second" => Some(nano / 1_000_000_000 * 1_000_000_000), | ||
"millisecond" => Some(nano / 1_000_000 * 1_000_000), | ||
"microsecond" => Some(nano / 1_000 * 1_000), | ||
_ => Some(nano), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct.
Which issue does this PR close?
Closes #6653
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?