Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make 'date_trunc' returns the same type as its input #6654

Merged
merged 9 commits into from
Jun 27, 2023

Conversation

Weijun-H
Copy link
Member

Which issue does this PR close?

Closes #6653

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Jun 12, 2023
@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch from 4cd5917 to 6bed9a4 Compare June 12, 2023 22:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for jumping on this one @Weijun-H -- but I think this one needs a little more work

cc @viirya

@alamb
Copy link
Contributor

alamb commented Jun 14, 2023

marking this one as a draft -- I think if it is merged to main the tests will fail

@alamb alamb marked this pull request as draft June 14, 2023 20:29
@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch from 6bed9a4 to dbad81c Compare June 14, 2023 21:34
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jun 14, 2023
@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch from ca48905 to e9e8817 Compare June 15, 2023 00:51
@Weijun-H Weijun-H marked this pull request as ready for review June 15, 2023 01:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Weijun-H ! Looking much closer

@@ -884,8 +885,13 @@ impl BuiltinScalarFunction {
],
self.volatility(),
),
BuiltinScalarFunction::DateTrunc => Signature::exact(
vec![Utf8, Timestamp(Nanosecond, None)],
BuiltinScalarFunction::DateTrunc => Signature::one_of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this clause seems unreachable given the change above to match BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {

I would personally suggest keeping this change and removing the BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => { change as it make the code more explicy

_date_trunc(TimeUnit::Second, v, granularity.as_str(), f, tz_opt)?
}
ColumnarValue::Array(array) => {
let array_type = array.data_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must be missing something here -- it seems like granularity is not used for the array variants

So like date_trunc(x, 'minutes') will return the same value as date_trunc(x, 'seconds') which doesn't seem right

Maybe that indicates missing test coverage 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use f function pointer, which uses granularity as parameter

ColumnarValue::Scalar(ScalarValue::TimestampSecond(v, tz_opt)) => {
_date_trunc(TimeUnit::Second, v, granularity.as_str(), f, tz_opt)?
}
ColumnarValue::Array(array) => {
Copy link
Member Author

@Weijun-H Weijun-H Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Array, should it return the values based on the input type or just return TimestampNanoSecond consistently 🤔?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Array, should it return the values based on the input type

It should do the same as is done for the Scalar version.

My understanding is that this PR changes the Scalar version to return a type based on the input types (not TimestampNanoSecond) so the array version should as well

alamb
alamb previously approved these changes Jun 16, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me -- and thought @waitingkuo or @viirya ?

let array = array
.iter()
.map(|x| {
f(Some(x.unwrap() * 1_000_000_000))
Copy link
Contributor

@alamb alamb Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this PR doesn't introduce the problem, but it appears that date_trunc panic's on unwrap() 🤔

❯ select date_trunc('hour', '2021-02-01T12:01:02');
+------------------------------------------------------+
| date_trunc(Utf8("hour"),Utf8("2021-02-01T12:01:02")) |
+------------------------------------------------------+
| 2021-02-01T12:00:00                                  |
+------------------------------------------------------+
1 row in set. Query took 0.074 seconds.
❯ select date_trunc('hour', null);
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/alamb/Software/arrow-datafusion/datafusion/physical-expr/src/datetime_expressions.rs:314:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I will file a ticket: #6701

@alamb alamb changed the title fix: 'datatrunc' return inconsistent type Make 'date_trunc' returns the same type as its input Jun 16, 2023
@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch 2 times, most recently from 891c36f to 8981e08 Compare June 16, 2023 15:36
@viirya
Copy link
Member

viirya commented Jun 16, 2023

I will take another look today or tomorrow.

TimeUnit::Millisecond => {
let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000),
"second" => Some(nano.unwrap() / 1_000_000_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"second" => Some(nano.unwrap() / 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000),

Comment on lines 281 to 300
let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000_000),
"millisecond" => Some(nano.unwrap() / 1_000_000),
"microsecond" => Some(nano.unwrap()),
_ => Some(nano.unwrap() / 1_000_000_000),
};
ScalarValue::TimestampSecond(value, tz_opt.clone())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000_000),
"millisecond" => Some(nano.unwrap() / 1_000_000),
"microsecond" => Some(nano.unwrap()),
_ => Some(nano.unwrap() / 1_000_000_000),
};
ScalarValue::TimestampSecond(value, tz_opt.clone())
let value = Some(nano.unwrap() / 1_000_000_000);
ScalarValue::TimestampSecond(value, tz_opt.clone())

As output unit is second, smaller granularity can be truncated simply.

Comment on lines 294 to 306
"millisecond" => Some(nano.unwrap() / 1_000_000),
"microsecond" => Some(nano.unwrap() / 1_000),
_ => Some(nano.unwrap() / 1_000_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. For granularity smaller than millisecond can be truncated simply.

Suggested change
"millisecond" => Some(nano.unwrap() / 1_000_000),
"microsecond" => Some(nano.unwrap() / 1_000),
_ => Some(nano.unwrap() / 1_000_000),
_ => Some(nano.unwrap() / 1_000_000),

TimeUnit::Microsecond => {
let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000),
"second" => Some(nano.unwrap() / 1_000_000 * 1_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"second" => Some(nano.unwrap() / 1_000_000 * 1_000),
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000),

let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000),
"second" => Some(nano.unwrap() / 1_000_000 * 1_000),
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000),
"millisecond" => Some(nano.unwrap() / 1_000_000 * 1_000),

_ => {
let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000),
"second" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000),

let value = match granularity {
"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000),
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000),
"millisecond" => Some(nano.unwrap() / 1_000_000 * 1_000_000),

"minute" => Some(nano.unwrap() / 1_000_000_000 * 1_000_000_000),
"second" => Some(nano.unwrap() / 1_000_000 * 1_000_000),
"millisecond" => Some(nano.unwrap() / 1_000 * 1_000),
"microsecond" => Some(nano.unwrap()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"microsecond" => Some(nano.unwrap()),
"microsecond" => Some(nano.unwrap() / 1_000 * 1_000),

@viirya
Copy link
Member

viirya commented Jun 17, 2023

I'm wondering maybe I misunderstand this function? Because the current change looks not correct mostly to me but all tests passed. 🤔

@Weijun-H
Copy link
Member Author

I'm wondering maybe I misunderstand this function? Because the current change looks not correct mostly to me but all tests passed. 🤔

Seem tests does not cover all cases 🤔

@alamb alamb dismissed their stale review June 20, 2023 18:32

needs more test coverage it seems

@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch from 8981e08 to 18dc784 Compare June 24, 2023 11:04
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 24, 2023
@Weijun-H Weijun-H force-pushed the date_trunc-inconsistent-return-type branch 2 times, most recently from c8878f3 to 43015c5 Compare June 24, 2023 12:27
@Weijun-H Weijun-H requested review from alamb and viirya June 24, 2023 13:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H -- this is looking so close

The only remaining question I have is about date_trunc_test -- otherwise this looks good to go to me

@@ -827,7 +912,7 @@ mod tests {
(
"2020-09-08T13:42:29.190855Z",
"second",
"2020-09-08T13:42:29.000000Z",
"2020-09-08T13:42:29.190855Z",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't seem correct to me -- the original value was correct (truncated to microsecond). Do you know why it was changed? As written it still has microsecond precision I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/arrow-datafusion/blob/b17cf0710938eed1c9f7794d5695bfac89bff580/datafusion/physical-expr/src/datetime_expressions.rs#L217-L223

Previously, the truncation of timestamps to seconds occurred prematurely in date_trunc_single, resulting in inaccurate results. Therefore, I have made the necessary adjustments and deferred the logic to the _date_trunc function. By implementing this change, the granularity of the timestamps is preserved, and truncation is performed at a later stage to ensure accuracy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps then we can update the test to call _date_trunc

            let result = date_trunc_single(granularity, left).unwrap();

to

            let result = _date_trunc(granularity, left).unwrap();

As written is seems to have the wrong answer

Copy link
Member Author

@Weijun-H Weijun-H Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified date_trunc_single, so it will truncate the timestamp to second when the granularity is second. Now the test make sense.

Comment on lines 285 to 287
if value.is_none() {
return Ok(None);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine but I think you can simplify the logic with something like the following (and then avoid all the unwraps)

Suggested change
if value.is_none() {
return Ok(None);
}
let Some(value) = value else {
return Ok(None);
}

@alamb alamb force-pushed the date_trunc-inconsistent-return-type branch from daab764 to d68ae79 Compare June 27, 2023 18:16
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H

I force pushed this branch because it got caught up in the great "main got force-pushed" debacle (see #6775) -- once CI passes I plan to merge it in.

@alamb alamb merged commit ac8f690 into apache:main Jun 27, 2023
Comment on lines +218 to +220
if granularity == "millisecond" || granularity == "microsecond" {
return Ok(value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? Why this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #6654 (comment) is the answer -- perhaps I can remove this to make it easier to understand

Copy link
Contributor

@alamb alamb Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very confusing -- I think it is because for those granularities, the conversion to chrono::DateTime and back is unecessary (as only sec/min/hour/etc are non uniform in width) -- I will try and make it clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is it? #6654 (comment) notes about the truncation of timestamps to seconds. But my question is why pulling "millisecond" and "microsecond" earlier which basically avoids with_nanosecond(0) only, compared with previous code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the utilization of with_nanosecond(0), the timestamp experiences a reduction in granularity at the nanosecond level. But for "millisecond" and "microsecond", we need to keep this granularity and truncate it later.

Copy link
Member

@viirya viirya Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean with_nanosecond also cleans up millisecond and microsecond in the timestamp? Otherwise nanosecond is smaller granularity than millisecond and microsecond. Why reduction at nanosecond level will affect millisecond and microsecond?

E.g, timestamp is 1.001000001 (1s + 1 ms + 1 ns). After with_nanosecond(0), it will be 1.000000000 instead of 1.001000000?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean with_nanosecond also cleans up millisecond and microsecond in the timestamp? Otherwise nanosecond is smaller granularity than millisecond and microsecond. Why reduction at nanosecond level will affect millisecond and microsecond?

Yes, I think that is correct.

I played around with it and we can simplify the logic like this which I think may make it easier to understand (the test still pass). If you like this better (I do) I can make it a PR (follow on to #6783)

diff --git a/datafusion/physical-expr/src/datetime_expressions.rs b/datafusion/physical-expr/src/datetime_expressions.rs
index 3a36e8a489..89ec508024 100644
--- a/datafusion/physical-expr/src/datetime_expressions.rs
+++ b/datafusion/physical-expr/src/datetime_expressions.rs
@@ -215,42 +215,49 @@ fn quarter_month(date: &NaiveDateTime) -> u32 {
 }
 
 fn date_trunc_single(granularity: &str, value: i64) -> Result<i64> {
-    if granularity == "millisecond" || granularity == "microsecond" {
-        return Ok(value);
-    }
+    let value = timestamp_ns_to_datetime(value).ok_or_else(|| {
+        DataFusionError::Execution(format!("Timestamp {value} out of range"))
+    })?;
+
+    let value = Some(value);
 
-    let value = timestamp_ns_to_datetime(value)
-        .ok_or_else(|| {
-            DataFusionError::Execution(format!("Timestamp {value} out of range"))
-        })?
-        .with_nanosecond(0);
     let value = match granularity {
-        "second" => value,
-        "minute" => value.and_then(|d| d.with_second(0)),
+        "millisecond" => value,
+        "microsecond" => value,
+        "second" => value.and_then(|d| d.with_nanosecond(0)),
+        "minute" => value
+            .and_then(|d| d.with_nanosecond(0))
+            .and_then(|d| d.with_second(0)),
         "hour" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0)),
         "day" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0))
             .and_then(|d| d.with_hour(0)),
         "week" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0))
             .and_then(|d| d.with_hour(0))
             .map(|d| d - Duration::seconds(60 * 60 * 24 * d.weekday() as i64)),
         "month" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0))
             .and_then(|d| d.with_hour(0))
             .and_then(|d| d.with_day0(0)),
         "quarter" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0))
             .and_then(|d| d.with_hour(0))
             .and_then(|d| d.with_day0(0))
             .and_then(|d| d.with_month(quarter_month(&d))),
         "year" => value
+            .and_then(|d| d.with_nanosecond(0))
             .and_then(|d| d.with_second(0))
             .and_then(|d| d.with_minute(0))
             .and_then(|d| d.with_hour(0))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think it looks clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +291 to +313
let result = match tu {
TimeUnit::Second => match granularity {
"minute" => Some(nano / 1_000_000_000 / 60 * 60),
_ => Some(nano / 1_000_000_000),
},
TimeUnit::Millisecond => match granularity {
"minute" => Some(nano / 1_000_000 / 1_000 / 60 * 1_000 * 60),
"second" => Some(nano / 1_000_000 / 1_000 * 1_000),
_ => Some(nano / 1_000_000),
},
TimeUnit::Microsecond => match granularity {
"minute" => Some(nano / 1_000 / 1_000_000 / 60 * 60 * 1_000_000),
"second" => Some(nano / 1_000 / 1_000_000 * 1_000_000),
"millisecond" => Some(nano / 1_000 / 1_000 * 1_000),
_ => Some(nano / 1_000),
},
_ => match granularity {
"minute" => Some(nano / 1_000_000_000 / 60 * 1_000_000_000 * 60),
"second" => Some(nano / 1_000_000_000 * 1_000_000_000),
"millisecond" => Some(nano / 1_000_000 * 1_000_000),
"microsecond" => Some(nano / 1_000 * 1_000),
_ => Some(nano),
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

date_trunc always returns Timestamp(Nanosecond, None) which might truncate ranges
3 participants