Skip to content

Conversation

@velvia
Copy link
Contributor

@velvia velvia commented Jun 15, 2021

Which issue does this PR close?

Closes #355 .

Rationale for this change

This PR implements the timestamp functions above in #355, basically extending the following capabilities for all Timestamp types (specifically for Millis, micros, and seconds) via the above named functions:

  • Ability to convert timestamp strings to Timestamp type
  • Ability to cast Int64 numeric arrays into equivalent Timestamp types

What changes are included in this PR?

Pretty straightforward additions to functions.rs, datetime_expressions etc.

What is NOT included so far but could be depending on discussion:

  • Not sure where to add documentation for these functions
  • It might be useful to users to be able to cast different Timestamp types into others, for example, Timestamp-Nanos to Timestamp-Millis, and that would be straightforward to add as well.

Are there any user-facing changes?

Three new user-facing SQL functions.

@velvia
Copy link
Contributor Author

velvia commented Jun 15, 2021

/cc @alamb

@alamb
Copy link
Contributor

alamb commented Jun 16, 2021

BTW I plan to review this later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me. Thank you @velvia

One thing I noticed while testing was that there wasn't a convenient way to cast a timestamp (nanoseconds) -- one had to go through in64.

So for example I can't apply to_timestamp_millis to the output of to_timestamp():

> select to_timestamp('2020-09-08T12:00:00+00:00')
> ;
+------------------------------------------+
| TimestampNanosecond(1599566400000000000) |
+------------------------------------------+
| 2020-09-08 12:00:00                      |
+------------------------------------------+
1 row in set. Query took 0.002 seconds.

> select to_timestamp_millis(to_timestamp('2020-09-08T12:00:00+00:00'));
Internal("Unsupported data type Ok(Timestamp(Nanosecond, None)) for function to_timestamp_millis")

However, there is a (very ugly) workaround using cast:

select to_timestamp_millis((cast(to_timestamp('2020-09-08T12:00:00+00:00') as bigint)) / 1000000);
+---------------------------------------------------------------------+
| totimestampmillis(Int64(1599566400000000000) Divide Int64(1000000)) |
+---------------------------------------------------------------------+
| 2020-09-08 12:00:00                                                 |
+---------------------------------------------------------------------+

}

/// Internal cast function for casting ColumnarValue -> ColumnarValue for cast_type
pub fn cast_array(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this would be more clearly named cast_column or cast_columnar_value (as it might be casting an array or a Scalar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

BuiltinScalarFunction::ToTimestamp => Signature::Uniform(1, vec![DataType::Utf8]),
BuiltinScalarFunction::ToTimestampMillis => Signature::Uniform(
1,
vec![DataType::Utf8, DataType::LargeUtf8, DataType::Int64],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These signatures claim to support arguments of typeLargeUtf8 but LargeUtf8 isn't handled in the implementations above. I think we should make them consistent (I don't have a preference about actually supporting LargeUtf, btw)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I actually don't understand what LargeUTF is about, but since ToTimestamp only supports regular Utf8, is it OK if all of the timestamp functions just support regular Utf8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I actually don't understand what LargeUTF is about,

Utf8uses i32 to store offsets and thus can store strings up to 2GB. LargeUtf8 stores string offsets using u64s and thus can support strings that are larger than 2GB. I do not know how widely LargeUtf8 is used in practice.

is it OK if all of the timestamp functions just support regular Utf8?

yes I think so

1599568949190855000, // 2020-09-08T12:42:29.190855+00:00
1599565349190855000,
]; // 2020-09-08T11:42:29.190855+00:00
let divisor = match A::get_time_unit() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@velvia
Copy link
Contributor Author

velvia commented Jun 16, 2021

@alamb thanks for the review.

One thing I noticed while testing was that there wasn't a convenient way to cast a timestamp (nanoseconds) -- one had to go through in64.

I think I'll go ahead and add support for the use case of to_timestamp_xxx casting from other timestamp columns. I can imagine that being very useful, for us at least. :)

Also, I had asked about docs - I think the to_timestamp_xxx functions should be documented somewhere, since they are nonstandard wrt most SQL specs. The main README doesn't list explicitly supported timestamp functions, which we could, but I'm not sure anywhere else they could go. There is docs/specification, and then there is a user guide which looks like it is in an early stage of progress. Suggestions welcome!

@alamb
Copy link
Contributor

alamb commented Jun 16, 2021

Also, I had asked about docs - I think the to_timestamp_xxx functions should be documented somewhere, since they are nonstandard wrt most SQL specs. The main README doesn't list explicitly supported timestamp functions, which we could, but I'm not sure anywhere else they could go. There is docs/specification, and then there is a user guide which looks like it is in an early stage of progress. Suggestions welcome!

There is a list of supported functions here:
https://github.com/apache/arrow-datafusion#sql-support

Up to now, we have mostly been following the postgres functions and thus didn't need to document functions within DataFusion.

I suggest:

  1. start a section in the docs/userguide for DataFusion specific functions
  2. Add entries for to_timestamp_millis() in the list at https://github.com/apache/arrow-datafusion#sql-support linked to the user guide

@velvia
Copy link
Contributor Author

velvia commented Jun 16, 2021

I'm getting a very odd local compile error after merging with master latest:

error[E0432]: unresolved import `arrow::compute::kernels::partition`
  --> datafusion/src/physical_plan/mod.rs:28:30
   |
28 | use arrow::compute::kernels::partition::lexicographical_partition_ranges;
   |                              ^^^^^^^^^ could not find `partition` in `kernels`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0432`.
error: could not compile `datafusion`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error[E0432]: unresolved import `arrow::compute::kernels::partition`
  --> datafusion/src/physical_plan/mod.rs:28:30
   |
28 | use arrow::compute::kernels::partition::lexicographical_partition_ranges;
   |                              ^^^^^^^^^ could not find `partition` in `kernels`

error: aborting due to previous error

@alamb
Copy link
Contributor

alamb commented Jun 16, 2021

I'm getting a very odd local compile error after merging with master latest:

@velvia I think you need to do cargo update locally to pick up the latest release of arrow (4.3)

We should probably update the version requirements in datafusion to reflect that it requires arrow 4.3 or later (not 4.0 or later)

@velvia
Copy link
Contributor Author

velvia commented Jun 16, 2021

@alamb thanks, I can update cargo.toml for datafusion, that did the trick

@codecov-commenter
Copy link

codecov-commenter commented Jun 16, 2021

Codecov Report

Merging #567 (2b746c7) into master (51e5445) will increase coverage by 0.20%.
The diff coverage is 95.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #567      +/-   ##
==========================================
+ Coverage   76.02%   76.23%   +0.20%     
==========================================
  Files         156      156              
  Lines       27063    27280     +217     
==========================================
+ Hits        20575    20796     +221     
+ Misses       6488     6484       -4     
Impacted Files Coverage Δ
datafusion/src/physical_plan/expressions/binary.rs 89.47% <ø> (ø)
datafusion/src/physical_plan/expressions/mod.rs 71.42% <ø> (ø)
datafusion/src/physical_plan/expressions/nullif.rs 86.88% <ø> (ø)
datafusion/src/physical_plan/functions.rs 92.35% <85.07%> (-0.32%) ⬇️
...tafusion/src/physical_plan/datetime_expressions.rs 70.90% <100.00%> (+3.61%) ⬆️
datafusion/src/physical_plan/expressions/cast.rs 91.02% <100.00%> (+3.37%) ⬆️
datafusion/src/scalar.rs 57.34% <100.00%> (+1.15%) ⬆️
datafusion/tests/sql.rs 99.33% <100.00%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51e5445...2b746c7. Read the comment docs.

@velvia
Copy link
Contributor Author

velvia commented Jun 16, 2021

@alamb I believe all changes and feedback have been addressed. I added the capability of to_timestamp_xx() to convert from other timestamp types for convenience, and also added documentation. :)

| `BOOLEAN` | `Boolean` |
| `DATE` | `Date32` |
| `TIME` | `Time64(TimeUnit::Millisecond)` |
| `TIMESTAMP` | `Timestamp(TimeUnit::Nanosecond)` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


## `to_timestamp`

`to_timestamp()` is similar to the standard SQL function. It performs conversions to type `Timestamp(Nanoseconds, None)`, from:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great docs

@alamb
Copy link
Contributor

alamb commented Jun 19, 2021

Thanks @velvia -- this looks great!

@alamb alamb merged commit 5900b4c into apache:master Jun 19, 2021
@velvia velvia deleted the evan/to_timestamp_millis branch June 21, 2021 17:37
@houqp houqp added datafusion enhancement New feature or request labels Jul 30, 2021
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement to_timestamp_millis etc

4 participants