Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_date function #9019

Merged
merged 1 commit into from
Feb 29, 2024
Merged

Add to_date function #9019

merged 1 commit into from
Feb 29, 2024

Conversation

Tangruilin
Copy link
Contributor

@Tangruilin Tangruilin commented Jan 27, 2024

Which issue does this PR close?

Closes #8987 .

Rationale for this change

What changes are included in this PR?

Add to_date function, and change the protobuf file

Are these changes tested?

Not now, but it should be added in this PR.

Are there any user-facing changes?

Add a function named to_date, users can use select to_date(1243124); to try it.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Jan 27, 2024
@@ -1813,6 +1814,7 @@ pub fn parse_expr(
ScalarFunction::StructFun => {
Ok(struct_fun(parse_expr(&args[0], registry)?))
}
ScalarFunction::ToDate => Ok(struct_fun(parse_expr(&args[0], registry)?)), //TODO: ensure how to fill it
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I am not sure how to do with it, I need some suggestion

@Tangruilin Tangruilin marked this pull request as draft January 27, 2024 10:02
@Tangruilin
Copy link
Contributor Author

@Omega359
Hey! you can look it if you have time, I need you suggestion about the protobuf file and if there are something that i do not think about.

If it looks ok, I will write test code these days.

Thanks very much

@Omega359
Copy link
Contributor

I haven't had time to look it over yet but first thing I did see was the use of Date64 - I can't see any good reason to use that vs Date32

@Tangruilin
Copy link
Contributor Author

I haven't had time to look it over yet but first thing I did see was the use of Date64 - I can't see any good reason to use that vs Date32

Date64 contains more number in my opinion. Maybe Date32 is better?

@@ -1813,6 +1814,7 @@ pub fn parse_expr(
ScalarFunction::StructFun => {
Ok(struct_fun(parse_expr(&args[0], registry)?))
}
ScalarFunction::ToDate => Ok(struct_fun(parse_expr(&args[0], registry)?)), //TODO: ensure how to fill it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be

Suggested change
ScalarFunction::ToDate => Ok(struct_fun(parse_expr(&args[0], registry)?)), //TODO: ensure how to fill it
ScalarFunction::ToDate => Ok(to_date(parse_expr(&args[0], registry)?)),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 389 to 393
// fn to_date_impl<T: ArrowTimestampType + ScalarType<i64>>(
// args: &[ColumnarValue],
// ) -> Result<ColumnarValue> {
// }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// fn to_date_impl<T: ArrowTimestampType + ScalarType<i64>>(
// args: &[ColumnarValue],
// ) -> Result<ColumnarValue> {
// }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Omega359
Copy link
Contributor

Date64 contains more number in my opinion. Maybe Date32 is better?

    /// A signed 32-bit date representing the elapsed time since UNIX epoch (1970-01-01)
    /// in days (32 bits).
    Date32,
    /// A signed 64-bit date representing the elapsed time since UNIX epoch (1970-01-01)
    /// in milliseconds (64 bits). Values are evenly divisible by 86400000.
    Date64,

Date32 is the way.

@Tangruilin
Copy link
Contributor Author

Date64 contains more number in my opinion. Maybe Date32 is better?

    /// A signed 32-bit date representing the elapsed time since UNIX epoch (1970-01-01)
    /// in days (32 bits).
    Date32,
    /// A signed 64-bit date representing the elapsed time since UNIX epoch (1970-01-01)
    /// in milliseconds (64 bits). Values are evenly divisible by 86400000.
    Date64,

Date32 is the way.

Okey! I will change it to Date32
I may finish it these two days

@Tangruilin Tangruilin force-pushed the task#8987 branch 2 times, most recently from 474fb34 to 1a8ff01 Compare January 30, 2024 13:26
@Tangruilin Tangruilin marked this pull request as ready for review January 30, 2024 13:27
@Tangruilin
Copy link
Contributor Author

@alamb @Omega359 This PR is ready to be reviewed

Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Tangruilin I left some minor suggestions.

@@ -1308,6 +1349,40 @@ fn validate_to_timestamp_data_types(
None
}

// TODO: 实现这个函数
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: 实现这个函数

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/// to_date SQL function implementation
pub fn to_date_invoke(args: &[ColumnarValue]) -> Result<ColumnarValue> {
if args.is_empty() {
return internal_err!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return internal_err!(
return exec_err!(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that other functions are all using internal_err. If i need to change all of them?

Copy link
Member

@Weijun-H Weijun-H Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #8781 (comment), I vote for exec_err

internal_err: something that wasn't expected/anticipated by the implementation and that is most likely a bug (the error message even encourages users to open a bug report). I user should not be able to trigger this under normal circumstances. Note that I/O errors (or any error that happens due to external systems) do NOT fall under this category (there's an external error variant for that)

exec_err: a processing error that happens during execution, e.g. the user passed malformed arguments to a SQL method, opened a CSV file that is broken, tried to divide an integer by zero. these errors shouldn't happen for a "good" query + "good" data, but since both the query and the data can easily be broken or misaligned, these are common occurrences in ETL systems / databases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So need I change internal_err in to_timestamp_invoke to exec_err?
Or it should be done in another PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow on pr is better

Comment on lines 1370 to 1378
DataType::Int32 | DataType::Int64 => {
cast_column(&args[0], &DataType::Date32, None)
}
DataType::Null | DataType::Float64 => {
cast_column(&args[0], &DataType::Date32, None)
}
DataType::Date32 | DataType::Date64 => {
cast_column(&args[0], &DataType::Date32, None)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataType::Int32 | DataType::Int64 => {
cast_column(&args[0], &DataType::Date32, None)
}
DataType::Null | DataType::Float64 => {
cast_column(&args[0], &DataType::Date32, None)
}
DataType::Date32 | DataType::Date64 => {
cast_column(&args[0], &DataType::Date32, None)
}
DataType::Int32
| DataType::Int64
| DataType::Null
| DataType::Float64
| DataType::Date32
| DataType::Date64 => cast_column(&args[0], &DataType::Date32, None),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SELECT to_date('01-14-2023 01:01:30+05:30', '%q', '%d-%m-%Y %H/%M/%S', '%+', '%m-%d-%Y %H:%M:%S%#z');
----
2023-01-13

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test for wrong input and return NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the wrong input will return error but not NULL like to_timestamp

|s, format| {
string_to_timestamp_nanos_formatted(s, format)
.map(|n| {
println!("{n}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
println!("{n}");

})
.and_then(|v| {
v.try_into().map_err(|_| {
DataFusionError::NotImplemented("()".to_string())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataFusionError::NotImplemented("()".to_string())
internal_datafusion_err!("Unable to cast to Date32")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.map(|n| n / (1_000_000 * 24 * 60 * 60 * 1_000))
.and_then(|v| {
v.try_into().map_err(|_| {
DataFusionError::NotImplemented("()".to_string())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataFusionError::NotImplemented("()".to_string())
internal_datafusion_err!("Unable to cast to Date32")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Tangruilin Tangruilin force-pushed the task#8987 branch 2 times, most recently from 740e38c to da55c5d Compare January 31, 2024 03:52
2023-01-13

statement error DataFusion error: Internal error: to_date function unsupported data type at index 1: List
SELECT to_date('2022-08-03T14:38:50+05:30', make_array('%s', '%q', '%d-%m-%Y %H:%M:%S%#z', '%+'));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer to this, this is a example for wrong input

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there shjould be more tests of just dates without time or tz component. Need test to verify null, invalid formats, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a test for invalid dates like

to_date('21311111')

I think the tests here could be copied and adapted to cover to_date quite well:

https://github.com/apache/arrow-datafusion/blob/7641a3228156aab0e48c4bab5a6834b44f722d89/datafusion/sqllogictest/test_files/timestamps.slt#L2073-L2229

nary_scalar_expr!(
ToDate,
to_date,
"converts a string and optional formats to a `Date32`"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"converts a string and optional formats to a `Date32`"
"converts string to date according to the given format"

Comment on lines 401 to 414
1 => handle::<Date32Type, _, Date32Type>(
args,
|s| {
string_to_timestamp_nanos_shim(s)
.map(|n| n / (1_000_000 * 24 * 60 * 60 * 1_000))
.and_then(|v| {
v.try_into().map_err(|_| {
internal_datafusion_err!("Unable to cast to Date32 for converting from i64 to i32 failed")
})
})
},
name,
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use to_timestamp_impl to avoid code duplication.

Suggested change
1 => handle::<Date32Type, _, Date32Type>(
args,
|s| {
string_to_timestamp_nanos_shim(s)
.map(|n| n / (1_000_000 * 24 * 60 * 60 * 1_000))
.and_then(|v| {
v.try_into().map_err(|_| {
internal_datafusion_err!("Unable to cast to Date32 for converting from i64 to i32 failed")
})
})
},
name,
),
1 => to_timestamp_impl::<TimestampNanosecondType>(args, name),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not do that

for the reason the Date32 need i32 type and time_stamp need i64 type.

to_timestamp_impl did not deal with the converting from i64 to i32

@Tangruilin Tangruilin force-pushed the task#8987 branch 2 times, most recently from 0eade48 to bbf1b36 Compare February 1, 2024 04:03
@Tangruilin
Copy link
Contributor Author

@Tangruilin - Do you think you'll have time to complete this PR before v36 is released?

You can review it now

@Tangruilin
Copy link
Contributor Author

@Omega359
You can review this PR now, thanks

@alamb
Copy link
Contributor

alamb commented Feb 21, 2024

@Omega359 You can review this PR now, thanks

@Omega359 once you have reviewed this PR please let me know and I'll give it a final look too and hopefully merge it

Thank you for your work @Tangruilin

Copy link
Contributor

@Omega359 Omega359 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you work on this @Tangruilin! I left a few comments, mostly minor but 1 issue around the rustdoc where it'll unfortunately have to have 'ignore' added. You have some repeated tests still in dates.slt that should be cleaned up then I believe it's ready to go.

use datafusion::error::Result;
use datafusion::prelude::*;

/// This example demonstrates how to use the to_timestamp series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_timestamp -> to_date

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// This example demonstrates how to use the to_timestamp series
/// This example demonstrates how to use the to_date series

ctx.register_batch("t", batch)?;
let df = ctx.table("t").await?;

// use to_timestamp function to convert col 'a' to timestamp type using the default parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_timestamp -> to_date, convert 'a to date type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// use to_timestamp function to convert col 'a' to timestamp type using the default parsing
// use to_date function to convert col 'a' to timestamp type using the default parsing

/// use datafusion::arrow::datatypes::{DataType, Field, Schema};
/// use datafusion::arrow::record_batch::RecordBatch;
/// use datafusion::error::Result;
/// use datafusion::prelude::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we'll have to add an ignore for this because of #9277 See https://github.com/apache/arrow-datafusion/pull/9279/files#diff-67ae8808785b2e651767d7ff67dd7c53be04ca098857b52c82ed19927e071cdaR514 for an example. I believe once the functions are move to the new dataframe-functions crate we should be able to re-enable.

/// ctx.register_batch("t", batch)?;
/// let df = ctx.table("t").await?;

/// // use to_timestamp function to convert col 'a' to timestamp type using the default parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp => date

|n| n,
"to_date",
),
_ => internal_err!("Unsupported 0 argument count for function to_date"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an exec_err (See #9241 for details)

| DataType::Date64 => cast_column(&args[0], &DataType::Date32, None),
DataType::Utf8 => to_date(args),
other => {
internal_err!("Unsupported data type {:?} for function to_date", other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exec_err

@Tangruilin Tangruilin force-pushed the task#8987 branch 5 times, most recently from e5adc84 to 4262b5c Compare February 27, 2024 16:14
@github-actions github-actions bot added the core Core DataFusion crate label Feb 27, 2024
@Tangruilin
Copy link
Contributor Author

@Omega359 The cargo test --doc command is passed on my mac.
But the CI fail. I do not know why

@Omega359
Copy link
Contributor

@Omega359 The cargo test --doc command is passed on my mac. But the CI fail. I do not know why

I think it's because you are attempting to include /// # use datafusion::prelude::*; which isn't imported in that module. You can't import it either because of #9277. As mentioned in my review you'll have to ignore that doc test until we move the functions out of this module to the new datafusion/functions area.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks good to me -- thank you @Tangruilin and @Omega359

There are several cleanup suggestions left outstanding from @Omega359 related to fixing up repeated tests / comments. Is this something you have time to fix @Tangruilin before we merge the PR? Otherwise we can do it as a follow on PR

use datafusion::error::Result;
use datafusion::prelude::*;

/// This example demonstrates how to use the to_timestamp series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// This example demonstrates how to use the to_timestamp series
/// This example demonstrates how to use the to_date series

ctx.register_batch("t", batch)?;
let df = ctx.table("t").await?;

// use to_timestamp function to convert col 'a' to timestamp type using the default parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// use to_timestamp function to convert col 'a' to timestamp type using the default parsing
// use to_date function to convert col 'a' to timestamp type using the default parsing

@Tangruilin
Copy link
Contributor Author

I think this PR looks good to me -- thank you @Tangruilin and @Omega359

There are several cleanup suggestions left outstanding from @Omega359 related to fixing up repeated tests / comments. Is this something you have time to fix @Tangruilin before we merge the PR? Otherwise we can do it as a follow on PR

I think i can fix it in this PR

@Tangruilin Tangruilin force-pushed the task#8987 branch 2 times, most recently from b7e202f to da8c3b7 Compare February 28, 2024 16:54
@github-actions github-actions bot removed the core Core DataFusion crate label Feb 28, 2024
@Tangruilin
Copy link
Contributor Author

I think this PR looks good to me -- thank you @Tangruilin and @Omega359
There are several cleanup suggestions left outstanding from @Omega359 related to fixing up repeated tests / comments. Is this something you have time to fix @Tangruilin before we merge the PR? Otherwise we can do it as a follow on PR

I think i can fix it in this PR

Done

Comment on lines 177 to 195
# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')

# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')

# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')

# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')

# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')
# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')
# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')
# to_date with invalid formatting
query error input contains invalid characters
SELECT to_date('2020-09-08 12/00/00+00:00', '%c', '%+')
# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')

Comment on lines 201 to 179
# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')

# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')

# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')
# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')
# to_date with broken formatting
query error bad or unsupported format string
SELECT to_date('2020-09-08 12/00/00+00:00', '%q')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done,you can review

Signed-off-by: tangruilin <tang.ruilin@foxmail.com>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @Tangruilin and @Omega359 -- the feature and the review is really appreciated. Nice work

@alamb alamb merged commit de902de into apache:main Feb 29, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a to_date function
4 participants