Skip to content

[BUG] 658: add strtok function #1032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

AmosAidoo
Copy link
Contributor

@AmosAidoo AmosAidoo commented Jun 6, 2025

Which issue does this PR close?

Changes

  • Added a new scalar udf, StrtokFunc and implemented logic according to snowflake docs
  • Added a new enum type Utf8LikeArray to make it easy working with different string representations

How was this tested?

  • Added automated tests to validate functionality
  • All existing tests pass

@AmosAidoo
Copy link
Contributor Author

AmosAidoo commented Jun 6, 2025

I have some doubts about this part of the function signature:

TypeSignature::Exact(vec![DataType::Utf8, DataType::Utf8, DataType::Int64]),
TypeSignature::Exact(vec![
    DataType::LargeUtf8,
    DataType::LargeUtf8,
    DataType::Int64,
]),
TypeSignature::Exact(vec![
    DataType::Utf8View,
    DataType::Utf8View,
    DataType::Int64,
]),

When I use this instead:

Coercible(vec![
    Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
    Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
    Coercion::new_exact(TypeSignatureClass::Integer),
]),

calling with one of the string args as NULL fails with this error:

Internal("could not cast array of type Utf8View to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>")

Seems df assigns NULL in this case the type Uft8View and datafusion_common::cast::as_string_array fails with the above error. This then means that if Utf8View finds its way to this function, it will panic.

I have seen a number of ways in which Utf8View is been used in other parts of codebase but I would appreciate some thoughts/assistance here.

@AmosAidoo
Copy link
Contributor Author

Gentle ping @Vedin @ravlio
I would appreciate your review anytime you have some time to spare. Thanks

@AmosAidoo
Copy link
Contributor Author

I have some doubts about this part of the function signature:

TypeSignature::Exact(vec![DataType::Utf8, DataType::Utf8, DataType::Int64]),
TypeSignature::Exact(vec![
    DataType::LargeUtf8,
    DataType::LargeUtf8,
    DataType::Int64,
]),
TypeSignature::Exact(vec![
    DataType::Utf8View,
    DataType::Utf8View,
    DataType::Int64,
]),

When I use this instead:

Coercible(vec![
    Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
    Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
    Coercion::new_exact(TypeSignatureClass::Integer),
]),

calling with one of the string args as NULL fails with this error:

Internal("could not cast array of type Utf8View to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>")

Seems df assigns NULL in this case the type Uft8View and datafusion_common::cast::as_string_array fails with the above error. This then means that if Utf8View finds its way to this function, it will panic.

I have seen a number of ways in which Utf8View is been used in other parts of codebase but I would appreciate some thoughts/assistance here.

Regarding this, I have added a new enum type Utf8LikeArray which abstracts over the different string types. Additionally, datafusion_common::cast::as_string_array only deals with the Utf8 case so I am no longer using that here.
I think this new type will be useful for other UDFs that would have to handle these different string types.

Coercion::new_exact(TypeSignatureClass::Integer) wasn't handling the NULL case so I switched to Coercion::new_exact(TypeSignatureClass::Native(logical_int64())).

All tests pass now and no panics. I have edited the PR description to reflect this.

@rampage644 rampage644 requested review from Vedin and ravlio June 10, 2025 16:46
Copy link
Contributor

@Vedin Vedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks for your contribution. I apologize for the long waiting review time. We were in the process of function crate restructure. Now you can check the updated readme for embucket-functions https://github.com/Embucket/embucket/tree/main/crates/embucket-functions and Implementation guide https://github.com/Embucket/embucket/blob/main/crates/embucket-functions/docs/function_implementation_guide.md. Hope it can improve our processes and code quality in general.
Regarding this PR. Can you please add the function name to implemented_functions.csv and regenerate Unimplemented Functions list or delete this function from generated_snowflake_functions.rs manually. So, this function can be called by our executors. More details about why we need it and how do it properly in in Implementation Guide

@AmosAidoo
Copy link
Contributor Author

Hi! Thanks for your contribution. I apologize for the long waiting review time. We were in the process of function crate restructure. Now you can check the updated readme for embucket-functions https://github.com/Embucket/embucket/tree/main/crates/embucket-functions and Implementation guide https://github.com/Embucket/embucket/blob/main/crates/embucket-functions/docs/function_implementation_guide.md. Hope it can improve our processes and code quality in general. Regarding this PR. Can you please add the function name to implemented_functions.csv and regenerate Unimplemented Functions list or delete this function from generated_snowflake_functions.rs manually. So, this function can be called by our executors. More details about why we need it and how do it properly in in Implementation Guide

Great. Thank you

@AmosAidoo
Copy link
Contributor Author

@Vedin I have made the changes

///
/// This is useful for writing UDFs that work across different string representations.
#[derive(Debug, Clone)]
pub(crate) enum Utf8LikeArray {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravlio Can you please look into this? Is there any other ways to handle such cases? If I remember correctly you have already did some functions with string arrays.

Copy link
Contributor

@ravlio ravlio Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good approach to managing various specific string array types is to encapsulate them within a single enum that exposes common methods.

My implementation is more complex https://github.com/Embucket/embucket/blob/9b157f8623f2b68c01199adcd6e660c948dcb009/crates/embucket-functions/src/string-binary/split.rs

use std::sync::Arc;

// strtok SQL function
// Tokenizes a given string and returns the requested part.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use Doc comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -3525,7 +3519,13 @@ pub const TABLE_FUNCTIONS: &[(&str, FunctionInfo)] = &[
)
.with_docs("https://docs.snowflake.com/en/sql-reference/functions/rest_event_history")
),
("SPLIT_TO_TABLE", FunctionInfo::new(
("RESULT_SCAN", FunctionInfo::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please delete this change with RESULT_SCAN manually for now. It's a bug with registering table_functions and should be fixed soon. For now please exclude it by hand.

@Vedin Vedin self-requested a review June 11, 2025 18:29
@AmosAidoo AmosAidoo force-pushed the 658/strtok branch 2 times, most recently from 9b58500 to d35720f Compare June 12, 2025 18:13
@AmosAidoo AmosAidoo requested a review from ravlio June 16, 2025 08:25
@Vedin Vedin self-requested a review June 16, 2025 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Missing function: strtok
3 participants