Description
Is your feature request related to a problem or challenge?
One of the things I've been thinking about while working on Utf8View support in UDFs is what exactly DataFusion should support in terms of function signature types. We currently haven't formalized what functions are expected to support, so string functions are inconsistent in what they accept and what they produce.
@alamb also asked whether the level of specialization of a function was indeed required in #13403 (comment) and if a proposal to have guidelines for string functions should be made. This is my attempt at such a proposal.
Describe the solution you'd like
In the context of this proposal, string functions are UDFs that accept and produce strings. This exclusively means UDFs in `functions/string` and `functions/unicode`.
Data arguments are arguments that contain actual data that will be processed.
Config arguments are arguments that hold values that adjust how processing will occur. These could be regexes, a concat separator, etc.
I would like to propose the following for DataFusion:
- String functions MUST accept both scalar and array values for all data arguments (as opposed to config arguments, such as a regex function's 'flags' argument).
- String functions MUST accept scalar values for all config arguments but MAY accept both scalar and array if appropriate for the function.
- String functions MUST accept all valid string types for all data arguments, including `Dict(_, StringType)`. To ease implementation, the type for all data arguments SHOULD be coerced to be the largest type among all the data arguments.
- String functions MAY choose to allow non-contiguous data types for data arguments, but it is NOT RECOMMENDED for functions with 3 or more arguments. Non-contiguous here means data arguments that are not of the same type (for example, example_fn(utf8, LargeUtf8, utf8)). Best practice is to use `Signature::String` or equivalent here.
- String functions MAY choose to output in Utf8View instead of Utf8 if DataFusion is configured with `schema_force_view_types == true`. Otherwise, string functions SHOULD output string results in the same type as the received primary data argument.
- String functions SHOULD rely on type coercion to handle non-string data. For example, concat('ab', 2, 'cc').
- String functions MUST handle non-control Unicode textual character classes unless the function is explicitly designed for a particular character set (ASCII, for example).
- String functions SHOULD NOT attempt to handle Unicode grapheme characters specially unless doing so is directly related to the function's requirements.
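
To make the coercion and output-type rules above a bit more concrete, here is a minimal, self-contained sketch. It does not use DataFusion's or Arrow's actual types: the `StringType` enum, `coerce_data_args`, and `output_type` are hypothetical names for illustration only, and the ordering Utf8 < LargeUtf8 < Utf8View is my assumption about what "largest type" would mean. `Dict(_, StringType)` is left out to keep it short.

```rust
// Hypothetical stand-in for the string types discussed in this proposal.
// This is NOT DataFusion's API. The variant order encodes an assumed
// "size" ordering: Utf8 < LargeUtf8 < Utf8View.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick a single common type for all data arguments: the largest one
/// under the assumed ordering. Returns `None` when there are no arguments.
fn coerce_data_args(args: &[StringType]) -> Option<StringType> {
    args.iter().copied().max()
}

/// Choose the output type. This sketch takes the MAY option and returns
/// Utf8View when the (modelled) `schema_force_view_types` flag is set;
/// otherwise it returns the type of the primary data argument.
fn output_type(primary: StringType, schema_force_view_types: bool) -> StringType {
    if schema_force_view_types {
        StringType::Utf8View
    } else {
        primary
    }
}

fn main() {
    // Mixed ("non-contiguous") data arguments, as in example_fn(utf8, LargeUtf8, utf8).
    let args = [StringType::Utf8, StringType::LargeUtf8, StringType::Utf8];
    let common = coerce_data_args(&args).expect("at least one data argument");

    println!("coerced data arguments to {:?}", common);
    println!("output type: {:?}", output_type(common, false));
    println!("output type with view types forced: {:?}", output_type(common, true));
}
```

With rules like these, a function implementation would only ever see one string type across its data arguments, which is the main reason I'd recommend coercion over supporting every type combination by hand.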
Describe alternatives you've considered
No response