Skip to content

ScalarUDF: Remove supports_zero_argument and avoid creating null array for empty args #10205

@jayzhan211

Description

@jayzhan211

Is your feature request related to a problem or challenge?

File an issue to discuss about design in #10193, since it is no longer a minor change

Previously, we always provided a null array if the function supports zero args, like random(), pi(), make_array().
I think the additional null array should not provided for all the function that supports 0 args, instead handle case by case in each function. It turns out the null array is not useful either. random, pi, and uuid takes the number of rows instead of the actual null array.

We need to design an alternative way to communicate the number of rows to the function.

Proposal 1:
Add support_randomness -> bool to ScalarUDFImpl, and we provide the number of rows as the first argument for invoke.

Proposal 2:
We always provide the number of rows to invoke.

Change from

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue>;

to

/// batch_rows is the number of rows in each batch
fn invoke(&self, args: &[ColumnarValue], batch_rows: usize) -> Result<ColumnarValue>;
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        let inputs = self
            .args
            .iter()
            .map(|e| e.evaluate(batch))
            .collect::<Result<Vec<_>>>()?;

        // evaluate the function
        match self.fun {
            ScalarFunctionDefinition::UDF(ref fun) => fun.invoke(&inputs, batch.num_rows()),
            ScalarFunctionDefinition::Name(_) => {
                internal_err!(
                    "Name function must be resolved to one of the other variants prior to physical planning"
                )
            }
        }
    }

It is more aggressive and breaks the signature, but batch is part of the evaluate(), provide information about batch and args to invoke() makes sense to me.

Proposol 3 from alamb
Introduce another function with batch rows only.

fn invoke_no_args(number_rows: usize) -> {
  not_yet_impl_err!
}

#10205 (comment)

We need support_randomness -> bool, so we know to switch to invoke_no_args.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

I decide to go for 3, since there are already over 100 invoke() used, not worth to break the signature

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions