Refactor `percentile_cont` to clarify support input types by Jefffrey · Pull Request #19611 · apache/datafusion

Jefffrey · 2026-01-02T14:42:20Z

Which issue does this PR close?

Related to #18092 (Refactor away usage of NUMERICS/INTEGERS in datafusion/expr-common/src/type_coercion/aggregates.rs)

Rationale for this change

The current signature for percentile_cont is confusing about which types it accepts. The code suggests it accepts all numeric types, with floats and decimals kept as is and integers cast to float64 internally. However, this is misleading: NUMERICS does not contain decimals, so in practice everything is cast to float.

Clean up the code to make this clearer, and do various other refactors.

What changes are included in this PR?

Use a coercible signature so that type coercion casts the inputs to float, which lets us remove the internal code for casting to float and for handling non-float types.

Various other refactors.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

Jefffrey · 2026-01-02T14:47:38Z

datafusion/functions-aggregate/src/percentile_cont.rs

+            signature: Signature::coercible(
+                vec![
+                    Coercion::new_implicit(
+                        TypeSignatureClass::Float,
+                        vec![TypeSignatureClass::Numeric],
+                        NativeType::Float64,
+                    ),
+                    Coercion::new_implicit(
+                        TypeSignatureClass::Native(logical_float64()),
+                        vec![TypeSignatureClass::Numeric],
+                        NativeType::Float64,
+                    ),
+                ],
+                Volatility::Immutable,


Here we are much clearer that we expect all expressions to be float64s, letting type coercion handle the casting for us so we can remove the internal casts.
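
To illustrate the coercion rule in the hunk above, here is a minimal std-only sketch. It is not DataFusion code: `SketchType`, `coerce_value_arg`, and `coerce_percentile_arg` are hypothetical names, and the behaviour is an assumption based on reading the two `Coercion::new_implicit` entries (the first argument keeps any float type and casts other numerics to Float64; the second argument always ends up as Float64).

```rust
/// Simplified stand-in for Arrow data types (hypothetical, not DataFusion's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchType {
    Int32,
    Int64,
    Float16,
    Float32,
    Float64,
    Utf8,
}

impl SketchType {
    fn is_float(self) -> bool {
        matches!(
            self,
            SketchType::Float16 | SketchType::Float32 | SketchType::Float64
        )
    }

    fn is_numeric(self) -> bool {
        self.is_float() || matches!(self, SketchType::Int32 | SketchType::Int64)
    }
}

/// Assumed model of the first coercion: floats pass through unchanged,
/// other numerics are implicitly cast to Float64, anything else is rejected.
fn coerce_value_arg(input: SketchType) -> Result<SketchType, String> {
    if input.is_float() {
        Ok(input)
    } else if input.is_numeric() {
        Ok(SketchType::Float64)
    } else {
        Err(format!("cannot coerce {input:?} to a float type"))
    }
}

/// Assumed model of the second coercion: the percentile argument must be
/// numeric and is always coerced to Float64.
fn coerce_percentile_arg(input: SketchType) -> Result<SketchType, String> {
    if input.is_numeric() {
        Ok(SketchType::Float64)
    } else {
        Err(format!("cannot coerce {input:?} to Float64"))
    }
}
```

Under this model, by the time the accumulator is created the value argument can only be one of the three float types, which is why the function body no longer needs its own casts.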

Jefffrey · 2026-01-02T14:47:47Z

datafusion/functions-aggregate/src/percentile_cont.rs

        }
    }
-
-    fn create_accumulator(&self, args: &AccumulatorArgs) -> Result<Box<dyn Accumulator>> {


Inlining this

Jefffrey · 2026-01-02T14:48:16Z

datafusion/functions-aggregate/src/percentile_cont.rs

-                "percentile_cont does not support input type {}, must be numeric",
-                dt
-            ),
+            DataType::Null => Ok(DataType::Float64),


Null types are a little annoying to handle, see #19458

Jefffrey · 2026-01-02T14:48:39Z

datafusion/functions-aggregate/src/percentile_cont.rs

+                DataType::Float16 => Ok(Box::new(DistinctPercentileContAccumulator::<
+                    Float16Type,
+                >::new(percentile))),
+                DataType::Float32 => Ok(Box::new(DistinctPercentileContAccumulator::<
+                    Float32Type,
+                >::new(percentile))),
+                DataType::Float64 => Ok(Box::new(DistinctPercentileContAccumulator::<
+                    Float64Type,


It is now much clearer which internal types we operate on.

Jefffrey · 2026-01-02T14:50:18Z

datafusion/functions-aggregate/src/percentile_cont.rs

 /// in the final evaluation step so that we avoid expensive conversions and
 /// allocations during `update_batch`.
-struct PercentileContAccumulator<T: ArrowNumericType> {
-    data_type: DataType,


Removing data_type from all accumulators. It was only needed if T was ever a decimal type, to maintain precision/scale. However, there was no valid way for decimals to reach the accumulator, and we want to cast to float anyway, so we can refactor this away.

Jefffrey · 2026-01-02T14:50:43Z

datafusion/functions-aggregate/src/percentile_cont.rs

        // Build the result list array
        let list_array = ListArray::new(
-            Arc::new(Field::new_list_field(self.data_type.clone(), true)),
+            Arc::new(Field::new_list_field(T::DATA_TYPE, true)),


Since we don't need to consider decimals, we can just replace usages with T::DATA_TYPE.
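
The idea of reading the type from `T` instead of storing it per instance can be sketched without any Arrow dependency. Everything below is a hypothetical stand-in (`SketchPrimitive`, `SketchDataType`, `SketchAccumulator`, `state_element_type`); the only thing borrowed from the real code is the pattern of an associated `DATA_TYPE` constant on the primitive type trait.

```rust
/// Hypothetical stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchDataType {
    Float32,
    Float64,
}

/// Hypothetical stand-in for a primitive type trait with an associated constant.
trait SketchPrimitive {
    type Native: Copy;
    const DATA_TYPE: SketchDataType;
}

struct SketchFloat32;
struct SketchFloat64;

impl SketchPrimitive for SketchFloat32 {
    type Native = f32;
    const DATA_TYPE: SketchDataType = SketchDataType::Float32;
}

impl SketchPrimitive for SketchFloat64 {
    type Native = f64;
    const DATA_TYPE: SketchDataType = SketchDataType::Float64;
}

/// The accumulator no longer carries a `data_type` field; for float types
/// the type parameter alone determines the element type.
struct SketchAccumulator<T: SketchPrimitive> {
    values: Vec<T::Native>,
}

impl<T: SketchPrimitive> SketchAccumulator<T> {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Element type used when building the state list, taken from `T`.
    fn state_element_type(&self) -> SketchDataType {
        T::DATA_TYPE
    }
}
```

A stored field is only necessary when the type carries runtime parameters, such as a decimal's precision and scale, which a compile-time constant cannot express.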

Jefffrey · 2026-01-02T14:51:06Z

datafusion/functions-aggregate/src/percentile_cont.rs


    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
-        // Cast to target type if needed (e.g., integer to Float64)
-        let values = if values[0].data_type() != &self.data_type {


These are the internal casts we're removing; now input arguments should already be of the right types via type coercion
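
For readers unfamiliar with the accumulator's shape, here is a rough std-only sketch of what remains once the cast is gone: `update_batch` only appends values, and the percentile is computed at evaluation time with linear interpolation. `SketchPercentileCont` is a hypothetical name, and sorting a copy of the values is a simplification chosen for the example, not a description of the real implementation.

```rust
/// Hypothetical, simplified accumulator over already-coerced f64 inputs.
struct SketchPercentileCont {
    percentile: f64,
    values: Vec<f64>,
}

impl SketchPercentileCont {
    fn new(percentile: f64) -> Self {
        Self {
            percentile,
            values: Vec::new(),
        }
    }

    /// No casting here: inputs are assumed to already be floats.
    /// Nulls (modelled as `None`) are skipped.
    fn update_batch(&mut self, batch: &[Option<f64>]) {
        self.values.extend(batch.iter().flatten().copied());
    }

    /// Continuous percentile: interpolate linearly between the two
    /// neighbouring values around the fractional rank.
    fn evaluate(&self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));

        let rank = self.percentile * (sorted.len() - 1) as f64;
        let lower = rank.floor() as usize;
        let upper = rank.ceil() as usize;
        let fraction = rank - lower as f64;

        Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
    }
}
```

With the values 1, 2, 3, 4 and a percentile of 0.5, the fractional rank is 1.5, so the result is halfway between 2 and 3.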

comphead · 2026-01-02T16:26:24Z

datafusion/functions-aggregate/src/percentile_cont.rs

+        let input_dt = args.expr_fields[0].data_type();
+        if input_dt.is_null() {
+            return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
+        }


        let input_dt = args.expr_fields[0].data_type();
        if input_dt.is_null() {
            return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
        }
        let percentile = get_percentile(&args)?;

We can do an early return here.

I think it makes more sense to validate the percentile regardless of whether the data type is null; otherwise we could end up permitting select percentile_cont(null, 2.0).

Oh, I see now: validation is part of get_percentile, which makes sense. We can probably optimize this in the future so that the function always performs the validation but only calculates the percentile for non-null data types.
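
The ordering argument in this thread can be shown with a small std-only sketch. The names (`validate_percentile`, `SketchAccumulatorKind`, `create_accumulator_sketch`) are hypothetical, and the [0, 1] range check is an assumption about what validation involves; the point is only that validating before the null early return rejects `percentile_cont(null, 2.0)`.

```rust
/// Hypothetical validation: a percentile must lie in [0.0, 1.0].
/// NaN is rejected as well, since it is not contained in the range.
fn validate_percentile(p: f64) -> Result<f64, String> {
    if (0.0..=1.0).contains(&p) {
        Ok(p)
    } else {
        Err(format!("percentile must be between 0.0 and 1.0, got {p}"))
    }
}

/// Hypothetical stand-in for the accumulator that would be created.
#[derive(Debug, PartialEq)]
enum SketchAccumulatorKind {
    /// Input column is of the null type: always yields a null Float64.
    Noop,
    /// Regular accumulator with a validated percentile.
    PercentileCont { percentile: f64 },
}

fn create_accumulator_sketch(
    input_is_null: bool,
    percentile: f64,
) -> Result<SketchAccumulatorKind, String> {
    // Validate first so an invalid percentile is an error even for null input.
    let percentile = validate_percentile(percentile)?;

    if input_is_null {
        return Ok(SketchAccumulatorKind::Noop);
    }
    Ok(SketchAccumulatorKind::PercentileCont { percentile })
}
```

If the null check came first, the invalid percentile 2.0 would be silently accepted whenever the input column happened to be of the null type.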

comphead · 2026-01-02T16:28:56Z

datafusion/functions-aggregate/src/percentile_cont.rs

-        );
-
-        let percentile = validate_percentile_expr(&args.exprs[1], "PERCENTILE_CONT")?;
+        let percentile = get_percentile(&args)?;


Do we not need to handle the null type here, the same as in the accumulator?

For simplicity, I made groups_accumulator_supported return false if we have a null data type input.

comphead

Thanks @Jefffrey

Jefffrey · 2026-01-06T01:23:52Z

Thanks @comphead

Refactor percentile_cont to clarify support input types

8c3ffb2

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Jan 2, 2026

Jefffrey commented Jan 2, 2026

Jefffrey marked this pull request as ready for review January 2, 2026 15:05

comphead reviewed Jan 2, 2026

adriangb self-requested a review January 2, 2026 22:45

Simplify percentile_cont simplify to omit casting

d2d14aa

comphead approved these changes Jan 3, 2026

Jefffrey added this pull request to the merge queue Jan 6, 2026

Merged via the queue into apache:main with commit ff38480 Jan 6, 2026
28 checks passed

Jefffrey deleted the refactor-percentile-cont branch January 6, 2026 01:23

Jefffrey mentioned this pull request Jan 6, 2026

fix(accumulators): preserve state in evaluate() for window frame queries #19618

Merged

2 participants
