Refactor percentile_cont to clarify support input types#19611
Refactor percentile_cont to clarify support input types#19611Jefffrey merged 2 commits intoapache:mainfrom
percentile_cont to clarify support input types#19611Conversation
| signature: Signature::coercible( | ||
| vec![ | ||
| Coercion::new_implicit( | ||
| TypeSignatureClass::Float, | ||
| vec![TypeSignatureClass::Numeric], | ||
| NativeType::Float64, | ||
| ), | ||
| Coercion::new_implicit( | ||
| TypeSignatureClass::Native(logical_float64()), | ||
| vec![TypeSignatureClass::Numeric], | ||
| NativeType::Float64, | ||
| ), | ||
| ], | ||
| Volatility::Immutable, |
There was a problem hiding this comment.
Here we are much clearer that we expect all expressions to be float64's, letting type coercion handle the casting for use so we can remove internal casts
| } | ||
| } | ||
|
|
||
| fn create_accumulator(&self, args: &AccumulatorArgs) -> Result<Box<dyn Accumulator>> { |
| "percentile_cont does not support input type {}, must be numeric", | ||
| dt | ||
| ), | ||
| DataType::Null => Ok(DataType::Float64), |
There was a problem hiding this comment.
Null types are a little annoying to handle, see #19458
| DataType::Float16 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float16Type, | ||
| >::new(percentile))), | ||
| DataType::Float32 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float32Type, | ||
| >::new(percentile))), | ||
| DataType::Float64 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float64Type, |
There was a problem hiding this comment.
Much more clear what the internal types we operate on are now
| /// in the final evaluation step so that we avoid expensive conversions and | ||
| /// allocations during `update_batch`. | ||
| struct PercentileContAccumulator<T: ArrowNumericType> { | ||
| data_type: DataType, |
There was a problem hiding this comment.
Removing data_type from all accumulators; this was only needed if T was ever a decimal type, to maintain precision/scale. However there was no valid way to have decimals come through to the accumulator anyway, and we want to cast to float anyway, so we can refactor this away
| // Build the result list array | ||
| let list_array = ListArray::new( | ||
| Arc::new(Field::new_list_field(self.data_type.clone(), true)), | ||
| Arc::new(Field::new_list_field(T::DATA_TYPE, true)), |
There was a problem hiding this comment.
Since we don't need to consider decimals we can just replace usages with T::DATA_TYPE
|
|
||
| fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
| // Cast to target type if needed (e.g., integer to Float64) | ||
| let values = if values[0].data_type() != &self.data_type { |
There was a problem hiding this comment.
These are the internal casts we're removing; now input arguments should already be of the right types via type coercion
| let input_dt = args.expr_fields[0].data_type(); | ||
| if input_dt.is_null() { | ||
| return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None)))); | ||
| } |
There was a problem hiding this comment.
let input_dt = args.expr_fields[0].data_type();
if input_dt.is_null() {
return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
}
let percentile = get_percentile(&args)?;
There was a problem hiding this comment.
we can do early return here
There was a problem hiding this comment.
I think it makes more sense to validate the percentile regardless of if datatype is null or not, otherwise could lead to permitting select percentile_cont(null, 2.0)
There was a problem hiding this comment.
Oh I see now, validate is part of get_percentile, makes sense, we can prob optimize it in future so function will do mandatory validation however calc percentile for non null dtypes.
| ); | ||
|
|
||
| let percentile = validate_percentile_expr(&args.exprs[1], "PERCENTILE_CONT")?; | ||
| let percentile = get_percentile(&args)?; |
There was a problem hiding this comment.
we dont need to handle null type here the same as in accumulator?
There was a problem hiding this comment.
For simplicity I made groups_accumulator_supported return false if we have a null datatype input
|
Thanks @comphead |
Which issue does this PR close?
NUMERICS/INTEGERSindatafusion/expr-common/src/type_coercion/aggregates.rs#18092Rationale for this change
Current signature for
percentile_contis quite confusing in what types it accepts. Currently the code suggests it accepts all numeric types, where floats & decimals are maintained as is, integers are cast to float64 internally. However this is misleading asNUMERICSdoes not contain decimals, so currently everything is cast to float.Clean up the code to make this more clear and do various other refactors.
What changes are included in this PR?
Use coercion signature to have type coercion coerce types to float so we can remove internal code related to casting to float or handling non-float types.
Various other refactors.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.