-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
Many parts of the planning process require knowing the type, nullability and recently the field information for each Expr
The implementation of these methods recomputes the type information on demand, for example
datafusion/datafusion/expr/src/expr_schema.rs
Line 109 in 4f95e6d
| fn get_type(&self, schema: &dyn ExprSchema) -> Result<DataType> { |
This is slow for several reasons:
- The computation is re-computed many times [during](repo:apache/datafusion nullable) planning
- The computations themselves are sometimes non trivial (e.g. what @pepijnve added in #17801 Improve nullability reporting of case expressions #17813 for case expressions)
- The calculations have to walk the entire expression tree, each time even when they don't change
I have some small evidence that this type computation takes a non trivial amount of planning time, as for example, when I removed some of the cloning in to_field in this PR, I saw a significant improvement in planning time (2-3%)
Describe the solution you'd like
I would like to find some way to avoid recomputing the results of these calculations for Expressions
Describe alternatives you've considered
DataFusion already has a similar pattern for ExecutionPlan where the properties are computed once and then each ExecutionPlan must report the properties.
- https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.properties
- https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html
We also need to be careful about rolling this change incrementally rather than as one giant PR out as it will be a potentially disruptive API change (as Exprs are used all over the place)
So one way to cache this would be to add a method like
/// Cached result of computing type for
struct ExprProperties {
field: FieldRef,
table_reference: Option<TableReference>
}Then add
trait ExprSchemable {
/// return cached properties for the Expr, if available
fn try_properties(&self) -> Option<&ExprProperties>;
}
impl Expr {
fn get_type(&self) -> DataType {
// first check if we have pre-computed results
if let Some(properties) = self.try_properties() {
return properties.field.data_type()
}
// otherwise compute as normal
}
// add similar checks for nullable and to_field
}Then we could add an ExprProperties field to various Exprs over multiple PRs and roll out the cached values
Additional context
- See conversation with @pepijnve on #17801 Improve nullability reporting of case expressions #17813 (comment)