Skip to content

Cache the result of Expr::data_type / Expr::nullable / Expr::to_field to speed up planning #18845

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Many parts of the planning process require knowing the type, nullability and recently the field information for each Expr

The implementation of these methods recomputes the type information on demand, for example

fn get_type(&self, schema: &dyn ExprSchema) -> Result<DataType> {

This is slow for several reasons:

  1. The computation is re-computed many times [during](repo:apache/datafusion nullable) planning
  2. The computations themselves are sometimes non trivial (e.g. what @pepijnve added in #17801 Improve nullability reporting of case expressions #17813 for case expressions)
  3. The calculations have to walk the entire expression tree, each time even when they don't change

I have some small evidence that this type computation takes a non trivial amount of planning time, as for example, when I removed some of the cloning in to_field in this PR, I saw a significant improvement in planning time (2-3%)

Describe the solution you'd like

I would like to find some way to avoid recomputing the results of these calculations for Expressions

Describe alternatives you've considered

DataFusion already has a similar pattern for ExecutionPlan where the properties are computed once and then each ExecutionPlan must report the properties.

We also need to be careful about rolling this change incrementally rather than as one giant PR out as it will be a potentially disruptive API change (as Exprs are used all over the place)

So one way to cache this would be to add a method like

/// Cached result of computing type for 
struct ExprProperties {
  field: FieldRef,
  table_reference: Option<TableReference>
}

Then add

trait ExprSchemable {
  /// return cached properties for the Expr, if available
  fn try_properties(&self) -> Option<&ExprProperties>;
}

impl Expr {
  fn get_type(&self) -> DataType {
    // first check if we have pre-computed results
    if let Some(properties) = self.try_properties() {
      return properties.field.data_type()
    }
    // otherwise compute as normal
  }

  // add similar checks for nullable and to_field
}

Then we could add an ExprProperties field to various Exprs over multiple PRs and roll out the cached values

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions