Public Expr simplification API #1694

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In IOx, each table is broken up logically into chunks (like row groups in parquet files), but the chunks might be missing some columns, and each chunk has its own statistics.

When predicates are applied to scan / filter these chunks, they are potentially in terms of all columns of a table. If a chunk is missing a column (or we know from statistics that it is not null), expressions like col IS NULL and col IS NOT NULL can be replaced with true or false, and predicates like col > 5 can be replaced with null > 5 in some cases.

Once this substitution is done, that may allow additional simplification of the predicate -- ideally all the way down to true or false
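A minimal sketch of the substitute-then-simplify idea, written against a toy expression type. The `Expr` enum, the `simplify` function, and the `missing` set are all hypothetical names for illustration; none of this is DataFusion's actual `Expr` or optimizer code. It assumes a missing column behaves as all-NULL and uses SQL three-valued logic for `AND`.

```rust
use std::collections::HashSet;

// Toy expression type for illustration only (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Int(i64),
    Bool(bool),
    Null,
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Replace references to missing columns with NULL, then fold what can be folded.
fn simplify(expr: Expr, missing: &HashSet<&str>) -> Expr {
    use Expr::*;
    match expr {
        // A column the chunk does not have is treated as NULL.
        Column(name) if missing.contains(name.as_str()) => Null,
        IsNull(inner) => match simplify(*inner, missing) {
            Null => Bool(true),
            Int(_) | Bool(_) => Bool(false),
            other => IsNull(Box::new(other)),
        },
        IsNotNull(inner) => match simplify(*inner, missing) {
            Null => Bool(false),
            Int(_) | Bool(_) => Bool(true),
            other => IsNotNull(Box::new(other)),
        },
        Gt(l, r) => match (simplify(*l, missing), simplify(*r, missing)) {
            // Any comparison with NULL is NULL.
            (Null, _) | (_, Null) => Null,
            (Int(a), Int(b)) => Bool(a > b),
            (l, r) => Gt(Box::new(l), Box::new(r)),
        },
        And(l, r) => match (simplify(*l, missing), simplify(*r, missing)) {
            // false AND x is false, even when x is NULL.
            (Bool(false), _) | (_, Bool(false)) => Bool(false),
            // true AND x is x.
            (Bool(true), other) | (other, Bool(true)) => other,
            (l, r) => And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With column `b` missing, `b IS NOT NULL AND a > 5` collapses to `false`, `b IS NULL AND a > 5` collapses to `a > 5`, and `b > 5` collapses to `NULL`.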

One particular type of expression we will use in IOx maps null to a '' value, like this:

CASE 
  WHEN col is NULL THEN '' 
  ELSE col 
END
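To illustrate how a CASE expression like the one above could collapse once per-chunk nullness is known, here is a small sketch on a toy expression type. `Expr`, `Nullness`, and `simplify` are hypothetical names, not DataFusion APIs, and the `stats` map stands in for whatever chunk statistics are actually available.

```rust
use std::collections::HashMap;

// Hypothetical per-column knowledge derived from chunk metadata or statistics.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Nullness {
    AllNull, // e.g. the column is missing from this chunk
    NoNulls, // statistics say the column has no NULLs
}

// Toy expression type for illustration only (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Utf8(String),
    Bool(bool),
    IsNull(Box<Expr>),
    Case { when: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

fn simplify(expr: Expr, stats: &HashMap<&str, Nullness>) -> Expr {
    match expr {
        Expr::IsNull(inner) => {
            // Look up what is known about the column, if the operand is a column.
            let known = match &*inner {
                Expr::Column(name) => stats.get(name.as_str()).copied(),
                _ => None,
            };
            match known {
                Some(Nullness::AllNull) => Expr::Bool(true),
                Some(Nullness::NoNulls) => Expr::Bool(false),
                None => Expr::IsNull(inner),
            }
        }
        Expr::Case { when, then, otherwise } => match simplify(*when, stats) {
            // The condition is known, so only one branch can ever be taken.
            Expr::Bool(true) => simplify(*then, stats),
            Expr::Bool(false) => simplify(*otherwise, stats),
            when => Expr::Case {
                when: Box::new(when),
                then: Box::new(simplify(*then, stats)),
                otherwise: Box::new(simplify(*otherwise, stats)),
            },
        },
        other => other,
    }
}
```

Under these assumptions, the CASE expression becomes the literal `''` for a chunk where `col` is all NULL, becomes plain `col` where `col` has no NULLs, and is left unchanged when nothing is known.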

The same general pattern likely holds for ParquetExec now that @thinkharderdev has added support to merge schemas for multiple files in #1622. Once DataFusion is able to push predicates down into the parquet scans, simplifying the predicates as much as possible beforehand would be ideal.

The current API in https://github.com/apache/arrow-datafusion/blob/03075d5f4b3fdfd8f82144fcd409418832a4bf69/datafusion/src/optimizer/simplify_expressions.rs is

  1. Private
  2. Requires ExecutionProps, which is fairly entangled with the overall machinery of how plans are executed (and means we see issues like DiskManager and TempFiles getting created several times per query, #1690)

Describe the solution you'd like
I would like DataFusion to have a public API for simplifying expressions. The proposed API looks like:

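/// Describes what expression evaluation requires (methods still to be designed).
pub trait ExprEvalContext {
}

impl Expr {
  /// Returns a simplified version of this expression.
  pub fn simplify(self, context: &dyn ExprEvalContext) -> Self {
    todo!() // apply simplification rewrites using `context`
  }
}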

I am thinking of ExprEvalContext as a trait so that it is clear what expression evaluation actually requires, and so that Exprs can be simplified either prior to execution or in the bowels of DataFusion's planner (I will implement the trait for ExecutionProps).
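A rough sketch of how the trait-based design could look in practice. Everything here is an assumption for illustration: the `query_start_nanos` method, the `ToyExecutionProps` struct, and the toy `Expr` enum are hypothetical and are not DataFusion's real types. The sketch assumes the only thing simplification needs from its environment is the query start time, used to fold a `now()`-style expression.

```rust
/// Hypothetical trait: the one thing this toy simplifier needs from its caller.
pub trait ExprEvalContext {
    fn query_start_nanos(&self) -> i64;
}

/// Toy stand-in for an execution-time properties struct (NOT the real `ExecutionProps`).
pub struct ToyExecutionProps {
    pub start_nanos: i64,
}

impl ExprEvalContext for ToyExecutionProps {
    fn query_start_nanos(&self) -> i64 {
        self.start_nanos
    }
}

/// Toy expression type for illustration only (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    TimestampNanos(i64),
    Bool(bool),
    Now,
    Lt(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Simplify using only what the context trait exposes.
    pub fn simplify(self, ctx: &dyn ExprEvalContext) -> Self {
        match self {
            // `now()` becomes a literal once the start time is known.
            Expr::Now => Expr::TimestampNanos(ctx.query_start_nanos()),
            Expr::Lt(l, r) => match ((*l).simplify(ctx), (*r).simplify(ctx)) {
                // Both sides are literals, so the comparison folds to a boolean.
                (Expr::TimestampNanos(a), Expr::TimestampNanos(b)) => Expr::Bool(a < b),
                (l, r) => Expr::Lt(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}
```

Because `simplify` depends only on the trait, a caller outside the execution machinery (for example, code pruning chunks before a plan exists) could pass its own lightweight context instead of constructing the full execution state.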

Describe alternatives you've considered
I am not fully sure about the API design -- I'll know more when I sketch one out

Additional context
#1693
https://github.com/influxdata/influxdb_iox/pull/3557

Labels: enhancement (New feature or request)
