-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expression Simplifier doesn't consider associativity ((i + 1) + 2)
is not simplified to i + 3
)
#11594
Comments
Here is the full example from Discord: use datafusion::arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use datafusion::optimizer::simplify_expressions::ExprSimplifier;
use datafusion::{error::Result, prelude::*};
use datafusion_common::ToDFSchema;
use datafusion_expr::execution_props::ExecutionProps;
use datafusion_expr::simplify::SimplifyContext;
#[tokio::main]
async fn main() -> Result<()> {
let schema = Schema::new(vec![make_field("i", DataType::Int64)]).to_dfschema_ref()?;
let props = ExecutionProps::new();
let context = SimplifyContext::new(&props).with_schema(schema);
let simplifier = ExprSimplifier::new(context);
// i + (1 + 2) => i + 3
assert_eq!(
simplifier.simplify(col("i") + (lit(1) + lit(2)))?,
col("i") + lit(3)
);
// i + 1 + 2 => i + 3 # not true. Why not?
assert_eq!(
simplifier.simplify(col("i") + lit(1) + lit(2))?,
col("i") + lit(3)
);
Ok(())
}
fn make_field(name: &str, data_type: DataType) -> Field {
let nullable = false;
Field::new(name, data_type, nullable)
} |
Off the cuff, if we go down this avenue I'd imagine wanting to check for associative, commutative, and distributive properties. For example,
In these examples, Associativity by itself my not pick up on I can imagine one approach of:
|
It would be interesting to do some research about what other systems do for this case (e.g. Calcite and DuckDB) I suspect we could add some good special cases related to literals (e.g. apply the rules @timsaucer is describing above) |
A couple of resources for using an existing CAS (Computer Algebra System). SymPy is a python package, so not as useful but appears to be well built and has a good description of some of the problems of simplifying: https://docs.sympy.org/latest/tutorials/intro-tutorial/simplification.html SymEngine is implemented in c++ and appears to support most of the use cases I have thought about so far: Non-official crate to wrap SymEngine https://crates.io/crates/symengine/0.1.0 Cas-rs is a newer CAS written in rust that appears to be under active development: |
Thank you @timsaucer -- very cool I think there would be a tradeoff between the size of the dependency / how mature it is and additional benefit in DataFusion So if we could implement something relatively simple (even if it duplicated existing functionality in another library) but didn't introduce a new dependency, I think that would be worth considering. |
DuckDB v1.0.0 (https://shell.duckdb.org/) behaves the same as Apache Datafusion. Doesn't consider associativity in expressions with a column reference.
|
I tried a simpler solution by re-arranging the expressions if they match the criteria for associativity (currently only considering We just need to re-arrange expressions like We might've some duplication of code due to the various positions that we need to consider:
|
My biggest concern with any solution for this issue is to keep the code complexity reasonable as well as not to slow down planning (by, e.g adding some sort of expontential search algorithm) |
Besides combining literal values, is there much value to this work? That is, are there other expressions that would reasonably get combined? I haven't looked much deeper at what the simplifier does. I completely agree about the code complexity. This is a non-trivial problem that could become a deep rabbit hole. |
I think agree literal values are the major usecase that I know if There are likely other simplifications that would benefit such as some of the logical simplifications but someone wuold have to think hard and find some examples I think adding some simple heuristics when literals are involved might be simplest way to make some improvements |
Would dropping unnecessary parenthesis in a 1st pass help? |
It might be that parens are purely syntactic thing. They don't exist after we exit from parser. |
Is your feature request related to a problem or challenge?
DataFusion will simplify expressions like this:
i + (1 + 2)
=>i + 3
However, it will not simplify
i + 1 + 2
(remainsi + 1 + 2
)You can see this in the explain plans
@timsaucer has identified the problem
So in this case
i + 1 + 2
is parsed as(i + 1) + 2
and since(i + 1)
can't be reduced, the entire expression isn't eitherDescribe the solution you'd like
It would be nice to properly support this simplification
Describe alternatives you've considered
We'll have to consider how to apply associativity (I am sure there is prior art in this area) as to solve the above issue it would need to potentially reorder the operations so constants are together and then also re-apply the const evaluator
datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Lines 711 to 1544 in 63efaee
Additional context
Came up on discord ( @kavirajk) https://discord.com/channels/885562378132000778/1166447479609376850/1264325499971436594
The text was updated successfully, but these errors were encountered: