improve eliminate_outer_join rule #13249

JasonLi-cn · 2024-11-04T13:49:44Z

improve eliminate_outer_join rule

Which issue does this PR close?

Closes #13232

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

improve eliminate_outer_join rule

JasonLi-cn · 2024-11-04T13:59:42Z

This is an experiment, and i need help to determine if this solution is feasible. If feasible, I will:

improve expr_simplifier
add test
fix the substrait bug for pass null_decimal_literal test

cc @Dandandan @eejbyfeldt @alamb @Rachelint

Dandandan · 2024-11-04T14:08:39Z

datafusion/optimizer/src/eliminate_outer_join.rs

+        .iter()
+        .filter(|col| child_schema_columns.contains(*col))
+        .cloned();
+    is_restrict_null_predicate(join_schema, predicate_cloned, cols).unwrap_or(false)


Interesting find that this was already available 🥳

JasonLi-cn · 2024-11-04T15:35:20Z

the substrait bug blocks the current PR from passing null_decimal_literal test.

eejbyfeldt · 2024-11-04T20:33:55Z

datafusion/optimizer/src/eliminate_outer_join.rs

+    }
+
+    #[test]
+    fn eliminate_right_with_null_udf() -> Result<()> {


The test name makes it should to be that we expect it to eliminate the join, but we do not?

Since always_null(t1.b) IS NULL is not a restrict null predicate, it cannot eliminate the join. This is probably repeated with eliminate_left_with_null

My suggestion would be to name the test not_eliminate since we do not expect the test case to eliminate the outer join.

eejbyfeldt · 2024-11-04T20:35:58Z

datafusion/optimizer/src/eliminate_outer_join.rs

+
+        fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+            Ok(DataType::Boolean)
+        }


Do we not need to implement an invoke method here? Or how does one know the test passes due to us properly checking volatility and not because the invoke method produces an error?

You were right. Maybe like this?

fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> { panic!() }

That or maybe some simple implementation. I would lead towards implementing it, just so so there is less chance of some other code change causing a panic here.

eejbyfeldt · 2024-11-04T20:38:04Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

+fn always_null<S: SimplifyInfo>(expr: &Expr, info: &S) -> bool {
+    is_null(expr)


Could we have some comment explaining when to use always_null over just is_null? Because to me it seems like one would always want always_null?

The is_null function cannot contain a non-Expr::Literal case where data type is DataType::Null(eg. Expr::Column). Maybe we should change the name of always_null? 🤔

Do we actually need that distinction? Or could we make all current usage of is_null use always_null?

If that is the case it seems like we should just improve is_null instead.

Do we still need these change now that we produce use null columns of the correct type instead of DataType::Null?

eejbyfeldt · 2024-11-04T20:40:56Z

datafusion/optimizer/src/utils.rs

-    // If result is single `true`, return false;
-    // If result is single `NULL` or `false`, return true;
-    Ok(match phys_expr.evaluate(&input_batch)? {
-        ColumnarValue::Array(array) => {


Why do we no longer need to handle exprs returning single element arrays?

Unlike before, we have now added the logic of 'expression simplification'. In ExprSimplifier, the previous evaluate is implemented by ConstEvaluator.

datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

Lines 633 to 650 in 0458d30

let phys_expr =

match create_physical_expr(&expr, &self.input_schema, self.execution_props) {

Ok(e) => e,

Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),

};

let col_val = match phys_expr.evaluate(&self.input_batch) {

Ok(v) => v,

Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),

};

match col_val {

ColumnarValue::Array(a) => {

if a.len() != 1 {

ConstSimplifyResult::SimplifyRuntimeError(

DataFusionError::Execution(format!("Could not evaluate the expression, found a result of length {}", a.len())),

expr,

)

} else if as_list_array(&a).is_ok() {

ConstSimplifyResult::Simplified(ScalarValue::List(

eejbyfeldt · 2024-11-04T20:41:41Z

datafusion/optimizer/src/utils.rs

+    let ret = match &expr {
+        Expr::Literal(scalar) if scalar.is_null() => true,
+        Expr::Literal(ScalarValue::Boolean(Some(b))) => !b,
+        _ if matches!(expr.get_type(input_schema)?, DataType::Null) => true,


Since this does not depend on evaluating expr, should we make this condition be an early return at the top of the method instead?

eejbyfeldt · 2024-11-04T20:50:15Z

datafusion/optimizer/src/utils.rs

-    let null_column = Column::from_name(DUMMY_COL_NAME);
+    let replace_columns = cols_of_predicate.into_iter().collect();
+    let replaced_predicate = replace_expr_with_null(predicate, &replace_columns)?;
+    let coerced_predicate = coerce(replaced_predicate, input_schema)?;


Since we have access to the schema here could could we replace the exprs with nulls with correct types instead and therefore avoiding needing to coerce the expr after changing it?

Or at least when it looked at the coerce logic it seemed quite complex and I have a hard time convincing my self that it can never affect the result of this function.

Good idea. Let me fix it.

I removed the coerce function. But some Exprs that is restrict null predicate cannot be judged to be true.
https://github.com/JasonLi-cn/arrow-datafusion/blob/3bb49de20476a12a0b0b42ae6869991a8d83cb7c/datafusion/optimizer/src/utils.rs#L263-L303

But since these casts are handled with type_coercion, they are also acceptable.
https://github.com/JasonLi-cn/arrow-datafusion/blob/fdc9f5974579ab6169c1577d8000c315df0219d7/datafusion/sqllogictest/test_files/joins.slt#L4321-L4337

eejbyfeldt · 2024-11-08T13:58:03Z

datafusion/optimizer/src/utils.rs

    }

-    // If result is single `true`, return false;
-    // If result is single `NULL` or `false`, return true;
-    Ok(match phys_expr.evaluate(&input_batch)? {


Why did stop using phys_expr.evaluate? Did that not allow us to handle more cases?

improve eliminate_outer_join rule

4fb183f

improve eliminate_outer_join rule

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait labels Nov 4, 2024

JasonLi-cn mentioned this pull request Nov 4, 2024

Eliminate more outer joins by supporting more expressions #13232

Open

Dandandan reviewed Nov 4, 2024

View reviewed changes

eejbyfeldt reviewed Nov 4, 2024

View reviewed changes

JasonLi-cn added 3 commits November 5, 2024 12:18

modify function

62a6dec

add some tests

3bb49de

add some sqllogictest

fdc9f59

eejbyfeldt reviewed Nov 8, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve eliminate_outer_join rule #13249

improve eliminate_outer_join rule #13249

JasonLi-cn commented Nov 4, 2024

JasonLi-cn commented Nov 4, 2024

Dandandan Nov 4, 2024

JasonLi-cn commented Nov 4, 2024

eejbyfeldt Nov 4, 2024

JasonLi-cn Nov 5, 2024

eejbyfeldt Nov 8, 2024

eejbyfeldt Nov 4, 2024

JasonLi-cn Nov 5, 2024

eejbyfeldt Nov 8, 2024

eejbyfeldt Nov 4, 2024

JasonLi-cn Nov 5, 2024

eejbyfeldt Nov 8, 2024

eejbyfeldt Nov 8, 2024

eejbyfeldt Nov 4, 2024

JasonLi-cn Nov 5, 2024

eejbyfeldt Nov 4, 2024

eejbyfeldt Nov 4, 2024

JasonLi-cn Nov 5, 2024

JasonLi-cn Nov 5, 2024

eejbyfeldt Nov 8, 2024

		fn always_null<S: SimplifyInfo>(expr: &Expr, info: &S) -> bool {
		is_null(expr)

	let phys_expr =
	match create_physical_expr(&expr, &self.input_schema, self.execution_props) {
	Ok(e) => e,
	Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),
	};
	let col_val = match phys_expr.evaluate(&self.input_batch) {
	Ok(v) => v,
	Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),
	};
	match col_val {
	ColumnarValue::Array(a) => {
	if a.len() != 1 {
	ConstSimplifyResult::SimplifyRuntimeError(
	DataFusionError::Execution(format!("Could not evaluate the expression, found a result of length {}", a.len())),
	expr,
	)
	} else if as_list_array(&a).is_ok() {
	ConstSimplifyResult::Simplified(ScalarValue::List(

improve eliminate_outer_join rule #13249

Are you sure you want to change the base?

improve eliminate_outer_join rule #13249

Conversation

JasonLi-cn commented Nov 4, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

JasonLi-cn commented Nov 4, 2024

Choose a reason for hiding this comment

JasonLi-cn commented Nov 4, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment