Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve eliminate_outer_join rule #13249

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

JasonLi-cn
Copy link
Contributor

improve eliminate_outer_join rule

Which issue does this PR close?

Closes #13232

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

improve eliminate_outer_join rule
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait labels Nov 4, 2024
@JasonLi-cn
Copy link
Contributor Author

This is an experiment, and i need help to determine if this solution is feasible. If feasible, I will:

  • improve expr_simplifier
  • add test
  • fix the substrait bug for pass null_decimal_literal test

cc @Dandandan @eejbyfeldt @alamb @Rachelint

.iter()
.filter(|col| child_schema_columns.contains(*col))
.cloned();
is_restrict_null_predicate(join_schema, predicate_cloned, cols).unwrap_or(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting find that this was already available 🥳

@JasonLi-cn
Copy link
Contributor Author

the substrait bug blocks the current PR from passing null_decimal_literal test.

}

#[test]
fn eliminate_right_with_null_udf() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name makes it should to be that we expect it to eliminate the join, but we do not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since always_null(t1.b) IS NULL is not a restrict null predicate, it cannot eliminate the join. This is probably repeated with eliminate_left_with_null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be to name the test not_eliminate since we do not expect the test case to eliminate the outer join.


fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
Ok(DataType::Boolean)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need to implement an invoke method here? Or how does one know the test passes due to us properly checking volatility and not because the invoke method produces an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right. Maybe like this?

fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
    panic!()
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That or maybe some simple implementation. I would lead towards implementing it, just so so there is less chance of some other code change causing a panic here.

Comment on lines +1698 to +1699
fn always_null<S: SimplifyInfo>(expr: &Expr, info: &S) -> bool {
is_null(expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have some comment explaining when to use always_null over just is_null? Because to me it seems like one would always want always_null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is_null function cannot contain a non-Expr::Literal case where data type is DataType::Null(eg. Expr::Column). Maybe we should change the name of always_null? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need that distinction? Or could we make all current usage of is_null use always_null?

If that is the case it seems like we should just improve is_null instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need these change now that we produce use null columns of the correct type instead of DataType::Null?

// If result is single `true`, return false;
// If result is single `NULL` or `false`, return true;
Ok(match phys_expr.evaluate(&input_batch)? {
ColumnarValue::Array(array) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we no longer need to handle exprs returning single element arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike before, we have now added the logic of 'expression simplification'. In ExprSimplifier, the previous evaluate is implemented by ConstEvaluator.

let phys_expr =
match create_physical_expr(&expr, &self.input_schema, self.execution_props) {
Ok(e) => e,
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),
};
let col_val = match phys_expr.evaluate(&self.input_batch) {
Ok(v) => v,
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr),
};
match col_val {
ColumnarValue::Array(a) => {
if a.len() != 1 {
ConstSimplifyResult::SimplifyRuntimeError(
DataFusionError::Execution(format!("Could not evaluate the expression, found a result of length {}", a.len())),
expr,
)
} else if as_list_array(&a).is_ok() {
ConstSimplifyResult::Simplified(ScalarValue::List(

let ret = match &expr {
Expr::Literal(scalar) if scalar.is_null() => true,
Expr::Literal(ScalarValue::Boolean(Some(b))) => !b,
_ if matches!(expr.get_type(input_schema)?, DataType::Null) => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this does not depend on evaluating expr, should we make this condition be an early return at the top of the method instead?

let null_column = Column::from_name(DUMMY_COL_NAME);
let replace_columns = cols_of_predicate.into_iter().collect();
let replaced_predicate = replace_expr_with_null(predicate, &replace_columns)?;
let coerced_predicate = coerce(replaced_predicate, input_schema)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have access to the schema here could could we replace the exprs with nulls with correct types instead and therefore avoiding needing to coerce the expr after changing it?

Or at least when it looked at the coerce logic it seemed quite complex and I have a hard time convincing my self that it can never affect the result of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Let me fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the coerce function. But some Exprs that is restrict null predicate cannot be judged to be true.
https://github.com/JasonLi-cn/arrow-datafusion/blob/3bb49de20476a12a0b0b42ae6869991a8d83cb7c/datafusion/optimizer/src/utils.rs#L263-L303

But since these casts are handled with type_coercion, they are also acceptable.
https://github.com/JasonLi-cn/arrow-datafusion/blob/fdc9f5974579ab6169c1577d8000c315df0219d7/datafusion/sqllogictest/test_files/joins.slt#L4321-L4337

}

// If result is single `true`, return false;
// If result is single `NULL` or `false`, return true;
Ok(match phys_expr.evaluate(&input_batch)? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did stop using phys_expr.evaluate? Did that not allow us to handle more cases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eliminate more outer joins by supporting more expressions
3 participants