-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve eliminate_outer_join rule #13249
base: main
Are you sure you want to change the base?
improve eliminate_outer_join rule #13249
Conversation
improve eliminate_outer_join rule
This is an experiment, and i need help to determine if this solution is feasible. If feasible, I will:
|
.iter() | ||
.filter(|col| child_schema_columns.contains(*col)) | ||
.cloned(); | ||
is_restrict_null_predicate(join_schema, predicate_cloned, cols).unwrap_or(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting find that this was already available 🥳
the substrait bug blocks the current PR from passing |
} | ||
|
||
#[test] | ||
fn eliminate_right_with_null_udf() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test name makes it should to be that we expect it to eliminate the join, but we do not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since always_null(t1.b) IS NULL
is not a restrict null predicate, it cannot eliminate the join. This is probably repeated with eliminate_left_with_null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion would be to name the test not_eliminate
since we do not expect the test case to eliminate the outer join.
|
||
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
Ok(DataType::Boolean) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need to implement an invoke
method here? Or how does one know the test passes due to us properly checking volatility and not because the invoke method produces an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right. Maybe like this?
fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
panic!()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That or maybe some simple implementation. I would lead towards implementing it, just so so there is less chance of some other code change causing a panic here.
fn always_null<S: SimplifyInfo>(expr: &Expr, info: &S) -> bool { | ||
is_null(expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have some comment explaining when to use always_null
over just is_null
? Because to me it seems like one would always want always_null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is_null
function cannot contain a non-Expr::Literal
case where data type is DataType::Null
(eg. Expr::Column). Maybe we should change the name of always_null
? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need that distinction? Or could we make all current usage of is_null
use always_null
?
If that is the case it seems like we should just improve is_null
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need these change now that we produce use null columns of the correct type instead of DataType::Null
?
// If result is single `true`, return false; | ||
// If result is single `NULL` or `false`, return true; | ||
Ok(match phys_expr.evaluate(&input_batch)? { | ||
ColumnarValue::Array(array) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we no longer need to handle exprs returning single element arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike before, we have now added the logic of 'expression simplification'. In ExprSimplifier
, the previous evaluate
is implemented by ConstEvaluator
.
datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Lines 633 to 650 in 0458d30
let phys_expr = | |
match create_physical_expr(&expr, &self.input_schema, self.execution_props) { | |
Ok(e) => e, | |
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr), | |
}; | |
let col_val = match phys_expr.evaluate(&self.input_batch) { | |
Ok(v) => v, | |
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr), | |
}; | |
match col_val { | |
ColumnarValue::Array(a) => { | |
if a.len() != 1 { | |
ConstSimplifyResult::SimplifyRuntimeError( | |
DataFusionError::Execution(format!("Could not evaluate the expression, found a result of length {}", a.len())), | |
expr, | |
) | |
} else if as_list_array(&a).is_ok() { | |
ConstSimplifyResult::Simplified(ScalarValue::List( |
datafusion/optimizer/src/utils.rs
Outdated
let ret = match &expr { | ||
Expr::Literal(scalar) if scalar.is_null() => true, | ||
Expr::Literal(ScalarValue::Boolean(Some(b))) => !b, | ||
_ if matches!(expr.get_type(input_schema)?, DataType::Null) => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this does not depend on evaluating expr, should we make this condition be an early return at the top of the method instead?
datafusion/optimizer/src/utils.rs
Outdated
let null_column = Column::from_name(DUMMY_COL_NAME); | ||
let replace_columns = cols_of_predicate.into_iter().collect(); | ||
let replaced_predicate = replace_expr_with_null(predicate, &replace_columns)?; | ||
let coerced_predicate = coerce(replaced_predicate, input_schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have access to the schema here could could we replace the exprs with nulls with correct types instead and therefore avoiding needing to coerce
the expr after changing it?
Or at least when it looked at the coerce logic it seemed quite complex and I have a hard time convincing my self that it can never affect the result of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Let me fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the coerce
function. But some Exprs that is restrict null predicate cannot be judged to be true.
https://github.com/JasonLi-cn/arrow-datafusion/blob/3bb49de20476a12a0b0b42ae6869991a8d83cb7c/datafusion/optimizer/src/utils.rs#L263-L303
But since these cast
s are handled with type_coercion
, they are also acceptable.
https://github.com/JasonLi-cn/arrow-datafusion/blob/fdc9f5974579ab6169c1577d8000c315df0219d7/datafusion/sqllogictest/test_files/joins.slt#L4321-L4337
} | ||
|
||
// If result is single `true`, return false; | ||
// If result is single `NULL` or `false`, return true; | ||
Ok(match phys_expr.evaluate(&input_batch)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did stop using phys_expr.evaluate
? Did that not allow us to handle more cases?
improve eliminate_outer_join rule
Which issue does this PR close?
Closes #13232
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?