Simplify EXPR LIKE 'constant' to expr = 'constant'#13061
Simplify EXPR LIKE 'constant' to expr = 'constant'#13061adriangb wants to merge 5 commits intoapache:mainfrom
EXPR LIKE 'constant' to expr = 'constant'#13061Conversation
datafusion/optimizer/Cargo.toml
Outdated
| chrono = { workspace = true } | ||
| datafusion-common = { workspace = true, default-features = true } | ||
| datafusion-expr = { workspace = true } | ||
| datafusion-functions = { workspace = true } |
There was a problem hiding this comment.
Not sure if this is okay, or if there is a way to avoid it?
There was a problem hiding this comment.
I think we need to make some interface for rewriting the function through 🤔
Perhaps adding a method to SimplifyInfo like
/// return a new Expr with a function call for exact string prefix max
fn make_starts_with(base: Expr, prefix: Expr) -> Option<Expr>There was a problem hiding this comment.
Hmm not sure what you mean. Is this a pattern already in use somewhere I can learn from?
There was a problem hiding this comment.
I was suggest adding a method to https://docs.rs/datafusion/latest/datafusion/logical_expr/simplify/trait.SimplifyInfo.html
However i can't quite figure out how to plumb it in without getting messy
THe trick is that this rewrite is based on the semantcs of starts_with -- so I think logically it would need to live along side the implementation of starts_with because it depends on those semantics.
And similarly, to do the rewrite in the pruning predicate we need something similar (somehow to provide a per function rewrite) 🤔
There was a problem hiding this comment.
Hmm okay I'm happy to work on this more but I fear I am not familiar enough with the relationship between all of these things to figure out how to wire everything up correctly. Maybe we pause this work until someone with more familiarity of these bits of code can suggest a way to structure this?
There was a problem hiding this comment.
I will try and help push this forward this week. I am sorry for the delay but I have been super busy the last few days
|
Wow, this is now rewriting |
|
I thought more about this PR the other day -- given the dependecy you have identified, I wonder if we might actually want to do the opposite Specifically rewrite starts_with(x, 'foo')into x LIKE 'foo%'With the rationale that I'll try and find a ticket to track the idea |
|
Hmm interesting. The dependency is a bit strange to me, I'm not sure why functions would depend on the optimizer, I would think the optimizer could depend on functions. |
alamb
left a comment
There was a problem hiding this comment.
@adriangb what would you think about updating this PR to handle the no wild card case.
The more I think about the prefix case the less I am convinced that translating to some other function would be valuable. Instead I am thinking maybe we can ensure the LIKE evaluation is fast (perhaps it can detect this case and invoke a better kernel) 🤔
|
To handle ONLY the no wildcard case? Happy to do that if that's all the progress we can make for now. |
| if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() { | ||
| let prefix = pattern[..index].to_string(); | ||
| let new_expr = Expr::ScalarFunction(ScalarFunction { | ||
| func: datafusion_functions::string::starts_with(), |
There was a problem hiding this comment.
Today a downstream project can choose which functions to use, this is configurable in the session, right? The DF repertoire of functions serves as a convenient package, but not mandatory.
Here we're inserting implicit function call to a specific function.
Is this following a pre-existing pattern? if so, i am OK with this.
But maybe it would be better to eg consult session state whether this function was supposed to be used. What do you think?
| Expr::Like(Like { | ||
| expr, | ||
| pattern, | ||
| expr: ref like_expr, |
There was a problem hiding this comment.
keep expr as expr
use like_expr when referring to the whole Expr::Like thing
| if pattern == "%" && !is_null(like_expr) { | ||
| return Ok(Transformed::yes(lit(!negated))); |
There was a problem hiding this comment.
This is cool idea, but is_null(like_expr) checks makes it incorrect.
This function returns "is this always null". Applied to a column reference, it will return false, so this optimization will kick in, but the column values are still nullable.
The "more correct" way would be to
- optimize
WHERE col LIKE '%'toWHERE col IS NOT NULL - optimize
WHERE col NOT LIKE '%'toWHERE false.
However, this simplification cannot be done here. It can only be done for top-level filter conjuncts, or when expr being simplified is immediately used as a condition (eg IF(expr, ...)). In other cases, this need to be NULL-preserving boolean logic:
trino> SELECT a, a LIKE '%', a NOT LIKE '%'
-> FROM (VALUES NULL, '', 'x') t(a);
a | _col1 | _col2
------+-------+-------
NULL | NULL | NULL
| true | false
x | true | false
(3 rows)
There was a problem hiding this comment.
The easiest solution would be to just drop this special case.
I.e. expr LIKE '%' would get simplified to starts_with(expr, '')
| let pct_wildcard_index = pattern.find('%'); | ||
| let underscore_index = pattern.find('_'); |
|
|
||
| // Handle single % at end case (prefix search) | ||
| if let Some(index) = pct_wildcard_index { | ||
| if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() { |
| let expr = not_ilike(col("c1"), "%"); | ||
| assert_eq!(simplify(expr), lit(false)); | ||
|
|
||
| let expr = like(col("c1"), "a%"); |
There was a problem hiding this comment.
add test cases
-
with
%pattern -
with
_pattern (can be simplified to a length check, but that can be follow up) -
with
%%and%%%patterns (can be simplified just like single%) -
with
__and___patterns (can be simplified to a length check, but that can be follow up) -
with patterns combining % and _, eg
a_b%,a%b_(these cannot simplified) and_%,%_(these can be simplified to a length check, but that can be follow up)
|
do we have equivalent of #12978 for starts_with? |
EXPR LIKE 'constant' to expr = 'constant'
| if !is_null(&like_expr.expr) && pattern_str == "%" { | ||
| Transformed::yes(lit(!like_expr.negated)) |
There was a problem hiding this comment.
I don't think this is correct. Did you consider #13061 (comment) ?
There was a problem hiding this comment.
This code was already here (see diff above). Did I change the logic in any way?
There was a problem hiding this comment.
Ack. Can you please confirm or disconfirm the correctness concern?
There was a problem hiding this comment.
I agree this is incorrect but this is the logic that was there before this PR. I'm happy to remove it. I guess no one has complained because the case that breaks is SELECT NULL LIKE '%' which isn't something that happens under normal usage.
There was a problem hiding this comment.
I agree this is incorrect but this is the logic that was there before this PR.
thanks for confirming and understood it's pre-existing.
Can it be fixed though?
i personally would prefer that git annotate on the file wouldn't point at me as the author of something that's known to be incorrect.
There was a problem hiding this comment.
My latest commit removes that special case. I'm not sure how to fix it, your comment suggests it can't be fixed right?
There was a problem hiding this comment.
I guess no one has complained because the case that breaks is
SELECT NULL LIKE '%'which isn't something that happens under normal usage.
actually the opposite.
this was correct only for this case
and wrong for all other ... LIKE '%' queries.
see #13259
| } else if !pattern_str.contains(['%', '_'].as_ref()) { | ||
| // If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression |
There was a problem hiding this comment.
Add TODO to handle escaped search characters in a follow-up.
Escaped wildcards are not uncommon in certain scenarios - eg correct use of JDBC DatabaseMetaData will produce LIKE foo\_bar when inspecting object foo_bar.
There was a problem hiding this comment.
I made it not simplify if there is an escape character for now
| let expr = like(col("c1"), "a_"); | ||
| assert_eq!(simplify(expr), col("c1").like(lit("a_"))); | ||
| let expr = not_like(col("c1"), "a_"); | ||
| assert_eq!(simplify(expr), col("c1").not_like(lit("a_"))); |
There was a problem hiding this comment.
add a test cases where pattern is constant NULL (it won't be simplified today, that's fine)
| Transformed::yes(lit(!like_expr.negated)) | ||
| } else if !pattern_str.contains(['%', '_'].as_ref()) { | ||
| // If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression | ||
| if like_expr.escape_char.is_some() { |
There was a problem hiding this comment.
nit: can this condition be checked on the same line with !pattern_str.contains(['%', '_']
it would reduce code branching.
| Expr::Literal(ScalarValue::Utf8(Some(pattern_str))) if pattern_str == "%" | ||
| ) => | ||
| { | ||
| Transformed::yes(lit(!negated)) |
There was a problem hiding this comment.
remove this is correct, but requires updating some test cases.
i am doing this in #13259
There was a problem hiding this comment.
awesome, I'll wait for your work to be merged then continue here
|
Closing in favor of #13260 |
#12978 (comment)
Closes #13192