-
-
Couldn't load subscription status.
- Fork 239
Remove filters from LookupJoins when they're provably not required. #1888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…returns all non-null rows.
… in the lookup was a strong candidate, but that's not necessarily true.
|
#benchmark |
|
I created #1889 to benchmark just the costing change on its own. |
| found := false | ||
| for _, matchedFilter := range matchedFilters { | ||
| if filter == matchedFilter { | ||
| found = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do
filters = append(filters, filter)
breakand get rid of the found flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, could make matchedFilters a map, and avoid a nested for loop. Doubtful this will really make a difference in runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose behind the loop is to compute the set difference: we want to fill filters with every filter from rel.Filter that isn't also in matchedFilters. Obnoxiously there's no way to do this in the standard library that I could find.
Something with a map would be asymptotically faster, but a nested loop is likely faster for all real uses cases. (You'd need a lot of filters in your join before the overhead of maintaining a map becomes worth it.)
|
Turns out pretty much all of the performance wins came from #1889. Removing the filters from joins is nice for simplifying plans, but impacts performance only negligibly, and this is potentially a risky change. |
|
So it looks like the described plans include the index being used for the lookup, but not what the key expressions on the index are. This is important information that is needed to fully understand the plan, and if we remove the filters there's no way to infer that information. Adding that information wouldn't be too hard, but it's not trivial either and getting this in isn't a priority at the moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, one small comment
| for _, filter := range rel.Filter { | ||
| found := false | ||
| for _, matchedFilter := range matchedFilters { | ||
| if filter == matchedFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might be able to just match with e.ExprId()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! That's not only more efficient, but that allows us to detect matches through table aliases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching to use e.ExprId() appears to have broken some correctness tests. I'm going to merge the previous version.
f937703 to
1c849f6
Compare
|
@nicktobey I'm inclined to close this unless you feel strongly |
|
I have a three month rule. You have until Oct 21. |
During joins, we still evaluate every filter for every candidate row. But based on the join implementation, some of those filters must necessarily be true, so we don't need to evaluate them.
In most joins the performance cost of this isn't that bad, but this problem is most noticeable in LookupJoins where a point lookup is constructed from several columns, at which point the filter evaluation can dominate the runtime.
One potential drawback of this change is that this might make reading query plans more obtuse, because some filters are no longer explicitly listed in the plan, but are a consequence of Join node's children. One option to prevent this would be to add metadata to joins listing filters that they assume to hold, or marking filters in the execution plan as "skipped." However, I think that described execution plans should match the actual behavior of the execution as closely as possible, so this may not actually be a concern.
Some notes about the individual commits in this PR:
3b54dbe: This commit fixes an existing bug which caused lookups to return extra rows if the filter was a null-safe-equals check, and the key is non-null. This bug previously caused no issues because these extra rows would not match the filter and would get dropped. Now that we're skipping filters we know to be extraneous, this bug would manifest if not fixed.
48e8777: This commit changes the costing function of lookup joins. In the event that the lookup expressions can't be proved to uniquely identify a row, we attempt to estimate what percentage of the rows will be returned by the lookup. It's a very rough estimate, and serves mostly to guide which index to use if the table has multiple indexes: the more filters the index is able to make redundant, the better. The previous costing implementation had a special case for indexes where every column in the index was filtered on, and assigned that index a score somewhere between a 3-key lookup and a 4-key lookup. I imagine that the thought process when this was implemented was a lookup that used every column in an index would tend to result in very few rows compared to a lookup using an index prefix. Of course this is data dependent and I'm not convinced is generally true.
2b384a3: This commit updates tests that check for specific query plans. Most of these updates are just eliminating filters. A small number of updates are a result of changing the lookup costing, resulting in different join types. I haven't looked at these closely yet. This may be okay, it may not be.