Skip to content

Conversation

@nicktobey
Copy link
Contributor

During joins, we still evaluate every filter for every candidate row. But based on the join implementation, some of those filters must necessarily be true, so we don't need to evaluate them.

In most joins the performance cost of this isn't that bad, but this problem is most noticeable in LookupJoins where a point lookup is constructed from several columns, at which point the filter evaluation can dominate the runtime.

One potential drawback of this change is that this might make reading query plans more obtuse, because some filters are no longer explicitly listed in the plan, but are a consequence of Join node's children. One option to prevent this would be to add metadata to joins listing filters that they assume to hold, or marking filters in the execution plan as "skipped." However, I think that described execution plans should match the actual behavior of the execution as closely as possible, so this may not actually be a concern.

Some notes about the individual commits in this PR:

3b54dbe: This commit fixes an existing bug which caused lookups to return extra rows if the filter was a null-safe-equals check, and the key is non-null. This bug previously caused no issues because these extra rows would not match the filter and would get dropped. Now that we're skipping filters we know to be extraneous, this bug would manifest if not fixed.

48e8777: This commit changes the costing function of lookup joins. In the event that the lookup expressions can't be proved to uniquely identify a row, we attempt to estimate what percentage of the rows will be returned by the lookup. It's a very rough estimate, and serves mostly to guide which index to use if the table has multiple indexes: the more filters the index is able to make redundant, the better. The previous costing implementation had a special case for indexes where every column in the index was filtered on, and assigned that index a score somewhere between a 3-key lookup and a 4-key lookup. I imagine that the thought process when this was implemented was a lookup that used every column in an index would tend to result in very few rows compared to a lookup using an index prefix. Of course this is data dependent and I'm not convinced is generally true.

2b384a3: This commit updates tests that check for specific query plans. Most of these updates are just eliminating filters. A small number of updates are a result of changing the lookup costing, resulting in different join types. I haven't looked at these closely yet. This may be okay, it may not be.

@nicktobey nicktobey requested review from jycor and max-hoffman July 21, 2023 22:53
@max-hoffman
Copy link
Contributor

#benchmark

@nicktobey
Copy link
Contributor Author

I created #1889 to benchmark just the costing change on its own.

found := false
for _, matchedFilter := range matchedFilters {
if filter == matchedFilter {
found = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do

filters = append(filters, filter)
break

and get rid of the found flag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, could make matchedFilters a map, and avoid a nested for loop. Doubtful this will really make a difference in runtime.

Copy link
Contributor Author

@nicktobey nicktobey Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose behind the loop is to compute the set difference: we want to fill filters with every filter from rel.Filter that isn't also in matchedFilters. Obnoxiously there's no way to do this in the standard library that I could find.

Something with a map would be asymptotically faster, but a nested loop is likely faster for all real uses cases. (You'd need a lot of filters in your join before the overhead of maintaining a map becomes worth it.)

@nicktobey
Copy link
Contributor Author

Turns out pretty much all of the performance wins came from #1889. Removing the filters from joins is nice for simplifying plans, but impacts performance only negligibly, and this is potentially a risky change.

@nicktobey
Copy link
Contributor Author

So it looks like the described plans include the index being used for the lookup, but not what the key expressions on the index are. This is important information that is needed to fully understand the plan, and if we remove the filters there's no way to infer that information.

Adding that information wouldn't be too hard, but it's not trivial either and getting this in isn't a priority at the moment.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, one small comment

for _, filter := range rel.Filter {
found := false
for _, matchedFilter := range matchedFilters {
if filter == matchedFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to just match with e.ExprId()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! That's not only more efficient, but that allows us to detect matches through table aliases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to use e.ExprId() appears to have broken some correctness tests. I'm going to merge the previous version.

@nicktobey nicktobey force-pushed the nicktobey/remove-filters branch from f937703 to 1c849f6 Compare September 29, 2023 16:56
@max-hoffman
Copy link
Contributor

@nicktobey I'm inclined to close this unless you feel strongly

@timsehn
Copy link
Contributor

timsehn commented Oct 17, 2023

I have a three month rule. You have until Oct 21.

@nicktobey nicktobey merged commit 8c0e16c into main Oct 18, 2023
@nicktobey nicktobey deleted the nicktobey/remove-filters branch October 18, 2023 00:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants