-
-
Notifications
You must be signed in to change notification settings - Fork 238
Implement fast join ordering for large joins that can be implemented as a series of lookups. #2064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… to efficiently compute a join ordering for large joins.
sql/memo/join_order_builder.go
Outdated
| return | ||
| } | ||
| if currentlyJoinedTables.contains(j.findVertexFromGroup(tablesInEdge[0])) { | ||
| if nextEdgeIdx != -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is re-using the graph step (edge) variable as a check for whether or not we are attempting to branch our DFS? Took me a bit to understand, an extra variable might be easier to read. Alternatively, could maybe separate collecting next possible steps (if find >2 break), and executing the step post-validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored the method to make it easier to follow.
sql/memo/join_order_builder.go
Outdated
| innerEdges edgeSet | ||
| nonInnerEdges edgeSet | ||
| newPlanCb func(j *joinOrderBuilder, rel RelExpr) | ||
| forceFastReorderForTest bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing as before for better var name, fastDFSLookup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to make clear that this field is only used in tests and shouldn't be set in production code. A field name is harder to ignore than a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to forceFastDFSLookupForTest. How's that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just feels weird marking an attribute as a test accessor when probably one of two things is accurate. 1) If you really want to make sure no one in GMS can use the flag, you could have tests manually call the fast path. 2) It's a legitimate optimization that we might make available via a join hint or otherwise, in which case the "test" part doesn't really add anything. In-between seems messier from an organizational standpoint. Obv not a critical thing, just an organization thing.
| if set.Intersects(cols2) { | ||
| cols2 = cols2.Union(set) | ||
| var oldClosure ColSet | ||
| newClosure := source.Copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing left join tests affected, wondering whether it's still correct in those cases, or if we would make overly aggressive FD assumptions for a left join based on a partial FD for nullified RHS row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like the naming changes, adding a doc comment might improve this is bit more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left joins pass in nil for fdsKeys, which preserves the previous behavior. I added a docstring.
…ins of 15+ tables can still take minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to ship after that last correctness bug this unearthed. Thanks for the doc comments and refactoring the lookup DFS. One stylistic comment considering when we might expose fast lookup as a join hint in the future.
Fixes dolthub/dolt#6713
This PR is built on top of #2063 and includes its changes, plus two additional commits.
The purpose of this PR is to attempt to construct a left-deep join ordering where every join is possible to optimize into a lookup. It uses Functional Dependencies to identify the column set that determines every other set in the final join. If a "lookup only join order" exists, then these columns must be in the innermost join. We use that as the basis and pick join filters one at a time to create a new join plan.
Currently this only handles the case where a single column determines every other column, and only the case where at each juncture, only a single possible choice can be made for the next table in the join. This is sufficient to handle the queries used in
sqllogictestsRight now, this is intentionally limited in scope: it's designed to only affect the extremely large (up to 64 tables) joins used in
sqllogictests, and won't currently be used outside of large joins. But there's no reason that this can't be used for other joins. In fact, in some cases I observed this generate correct join plans that our current brute-force reordering misses. (edge.applicableclaims that these ordering violate a conflict rule even though they should be correct. We may be overly conservative in some places.)