Skip to content

Conversation

@nicktobey
Copy link
Contributor

Fixes dolthub/dolt#6713

This PR is built on top of #2063 and includes its changes, plus two additional commits.

The purpose of this PR is to attempt to construct a left-deep join ordering where every join is possible to optimize into a lookup. It uses Functional Dependencies to identify the column set that determines every other set in the final join. If a "lookup only join order" exists, then these columns must be in the innermost join. We use that as the basis and pick join filters one at a time to create a new join plan.

Currently this only handles the case where a single column determines every other column, and only the case where at each juncture, only a single possible choice can be made for the next table in the join. This is sufficient to handle the queries used in sqllogictests

Right now, this is intentionally limited in scope: it's designed to only affect the extremely large (up to 64 tables) joins used in sqllogictests, and won't currently be used outside of large joins. But there's no reason that this can't be used for other joins. In fact, in some cases I observed this generate correct join plans that our current brute-force reordering misses. (edge.applicable claims that these ordering violate a conflict rule even though they should be correct. We may be overly conservative in some places.)

@nicktobey nicktobey requested a review from max-hoffman October 9, 2023 23:57
return
}
if currentlyJoinedTables.contains(j.findVertexFromGroup(tablesInEdge[0])) {
if nextEdgeIdx != -1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is re-using the graph step (edge) variable as a check for whether or not we are attempting to branch our DFS? Took me a bit to understand, an extra variable might be easier to read. Alternatively, could maybe separate collecting next possible steps (if find >2 break), and executing the step post-validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the method to make it easier to follow.

innerEdges edgeSet
nonInnerEdges edgeSet
newPlanCb func(j *joinOrderBuilder, rel RelExpr)
forceFastReorderForTest bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing as before for better var name, fastDFSLookup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to make clear that this field is only used in tests and shouldn't be set in production code. A field name is harder to ignore than a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to forceFastDFSLookupForTest. How's that?

Copy link
Contributor

@max-hoffman max-hoffman Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just feels weird marking an attribute as a test accessor when probably one of two things is accurate. 1) If you really want to make sure no one in GMS can use the flag, you could have tests manually call the fast path. 2) It's a legitimate optimization that we might make available via a join hint or otherwise, in which case the "test" part doesn't really add anything. In-between seems messier from an organizational standpoint. Obv not a critical thing, just an organization thing.

if set.Intersects(cols2) {
cols2 = cols2.Union(set)
var oldClosure ColSet
newClosure := source.Copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am seeing left join tests affected, wondering whether it's still correct in those cases, or if we would make overly aggressive FD assumptions for a left join based on a partial FD for nullified RHS row.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like the naming changes, adding a doc comment might improve this is bit more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left joins pass in nil for fdsKeys, which preserves the previous behavior. I added a docstring.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to ship after that last correctness bug this unearthed. Thanks for the doc comments and refactoring the lookup DFS. One stylistic comment considering when we might expose fast lookup as a join hint in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dolt takes forever to generate join plans for a join of many tables.

2 participants