feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709
feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709fivetran-kwoodbeck wants to merge 5 commits into
Conversation
SQLGlot Integration Test Results❌ 4 regressions — see details belowComparing:
By Dialect
Overallmain: 192441 total, 153536 passed (pass rate: 79.8%) sqlglot:transpile/snowflake-startswith: 106860 total, 106856 passed (pass rate: 100.0%) Transitions: Dialect pair changes: 0 previous results not found, 2 current results not found ❌ 4 regressions (view logs) |
90cc2e1 to
9c55e84
Compare
| existing_where = expression.args.get("where") | ||
| existing_with = expression.args.get("with_") | ||
|
|
||
| # WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them. |
There was a problem hiding this comment.
What if you have SELECT * FROM x WHERE c > 1 START WITH ...? Does the extra_cols logic result in duplicating c in the projection list? That may cause ambiguity issues if that's the case. Let's check this.
There was a problem hiding this comment.
The extra_cols block is guarded by if not has_star else {} above so when the query uses SELECT *, extra_cols is always empty, so c is not added to the projection list twice. There's a specific test to cover this case (SELECT * FROM (SELECT * FROM emp ...) WHERE c > 1) and verifies there's no duplication.
9c55e84 to
bb9744a
Compare
|
Let me know when this is ready for another review. |
bb9744a to
c1f4d7f
Compare
| connect_pred = connect.args.get("connect") | ||
| if not connect_pred: | ||
| return expression |
There was a problem hiding this comment.
This is a required arg:
| connect_pred = connect.args.get("connect") | |
| if not connect_pred: | |
| return expression | |
| connect_pred = connect.args["connect"] |
There was a problem hiding this comment.
TypeError: 'dict' object is not callable, I assume you're suggesting to just remove the if?
There was a problem hiding this comment.
I meant both removing if and switching to [] syntax instead of get().
| priors = list(connect_pred.find_all(exp.Prior)) | ||
| if len(priors) != 1: | ||
| return expression | ||
|
|
There was a problem hiding this comment.
What is the difficulty with transpiling multiple PRIORs? Isn't it a matter of combining the predicates with AND?
There was a problem hiding this comment.
For multiple PRIOR, we can add support at this spot fairly easily, but it makes cycle detection below quite a bit more complicated. We have a few options: 1) Remove cycle detection (Snowflake doesn't support it, so it's not strictly necessary to match Snowflake's behavior), 2) Simple fix to add in multiple PRIOR but leave cycle detection on one column only, 3) do the full correct cycle detection, probably another 30-40 lines of code to an already overly complex function.
I'd vote for Option 1 and call cycle detection out of scope for this PR.
There was a problem hiding this comment.
Snowflake doesn't support it, so it's not strictly necessary to match Snowflake's behavior
I was under the impression that it does support it, through the NOCYCLE specifier. If it doesn't, then the whole cycle detection trick we implemented can be cleaned up, otherwise, if what I say is correct, we should only emit that code when NOCYCLE is present to match Snowflake 100%.
I think it's worth to outline what the solution for handling >1 PRIOR clauses will look like, simply for future reference. Other than that, I'm happy to pause here and fall back to unsupported for now. We can always revisit this later.
| prior = priors[0] | ||
| if not isinstance(prior.this, exp.Column): | ||
| return expression |
There was a problem hiding this comment.
Can this ever not be a Column and still be valid Snowflake? If not, then we don't need this, as the input itself would be invalid and sqlglot doesn't need to validate it or take it into account.
There was a problem hiding this comment.
Yes it can be non Column. If we have PRIOR (x + 1), it will put Prior(Add(...)) in the AST. The existing tests fail if this if is removed.
| def _is_level_ref(e: exp.Expr) -> bool: | ||
| col = e.this if isinstance(e, exp.Alias) else e | ||
| return isinstance(col, exp.Column) and col.name.upper() == "LEVEL" and not col.table | ||
|
|
||
| has_level = any(_is_level_ref(e) for e in base_select_exprs) |
There was a problem hiding this comment.
You can also have SELECT level + 1 ..., for example. I don't think you'd detect level in that case.
There was a problem hiding this comment.
Good point, I'll make it find_all to look for a LEVEL.
| return isinstance(col, exp.Column) and col.name.upper() == "LEVEL" and not col.table | ||
|
|
||
| has_level = any(_is_level_ref(e) for e in base_select_exprs) | ||
| has_star = any(e.is_star for e in base_select_exprs) |
There was a problem hiding this comment.
| has_star = any(e.is_star for e in base_select_exprs) | |
| has_star = expression.is_star |
| # With SELECT *, all columns are already included so no extras are needed. | ||
| select_names = {e.alias_or_name for e in base_select_exprs} |
There was a problem hiding this comment.
| # With SELECT *, all columns are already included so no extras are needed. | |
| select_names = {e.alias_or_name for e in base_select_exprs} | |
| # With SELECT *, all columns are already included so no extras are needed. | |
| select_names = expression.named_selects |
| extra_cols = ( | ||
| dict.fromkeys( | ||
| col.name | ||
| for col in (base_where.find_all(exp.Column) if base_where else []) | ||
| if (not col.table or col.table == source_table_alias) and col.name not in select_names | ||
| ) | ||
| if not has_star | ||
| else {} | ||
| ) |
There was a problem hiding this comment.
I suggest rewriting this into a proper branch instead of a ternary expression, it's not easily readable now.
| # Anchor: seed LEVEL at 1 and start the ancestor path with the root's parent key value. | ||
| anchor_projs = [e for e in base_select_exprs if not _is_level_ref(e)] + [ | ||
| exp.column(c) for c in extra_cols | ||
| ] |
There was a problem hiding this comment.
There are quite a few traversals of base_select_exprs. I think I can count 6. Is it possible to merge some of those loops? Seems wasteful.
There was a problem hiding this comment.
There's a couple of optimizations, yes, but they're mostly doing different things. Will update.
c1f4d7f to
09ac7c9
Compare
Adds Snowflake to DuckDB transpilation for
START WITH ... CONNECT BY PRIORhierarchical queries.DuckDB rejects
CONNECT BYverbatim. This addsconnect_by_to_recursive_cteregistered as a preprocessor onexp.Select, which rewrites the query into aWITH RECURSIVE CTEat generation time:START WITHrows from the source table_child_rowjoined to theCTEaliased _parent_row on thePRIORpredicateSELECTre-projects from_rootcte, applyingWHEREpost-hierarchy and forwardingORDER BY, LIMIT, OFFSETFalls through unchanged for
NOCYCLE, multiplePRIORs, or non-equality predicates. CTE name collision is avoided viafind_new_name.WHEREcolumns absent fromSELECTare carried through the CTE so the outer filter can see them.LEVELis injected as a depth counter unless already projected.Tests in sqlglot-integration-tests cover: SELECT *, explicit column list with ORDER BY, PRIOR on left side, WHERE carry-through, existing CTE preservation, and name collision.