Skip to content

feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709

Open
fivetran-kwoodbeck wants to merge 5 commits into
mainfrom
transpile/snowflake-startswith
Open

feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY#7709
fivetran-kwoodbeck wants to merge 5 commits into
mainfrom
transpile/snowflake-startswith

Conversation

@fivetran-kwoodbeck

Copy link
Copy Markdown
Collaborator

Adds Snowflake to DuckDB transpilation for START WITH ... CONNECT BY PRIOR hierarchical queries.

DuckDB rejects CONNECT BY verbatim. This adds connect_by_to_recursive_cte registered as a preprocessor on exp.Select, which rewrites the query into a WITH RECURSIVE CTE at generation time:

  • Anchors START WITH rows from the source table
  • Recursive arm: source table aliased _child_row joined to the CTE aliased _parent_row on the PRIOR predicate
  • Outer SELECT re-projects from _rootcte, applying WHERE post-hierarchy and forwarding ORDER BY, LIMIT, OFFSET

Falls through unchanged for NOCYCLE, multiple PRIORs, or non-equality predicates. CTE name collision is avoided via find_new_name. WHERE columns absent from SELECT are carried through the CTE so the outer filter can see them. LEVEL is injected as a depth counter unless already projected.

Tests in sqlglot-integration-tests cover: SELECT *, explicit column list with ORDER BY, PRIOR on left side, WHERE carry-through, existing CTE preservation, and name collision.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

SQLGlot Integration Test Results

❌ 4 regressions — see details below

Comparing:

  • this branch (sqlglot:transpile/snowflake-startswith @ sqlglot 0481e32)
  • baseline (main @ sqlglot 87b1da8)

By Dialect

dialect main feature branch transitions links
databricks -> databricks 9982/11820 passed (84.5%) 1366/1370 passed (99.7%) 4 pass -> fail full result / delta

Overall

main: 192441 total, 153536 passed (pass rate: 79.8%)

sqlglot:transpile/snowflake-startswith: 106860 total, 106856 passed (pass rate: 100.0%)

Transitions:
4 pass -> fail

Dialect pair changes: 0 previous results not found, 2 current results not found

❌ 4 regressions (view logs)

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 90cc2e1 to 9c55e84 Compare June 5, 2026 20:19
@georgesittas georgesittas self-requested a review June 8, 2026 10:16
@georgesittas georgesittas self-assigned this Jun 8, 2026
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
existing_where = expression.args.get("where")
existing_with = expression.args.get("with_")

# WHERE columns absent from SELECT must be carried through the CTE so the outer filter can see them.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you have SELECT * FROM x WHERE c > 1 START WITH ...? Does the extra_cols logic result in duplicating c in the projection list? That may cause ambiguity issues if that's the case. Let's check this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra_cols block is guarded by if not has_star else {} above so when the query uses SELECT *, extra_cols is always empty, so c is not added to the projection list twice. There's a specific test to cover this case (SELECT * FROM (SELECT * FROM emp ...) WHERE c > 1) and verifies there's no duplication.

Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
Comment thread sqlglot/generators/duckdb.py Outdated
@fivetran-kwoodbeck fivetran-kwoodbeck changed the title feat(duckdb): Add transpilation support for Snowflake STARTS WITH feat(duckdb): Add transpilation support for Snowflake STARTS WITH ... CONNECT BY Jun 8, 2026
@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from 9c55e84 to bb9744a Compare June 9, 2026 14:04
@georgesittas

Copy link
Copy Markdown
Collaborator

Let me know when this is ready for another review.

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from bb9744a to c1f4d7f Compare June 11, 2026 14:03
Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +719 to +721
connect_pred = connect.args.get("connect")
if not connect_pred:
return expression

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a required arg:

Suggested change
connect_pred = connect.args.get("connect")
if not connect_pred:
return expression
connect_pred = connect.args["connect"]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError: 'dict' object is not callable, I assume you're suggesting to just remove the if?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant both removing if and switching to [] syntax instead of get().

Comment on lines +723 to +726
priors = list(connect_pred.find_all(exp.Prior))
if len(priors) != 1:
return expression

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difficulty with transpiling multiple PRIORs? Isn't it a matter of combining the predicates with AND?

@fivetran-kwoodbeck fivetran-kwoodbeck Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For multiple PRIOR, we can add support at this spot fairly easily, but it makes cycle detection below quite a bit more complicated. We have a few options: 1) Remove cycle detection (Snowflake doesn't support it, so it's not strictly necessary to match Snowflake's behavior), 2) Simple fix to add in multiple PRIOR but leave cycle detection on one column only, 3) do the full correct cycle detection, probably another 30-40 lines of code to an already overly complex function.

I'd vote for Option 1 and call cycle detection out of scope for this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snowflake doesn't support it, so it's not strictly necessary to match Snowflake's behavior

I was under the impression that it does support it, through the NOCYCLE specifier. If it doesn't, then the whole cycle detection trick we implemented can be cleaned up, otherwise, if what I say is correct, we should only emit that code when NOCYCLE is present to match Snowflake 100%.

I think it's worth to outline what the solution for handling >1 PRIOR clauses will look like, simply for future reference. Other than that, I'm happy to pause here and fall back to unsupported for now. We can always revisit this later.

Comment on lines +727 to +729
prior = priors[0]
if not isinstance(prior.this, exp.Column):
return expression

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever not be a Column and still be valid Snowflake? If not, then we don't need this, as the input itself would be invalid and sqlglot doesn't need to validate it or take it into account.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it can be non Column. If we have PRIOR (x + 1), it will put Prior(Add(...)) in the AST. The existing tests fail if this if is removed.

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +745 to +749
def _is_level_ref(e: exp.Expr) -> bool:
col = e.this if isinstance(e, exp.Alias) else e
return isinstance(col, exp.Column) and col.name.upper() == "LEVEL" and not col.table

has_level = any(_is_level_ref(e) for e in base_select_exprs)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also have SELECT level + 1 ..., for example. I don't think you'd detect level in that case.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll make it find_all to look for a LEVEL.

Comment thread sqlglot/generators/duckdb.py Outdated
return isinstance(col, exp.Column) and col.name.upper() == "LEVEL" and not col.table

has_level = any(_is_level_ref(e) for e in base_select_exprs)
has_star = any(e.is_star for e in base_select_exprs)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
has_star = any(e.is_star for e in base_select_exprs)
has_star = expression.is_star

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +753 to +754
# With SELECT *, all columns are already included so no extras are needed.
select_names = {e.alias_or_name for e in base_select_exprs}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# With SELECT *, all columns are already included so no extras are needed.
select_names = {e.alias_or_name for e in base_select_exprs}
# With SELECT *, all columns are already included so no extras are needed.
select_names = expression.named_selects

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +755 to +763
extra_cols = (
dict.fromkeys(
col.name
for col in (base_where.find_all(exp.Column) if base_where else [])
if (not col.table or col.table == source_table_alias) and col.name not in select_names
)
if not has_star
else {}
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest rewriting this into a proper branch instead of a ternary expression, it's not easily readable now.

Comment thread sqlglot/generators/duckdb.py Outdated
Comment on lines +768 to +771
# Anchor: seed LEVEL at 1 and start the ancestor path with the root's parent key value.
anchor_projs = [e for e in base_select_exprs if not _is_level_ref(e)] + [
exp.column(c) for c in extra_cols
]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few traversals of base_select_exprs. I think I can count 6. Is it possible to merge some of those loops? Seems wasteful.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of optimizations, yes, but they're mostly doing different things. Will update.

@fivetran-kwoodbeck fivetran-kwoodbeck force-pushed the transpile/snowflake-startswith branch from c1f4d7f to 09ac7c9 Compare June 15, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants