[SPARK-38404][SQL] Improve CTE resolution when a nested CTE references an outer CTE #36146
cc @cloud-fan, @maryannxue, @sigmod. As this is a regression from 3.1 to 3.2, it would be great to fix it in 3.3. cc @MaxGekk
cc @dtenedor
I updated this PR after we had the conversation here: #34929 (comment)
if (!(isLegacy || isCommand)) {
  cteDefs += cteRelation
}
// Prepending new CTEs makes sure that those have higher priority over outer ones.
How about priority between CTE relations at the same level? Previously we appended new CTE relations to resolvedCTERelations, which means the left-most relation had the highest priority, but it's different now.
It seems safer to keep resolvedCTERelations as it was, and when we call traverseAndSubstituteCTE or substituteCTE, we pass resolvedCTERelations ++ outerCTEDefs.
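The priority question can be illustrated with a toy model. This is only a sketch and not Spark's CTESubstitution code: `CteDef`, `lookup`, `innerDefs`, and `outerDefs` are hypothetical names, and exact string comparison stands in for whatever name resolver the real rule uses. It assumes a first-match-wins lookup, so putting the current level's definitions in front of the outer ones is what lets a nested CTE shadow an outer CTE of the same name.

```scala
// Toy model, not Spark's implementation: a CTE definition is a name plus a label.
final case class CteDef(name: String, body: String)

object CtePriority {
  // First match wins, so definitions earlier in the sequence shadow later ones.
  def lookup(name: String, visible: Seq[CteDef]): Option[CteDef] =
    visible.find(_.name == name)

  // Hypothetical definitions from an outer and a nested WITH clause.
  val outerDefs: Seq[CteDef] = Seq(CteDef("t", "outer t"), CteDef("u", "outer u"))
  val innerDefs: Seq[CteDef] = Seq(CteDef("t", "inner t"))

  // Current-level definitions first, outer ones after, mirroring the shape
  // `resolvedCTERelations ++ outerCTEDefs` suggested in the comment above.
  val visibleInNestedScope: Seq[CteDef] = innerDefs ++ outerDefs
}
```

With this ordering, a reference to `t` inside the nested scope finds the inner definition, while `u` still falls through to the outer one.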
I don't think we currently allow duplicate names at a given level:
WITH
^^^
t1 AS (SELECT 1),
t1 AS (SELECT 1 + (SELECT * FROM t1))
SELECT * FROM t1
org.apache.spark.sql.catalyst.parser.ParseException:
CTE definition can't have duplicate names: 't1'.(line 2, pos 0)
But if we allowed this construct, I would expect 2 as the result.
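The expected value of 2 follows if same-level definitions are processed left to right and each one sees only the definitions before it. The sketch below is a hypothetical evaluator for that reading, not anything Spark does (Spark rejects the query at parse time, as shown above); `SequentialCte`, `Scope`, and `evaluate` are invented for illustration, and each CTE is reduced to a single integer.

```scala
object SequentialCte {
  // Names already defined at this level, mapped to their (toy) integer value.
  type Scope = Map[String, Int]

  // Process definitions left to right; each body sees only earlier definitions,
  // and a later definition with the same name replaces the earlier one.
  def evaluate(defs: Seq[(String, Scope => Int)], query: Scope => Int): Int = {
    val scope = defs.foldLeft(Map.empty[String, Int]) { case (seen, (name, body)) =>
      seen + (name -> body(seen))
    }
    query(scope)
  }

  // t1 AS (SELECT 1), t1 AS (SELECT 1 + (SELECT * FROM t1)), SELECT * FROM t1
  val result: Int = evaluate(
    Seq(
      "t1" -> ((_: Scope) => 1),
      "t1" -> ((s: Scope) => 1 + s("t1"))
    ),
    s => s("t1")
  )
}
```

Under these assumptions the second `t1` reads the first one (1), yields 1 + 1, and the final query sees the second definition.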
makes sense
Thanks, merging to master!
Thanks for the review @cloud-fan.
}
if (cteDefs.isEmpty) {
  substituted
} else if (substituted eq lastSubstituted.get) {
  WithCTE(substituted, cteDefs.sortBy(_.id).toSeq)
I assume the order is guaranteed by other changes in this PR, so we are safe to remove this sortBy here?
Yes, because we add new CTE defs to cteDefs immediately after creation: https://github.com/apache/spark/pull/36146/files#diff-4d16a733f8741de9a4b839ee7c356c3e9b439b4facc70018f5741da1e930c6a8R234-R236
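The reasoning can be sketched in isolation. This is a simplified model under stated assumptions, not the code in the PR: `Def`, `CteOrder`, `newDef`, and `collect` are hypothetical, and ids are assumed to come from a monotonically increasing counter. If every definition is appended to the buffer right after it is created, the buffer is already in id order and sorting it again changes nothing.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a CTE definition; only the id ordering matters here.
final case class Def(id: Long, name: String)

object CteOrder {
  private var nextId = 0L

  // Assumption: ids are handed out by a monotonically increasing counter.
  private def newDef(name: String): Def = {
    val d = Def(nextId, name)
    nextId += 1
    d
  }

  // Create each definition and append it to the buffer immediately,
  // so insertion order and id order coincide.
  def collect(names: Seq[String]): ArrayBuffer[Def] = {
    val cteDefs = ArrayBuffer.empty[Def]
    names.foreach { n =>
      cteDefs += newDef(n)
    }
    cteDefs
  }
}
```

If creation and collection were separated (for example, definitions created in one pass and gathered in another order), this guarantee would no longer hold and an explicit sort would be needed again.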
…s an outer CTE

### What changes were proposed in this pull request?
Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with apache#34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

### Why are the changes needed?
Small improvement + additional tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test case.

Closes apache#36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Please note that the bug in SPARK-38404 is already fixed by #34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.
Why are the changes needed?
Small improvement + additional tests.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new test case.