Skip to content

[SPARK-38404][SQL] Improve CTE resolution when a nested CTE references an outer CTE #36146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Apr 11, 2022

What changes were proposed in this pull request?

Please note that the bug in the SPARK-38404 is fixed already with #34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

Why are the changes needed?

Small improvement + additional tests.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new test case.

@github-actions github-actions bot added the SQL label Apr 11, 2022
@peter-toth peter-toth changed the title Spark 38404 nested cte references outer cte [SPARK-38404][SQL] Fix nested CTE references outer CTE Apr 11, 2022
@peter-toth peter-toth force-pushed the SPARK-38404-nested-cte-references-outer-cte branch from 6e3e1e8 to 92cf4ca Compare April 11, 2022 17:38
@peter-toth peter-toth changed the title [SPARK-38404][SQL] Fix nested CTE references outer CTE [SPARK-38404][SQL] Fix CTE resolution when a nested CTE references an outer CTE Apr 11, 2022
@peter-toth
Copy link
Contributor Author

cc @cloud-fan, @maryannxue, @sigmod

As this is a regression from 3.1 to 3.2, it would be great to fix it in 3.3. cc @MaxGekk

@peter-toth peter-toth closed this Apr 12, 2022
@peter-toth peter-toth reopened this Apr 12, 2022
@sigmod
Copy link
Contributor

sigmod commented Apr 12, 2022

cc @dtenedor

@peter-toth peter-toth force-pushed the SPARK-38404-nested-cte-references-outer-cte branch from 92cf4ca to 7b1f741 Compare April 19, 2022 13:58
@peter-toth peter-toth changed the title [SPARK-38404][SQL] Fix CTE resolution when a nested CTE references an outer CTE [SPARK-38404][SQL] Improve CTE resolution when a nested CTE references an outer CTE Apr 19, 2022
@peter-toth
Copy link
Contributor Author

I updated this PR after we had the conversation here: #34929 (comment)

if (!(isLegacy || isCommand)) {
cteDefs += cteRelation
}
// Prepending new CTEs makes sure that those have higher priority over outer ones.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about priority between CTE relations at the same level? Previously we append new CTE relations to resolvedCTERelations which means the left-most relation has highest priority, but it's different now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems safer to keep resolvedCTERelations as it was, and when we call traverseAndSubstituteCTE or substituteCTE, we pass resolvedCTERelations ++ outerCTEDefs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we currently allow duplicate names at a given level:

WITH
^^^
  t1 AS (SELECT 1),
  t1 AS (SELECT 1 + (SELECT * FROM t1))
SELECT * FROM t1
    

org.apache.spark.sql.catalyst.parser.ParseException: 
CTE definition can't have duplicate names: 't1'.(line 2, pos 0)

But if we allowed this construct I would expect 2 as result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 488f392 Apr 20, 2022
@peter-toth
Copy link
Contributor Author

Thanks for the review @cloud-fan.

}
if (cteDefs.isEmpty) {
substituted
} else if (substituted eq lastSubstituted.get) {
WithCTE(substituted, cteDefs.sortBy(_.id).toSeq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the order is guaranteed by other changes in this PR so we are safe to remove this sortBy here?

Copy link
Contributor Author

@peter-toth peter-toth Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peter-toth added a commit to peter-toth/spark that referenced this pull request Sep 1, 2022
…s an outer CTE

### What changes were proposed in this pull request?
Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with apache#34929.
This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions.

### Why are the changes needed?
Small improvement + additional tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test case.

Closes apache#36146 from peter-toth/SPARK-38404-nested-cte-references-outer-cte.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants