Skip to content

Conversation

@leoyvens
Copy link
Contributor

Which issue does this PR close?

Closes #10875.

Rationale for this change

This logic is extracted from SessionState::resolve_table_references, but can be used without a context making it more flexible as a helper function for catalog management.

What changes are included in this PR?

The resolve_table_references function will only return the tables from the information schema if it actually implicitly requires them due to a SHOW statement.

A new public function catalog::resolve_table_references is added, containing the same logic as SessionState::resolve_table_references but in a more flexible interface.

Are these changes tested?

Yes there is a doctest and it's well covered by sqllogictests as it's executed as part of SQL planning.

Are there any user-facing changes?

A new public function is added.

The behaviour of SessionState::resolve_table_references around the information schema is changed, which could theoretically break users relying on that directly, but it's quite an edge case.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 11, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this PR @leoyvens -- this is a very nice API to expose.

I think this PR is quite close.

Thanks @jonahgao for the review

@leoyvens leoyvens force-pushed the statement-resolve-table-references branch from f467b41 to a15c149 Compare June 13, 2024 14:14
@leoyvens
Copy link
Contributor Author

Thank you for the careful review, I have addressed all comments.

The CTE case had not occurred to me so thanks for pointing it out. I have separated the CTE list in the return value, returning a (table_refs, ctes) tuple, as I suspect most uses of this are not interested in CTE names and would rather exclude them.

@jonahgao
Copy link
Member

jonahgao commented Jun 13, 2024

Thank you for the careful review, I have addressed all comments.

The CTE case had not occurred to me so thanks for pointing it out. I have separated the CTE list in the return value, returning a (table_refs, ctes) tuple, as I suspect most uses of this are not interested in CTE names and would rather exclude them.

Thank you @leoyvens . But I prefer the previous implementation and take alamb's approach to remind users in the documentation. So that if users find a table reference that does not exist in their catalog, they do not immediately judge it as an error because it could be a CTE name.

Implementing a separated CTE list could be error-prone, especially considering the scope of CTEs. For example, in WITH t AS (SELECT * FROM t) SELECT * FROM t, the first 't' may reference an ordinary table, and the second 't' references a CTE.
And we already have specialized logic to handle this problem during planning.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 14, 2024
@leoyvens
Copy link
Contributor Author

Thanks @jonahgao for pointing out that shadowing case, I didn't know that was possible and the change really would have introduced a bug. I've added a sql test to make sure we don't regress there.

But I did give another shot at correctly separating ordinary table refs from CTE refs. Let me know if the approach still has problems and there are other edges cases.

@leoyvens leoyvens force-pushed the statement-resolve-table-references branch from 80ae34f to 8a3cfc5 Compare June 14, 2024 10:45
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me -- thank you @leoyvens for your attention to detail

I also think we should wait a day or two before merging this to make sure @jonahgao has a chance to give it another review if desired.

Thanks again

@alamb alamb changed the title add catalog::resolve_table_references Add catalog::resolve_table_references Jun 14, 2024
@jonahgao
Copy link
Member

jonahgao commented Jun 14, 2024

Hi @leoyvens , for the following query, the table reference in the subquery on the right side of the UNION will still be shadowed by the cte.

(with t as (select 1) select * from t) union (select * from t);

The situation becomes more complicated when working with nested CTEs and recursive CTEs.
A recursive CTE can reference itself internally, which is different from a non-recursive CTE.

I think there might be three possible solutions.

  1. The initial one.
  2. Make this function not support CTE, that is, report an error when a CTE is detected.
  3. Achieve full-fledged CTE support. This might require a structure similar to PlannerContext to store the visible CTEs within the current query scope. Each query has its own PlannerContext, which subqueries can inherit and override. We might also need to handle recursive CTEs specially.

For the latter two solutions, I think it need to be implemented as a standalone function, meaning SeesionState should no longer use it. This is due to the lack of CTE support or the presence of redundant work.

@leoyvens
Copy link
Contributor Author

@jonahgao You're a SQL grammar wizard, I didn't know nested CTEs were even a thing, thanks for pointing out all those edge cases. I've handled and tested them all to the best of my understanding, by using fn post_visit_query to pop the CTEs out of scope. Thanks for this review I'm learning a lot.

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thank you @leoyvens. I think this is an improvement because it no longer searches for CTEs from the catalog.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @leoyvens and @jonahgao

# Name shadowing:
# The first `t` refers to the table, the second to the CTE.
query I
WITH t AS (SELECT * FROM t where t.a < 2) SELECT * FROM t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this query fails on main:

DataFusion CLI v39.0.0
> WITH t AS (SELECT * FROM t where t.a < 2) SELECT * FROM t;
Error during planning: table 'datafusion.public.t' not found

@alamb alamb merged commit c4fd754 into apache:main Jun 17, 2024
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* resolve information_schema references only when necessary

* add `catalog::resolve_table_references` as a public utility

* collect CTEs separately in resolve_table_references

* test CTE name shadowing

* handle CTE name shadowing in resolve_table_references

* handle unions, recursive and nested CTEs in resolve_table_references
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Helper to list all table references in a SQL query

3 participants