[SPARK-31257][SPARK-33561][SQL] Unify create table syntax #28026
Conversation
Could you please pick up SPARK-31257 instead, since that JIRA issue better explains the situation we are in? It's still good to keep the PR title as it is, as we tend to explain the problem in JIRA and explain how to deal with it in the PR.
I understand that we want to make Hive implement the v2 API eventually, but can we focus on syntax unification right now? Let's not change the behavior; for example, EXTERNAL should still be disallowed when creating native data source tables.
I don't know who marked these comments as resolved, so please correct me if I'm wrong, but given that @cloud-fan replied to a resolved comment, it doesn't seem that @cloud-fan marked them as resolved. Could we make sure comments are marked as resolved only when both sides agree?
I haven't looked at the details because the amount of code change is not minor, and now I wonder whether I'm qualified, since this doesn't only touch the parser side.
Could you please add guide comments yourself to reduce the burden on reviewers? For example, note which methods were moved to another class (even if modified a bit). Without such guidance, every changed line is something we have to focus on.
Same here: please describe a TODO list for this PR before removing the WIP label. If I skimmed the code correctly, this requires end users to add an explicit clause to their queries. It's OK if one of the TODOs is to let some clauses change the provider, but then we should take care of documentation to make clear which clauses change the provider. (Personally I'm not in favor of this, but otherwise why not just add an explicit marker?) If the change is intentional, that's in my favor, but I'm concerned about removing the compatibility flag, as that will force end users to migrate their queries one way or another.
That was me, because I thought they had fairly straightforward and definitive answers. We can still discuss them more, since that apparently wasn't the case.
That's not quite right. The parser now creates a statement plan for both syntaxes. The next step is to convert that statement to a logical plan. At the moment, this only converts to v2 plans, because we no longer use v1 and this was ported from our internal version. The interpretation of that statement for v1 -- converting to v1 plans -- has not been done yet.
Speaking of well-formed, I think it might be reasonable to ensure that either a provider or serde info is present, but not both. That would raise an exception if both USING and STORED AS appear in the same statement.

Coming back to the current implementation for the v2 interface: Spark is passing through everything that was parsed, because Spark doesn't know what kind of table the catalog is going to create. The v2 catalog that we use can create Hive tables, Spark datasource tables, and Iceberg tables, so all of the information needs to be passed through -- again, with the requirement that the SQL was well-formed.
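A minimal sketch of that well-formedness rule, assuming illustrative names (`validateCreateTable` is hypothetical and the exception type is simplified; this is not the PR's actual code):

```scala
// Sketch of the check described above: a CREATE TABLE statement may carry a
// provider (USING) or Hive serde info (STORED AS / ROW FORMAT SERDE), but not
// both. Names and exception type are illustrative only.
case class SerdeInfo(
    storedAs: Option[String] = None,
    serde: Option[String] = None,
    serdeProperties: Map[String, String] = Map.empty)

def validateCreateTable(
    provider: Option[String],
    serdeInfo: Option[SerdeInfo]): Unit = {
  if (provider.isDefined && serdeInfo.isDefined) {
    throw new IllegalArgumentException(
      "CREATE TABLE cannot specify both USING and STORED AS/ROW FORMAT SERDE")
  }
}
```

The point of rejecting this at parse time is that it is a malformed statement regardless of which catalog handles it, unlike per-source feature decisions.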
From Hive's documentation for managed tables: Hive allows a managed table to specify a custom LOCATION. Since this is a valid configuration for a table, it must be possible to pass it through to a v2 catalog.
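For concreteness, this is the kind of statement being discussed; the table name and path below are made up, and an ambient SparkSession `spark` is assumed:

```scala
// A managed Hive-format table (no EXTERNAL keyword) with a custom LOCATION.
// Hive accepts this configuration (the metastore emits a warning), so the
// argument above is that Spark should pass the location through to the v2
// catalog rather than rejecting it in the parser.
spark.sql("""
  CREATE TABLE db.events (id BIGINT, data STRING)
  STORED AS PARQUET
  LOCATION '/warehouse/custom/events'
""")
```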
Thanks for the detailed answer, really appreciated. I misunderstood something, as you noted, but it seems to come from parts that are not addressed yet, where the plan was not shared. Without the detailed plan (TODO list) being shared, I wonder whether the review can be done correctly. Maybe we'd better wait for you to make more progress?

Btw, this change doesn't seem trivial, but the PR description is short on information -- one line per major change. I think we tend to describe at least how it works at a high level in the PR description, so that the commit message itself becomes informative. We may want to get that in once the PR is ready to merge.

Back to the comment: your plan looks promising to me, but I'm not an expert on DSv2 (what I originally pointed out was ambiguous parser syntax, not DSv2), so I feel I'm not qualified.
Would we like to disallow STORED AS/SERDE clauses when the provider is specified via the USING clause? That would still be OK if we provide a good error message and guide users to alternatives (in either the docs or the error message).
This is similar to the table option/property discussion. Hive has more features (a more flexible parser) than Spark, but that doesn't mean Spark needs to accept them all. When we introduced the Spark-native CREATE TABLE syntax, we evaluated the Hive syntax and decided to forbid the EXTERNAL keyword. Hive requires LOCATION to be specified if EXTERNAL is present, but also allows LOCATION to be specified even when there is no EXTERNAL. So the missing feature is: creating a managed table with a custom location. This feature is weird, and the Hive metastore actually gives a warning message if a client creates a managed table with a custom location, which implies that Hive doesn't recommend this feature either.

I think we shouldn't put EXTERNAL into the v2 API, and we should probably forbid EXTERNAL for Hive tables as well eventually, to be consistent with Spark-native tables.
I don't think Spark needs to make a choice on this for the v2 API. It is up to the catalog which features of a table format are useful or not. Because Spark supports parsing EXTERNAL, it makes no sense to arbitrarily suppress it. That would require more code in Spark that is specific to what is now a responsibility delegated to a catalog. It breaks the abstraction.
What you're suggesting is to maintain compatibility with Hive SQL, but arbitrarily limit what a source connector can do. How about this: should users be disallowed from using the EXTERNAL keyword just because one catalog chooses not to support it?

Hopefully you see my point: making decisions about what catalogs should support is not a reasonable thing for Spark to do. Spark should reject what is not well-formed, but should not decide details for specific sources. And to be clear, if you want to do this in Spark's built-in Hive catalog/connector, then that's fine with me. But this kind of decision should be made by the catalog, not in the parser or in logical plans.
You made a point that the same problem occurs in other cases. IIRC the v2 catalog API was designed based on the native CREATE TABLE syntax, i.e. provider, table properties, etc. These Hive-specific syntaxes were not considered.
Hive syntax was considered. Anything specific to a source is passed to it as table properties, and in this PR the Hive serde clauses are passed through the same way.
Again, the only thing that Spark should be doing here is ensuring the data is well-formed and then passing it directly to the source.
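As a sketch of what passing serde info through as table properties could look like — the function name and property keys here are hypothetical, not necessarily the ones this PR uses:

```scala
// Hypothetical flattening of parsed Hive clauses into the key/value properties
// map that a v2 catalog receives via TableCatalog.createTable. The property
// key names are illustrative only.
def serdeToProperties(
    storedAs: Option[String],
    serde: Option[String],
    serdeProperties: Map[String, String],
    external: Boolean): Map[String, String] = {
  Map("external" -> external.toString) ++
    storedAs.map(v => "hive.stored-as" -> v) ++
    serde.map(v => "hive.serde" -> v) ++
    serdeProperties.map { case (k, v) => s"hive.serde.property.$k" -> v }
}
```

The catalog then decides what to do with these properties: a Hive-capable catalog can honor them, while a catalog that cannot create Hive-format tables can reject them.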
@cloud-fan, I've implemented the conversions to v1 plans and updated the tests like the other commands that we moved to use statements. Tests from DDLParserSuite are now in PlanResolutionSuite. I also needed to move a create test from SparkSqlParserSuite into PlanResolutionSuite for the same reason.

The create test in SparkSqlParserSuite highlighted an existing problem: the change to make USING the default is not applied everywhere. Because of this problem, I think we should either roll back the patch that makes USING the default in 3.0, since it is not applied everywhere, or get this syntax unification into 3.0.
.add("b", StringType) | ||
) | ||
) | ||
withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "true") { |
@cloud-fan, this is the SQL that uses Hive without this patch, because of the PARTITIONED BY clause (and the statement using this config option below).
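As a reconstruction (not the exact test SQL) of the kind of statement being discussed, assuming an ambient SparkSession `spark`:

```scala
// Without this patch, a CREATE TABLE with no USING clause but a Hive-style
// PARTITIONED BY (name type) spec is matched by the Hive parser rule, so it
// creates a Hive table even though USING was made the default elsewhere.
spark.sql("""
  CREATE TABLE part_table (a INT, b STRING)
  PARTITIONED BY (ds STRING)
""")
```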
@HeartSaVioR, here's some additional info to give you context. When we convert plans to v2, we alter the parser to produce a statement plan instead of a specific logical plan. Statement plans are then converted to either the exact same v1 logical plan as before, or to a v2 plan. To validate that we are producing the exact same logical plans, we move tests from the spark-sql module's DDLParserSuite into PlanResolutionSuite.

In addition, we add tests to spark-catalyst's DDLParserSuite to check the parsed statement plans.
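A simplified sketch of that flow; the names below are illustrative stand-ins (the real classes are statement plans such as CreateTableStatement in catalyst, resolved by analyzer rules in sql/core):

```scala
// Sketch of the parse -> statement -> resolved-plan flow described above.
// The parser emits an unresolved "statement" node; a resolution rule later
// turns it into either the same v1 plan as before or a v2 plan.
sealed trait PlanSketch
case class CreateTableStatementSketch(
    name: Seq[String],
    usesSessionCatalog: Boolean) extends PlanSketch
case class V1CreateTable(name: Seq[String]) extends PlanSketch  // legacy path
case class V2CreateTable(name: Seq[String]) extends PlanSketch  // TableCatalog path

def resolve(stmt: CreateTableStatementSketch): PlanSketch =
  if (stmt.usesSessionCatalog) V1CreateTable(stmt.name)  // exact same v1 plan as before
  else V2CreateTable(stmt.name)
```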
Sure, I totally agree. That's one of the issues I mentioned in the mail thread. It's wrong if someone simply assumes that the presence of certain keywords in a query means it should create a Hive table.
That's in line with my proposal as well. The ideal approach would be reverting SPARK-30098 in Spark 3.0 and making this correct via syntax unification in a later version. I've also proposed some alternatives (add a marker, turn the legacy config on by default) but haven't heard any feedback.
retest this please
retest this please
retest this please
Seems the Hive tests are very flaky.
retest this please
retest this please
retest this please
This PR is definitely not related to SparkR. Since Jenkins is flaky right now, I'm merging it first, and I will keep watching the Jenkins status.
Thanks, merging to master!
Thanks for merging this, @cloud-fan! And thanks for the PRs as well.
Hi, @cloud-fan and @rdblue. It's good to have this, but it seems to break the Scala 2.13 compilation. cc @srowen and @HyukjinKwon
I made a follow-up.
What changes were proposed in this pull request?

- Unify the native and Hive CREATE TABLE parser rules
- Add SerdeInfo and an external boolean to the statement plans, and update AstBuilder to produce them

Why are the changes needed?

The parser currently has two separate CREATE TABLE rules, and which one applies is ambiguous (see SPARK-31257 and the SPARK-30098 discussion above); unifying the syntax removes that ambiguity.
Does this PR introduce any user-facing change?
Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses.
How was this patch tested?
Existing tests validate there are no behavior changes.
Updated unit tests to use a statement plan for the Hive create syntax.
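As an illustration of the kind of check involved (simplified; `CatalystSqlParser.parsePlan` is the real parser entry point, but the test scaffolding and the assertion style are omitted here):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// With the unified rule, Hive create syntax should parse into a statement
// plan that carries the serde info, rather than being handled by a separate
// Hive-only parser rule.
val plan = CatalystSqlParser.parsePlan(
  "CREATE TABLE t (a INT, b STRING) STORED AS PARQUET")
println(plan.getClass.getSimpleName)  // expected: a CreateTableStatement-style plan
```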