
[SPARK-31257][SPARK-33561][SQL] Unify create table syntax #28026


Closed
wants to merge 11 commits into master from unify-create-table

Conversation

@rdblue (Contributor) commented Mar 25, 2020

What changes were proposed in this pull request?

  • Unify the create table syntax in the parser by merging Hive and DataSource clauses
  • Add SerdeInfo and external boolean to statement plans and update AstBuilder to produce them
  • Add conversion from create statement plan to v1 create plans in ResolveSessionCatalog
  • Support new statement clauses in ResolveCatalogs conversion to v2 create plans
  • Remove SparkSqlParser rules for Hive syntax
  • Add "option." namespace to distinguish SERDEPROPERTIES and OPTIONS in table properties

Why are the changes needed?

  • Current behavior is confusing.
  • A way to pass the Hive create options to DSv2 is needed for a Hive source.

Does this PR introduce any user-facing change?

Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses.

How was this patch tested?

Existing tests validate there are no behavior changes.

Update unit tests to use a statement plan for the Hive create syntax:

  • Move create tests from spark-sql DDLParserSuite into PlanResolutionSuite
  • Add parser tests to spark-catalyst DDLParserSuite

@HeartSaVioR (Contributor)

Could you please use SPARK-31257 as the JIRA issue instead, since it better explains the situation we are in? It's still fine to keep the PR title as it is, as we tend to explain the problem in JIRA and how to deal with it in the PR.

@rdblue changed the title [SPARK-18885][SQL] Unify create table syntax (WIP) → [SPARK-31257][SQL] Unify create table syntax (WIP) on Mar 25, 2020
@SparkQA commented Mar 26, 2020

Test build #120376 has finished for PR 28026 at commit 0522c93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

I understand that we want to make Hive implement the v2 API eventually, but can we focus on syntax unification right now? Let's not change the behavior; for example, EXTERNAL should still be disallowed when creating native data source tables.

@HeartSaVioR (Contributor)

I don't know who marked the comments as resolved, so please correct me if I'm wrong, but given that @cloud-fan replied to a resolved comment, it doesn't seem that @cloud-fan marked them as resolved.

Could we make sure comments are marked as resolved only when both sides agree?

@HeartSaVioR (Contributor) left a review comment:

I haven't looked at the details, because the amount of code change is not small, and now I wonder whether I'm qualified, since it doesn't only touch the parser side.

Could you please add guide comments yourself to reduce the burden of review? For example, point out the methods you moved to another class (even ones you modified a bit). Without such guidance, every changed line is something we have to focus on.

@HeartSaVioR (Contributor) commented Mar 29, 2020

Same here: please describe the TODO list for this PR before removing WIP.

If I skimmed the code correctly, this "requires" end users to add USING hive even when they use a Hive-specific create table clause. Do I understand correctly?

It's OK if one of the TODOs is to let some clauses change the provider, but then we should take care that the documentation makes clear which clauses change the provider. (Personally I'm not in favor of this; otherwise, why not just add HIVE to the second syntax, since both approaches require end users to change their query, and adding HIVE is clearer?)

If the change is intentional, that's what I'd prefer, but I would be concerned about removing the compatibility flag, as it will force end users to migrate their queries either way.

@rdblue (Contributor, Author) commented Mar 30, 2020

I don't know who marked the comments as resolved

That was me, because I thought they had fairly straightforward and definitive answers. We can still discuss them further, since that apparently wasn't the case.

If I skimmed the code correctly, this "requires" end users to add USING hive even when they use a Hive-specific create table clause.

That's not quite right.

The parser now creates a CreateTableStatement that contains everything that was parsed from the SQL string, with minimal interpretation. That is what the user requested, assuming that request was well-formed -- meaning that it doesn't have duplicate clauses or mixed partition fields.

The next step is to convert that statement to a plan. At the moment, this only converts to v2 plans, because we no longer use v1 and this was ported from our internal version. The interpretation of that statement for v1 -- converting to v1 CreateTable -- has not yet been added. I plan to implement that like this (see the sketch after the list):

  • If provider is present from USING, set the CatalogTable provider and validate that no SerdeInfo is set.
  • If SerdeInfo is present, validate that provider was not set and use a Hive CatalogTable.
  • If neither SerdeInfo nor provider is set, use SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED or similar setting to set a default provider or default serde properties.
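A minimal Scala sketch of that three-way check, with hypothetical names (the real rule would live in ResolveSessionCatalog; SerdeInfo is the statement-plan class this PR adds):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.internal.SQLConf

// Sketch only: decide the v1 provider from the parsed clauses.
def chooseV1Provider(
    provider: Option[String],    // from USING
    serde: Option[SerdeInfo],    // from STORED AS / ROW FORMAT SERDE
    conf: SQLConf): String = {
  (provider, serde) match {
    case (Some(_), Some(_)) =>
      // well-formedness: USING and Hive serde clauses are mutually exclusive
      throw new AnalysisException("Cannot specify both USING and STORED AS/SERDE")
    case (Some(p), None) => p          // explicit provider wins
    case (None, Some(_)) => "hive"     // serde clauses imply a Hive CatalogTable
    case (None, None) =>
      // neither is present: fall back to the legacy default
      if (conf.getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED)) "hive"
      else conf.defaultDataSourceName
  }
}
```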

Speaking of well-formed, I think it might be reasonable to ensure that either a provider or serde info is present, but not both. That would raise an exception if both USING and STORED AS or SERDE clauses were used. I think that's probably reasonable. In that case, we could skip the checks above. What do you think?

Coming back to the current implementation for the v2 interface: Spark is passing through everything that was parsed because Spark doesn't know what kind of table the catalog is going to create. The v2 catalog that we use can create Hive tables, Spark datasource tables, and Iceberg tables, so all of the information needs to be passed through -- again, with the requirement that the SQL was well-formed.

@rdblue (Contributor, Author) commented Mar 30, 2020

Spark hides the EXTERNAL TABLE concept from users

From Hive's documentation for managed tables:

The default location can be overridden by the location property during table creation

Hive allows EXTERNAL with a custom LOCATION, so I don't think that Spark should prevent this from being passed to catalogs. A catalog may choose to allow this. Spark can prevent its internal catalog from creating such tables, but it can't prevent them from existing.

Since this is a valid configuration for a table, it must be possible to pass this through to a v2 catalog.
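For concreteness, this is the kind of DDL in question, as run from a Hive-enabled spark-shell (table and path names are illustrative):

```scala
// EXTERNAL plus a custom LOCATION: valid Hive DDL that, under this PR,
// is recorded on the statement plan and forwarded for the catalog to
// accept or reject.
spark.sql("""
  CREATE EXTERNAL TABLE db.events (id BIGINT, data STRING)
  STORED AS PARQUET
  LOCATION '/warehouse/custom/events'
""")
```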

@HeartSaVioR (Contributor)

Thanks for the detailed answer. Much appreciated.

I misunderstood something, as you noted, but the misunderstanding seems to come from parts that are not addressed yet and a plan that was not shared. Without the detailed plan (TODO list) being shared, I wonder whether the review can be done correctly. Maybe we'd better wait for you to make more progress?

Btw, this change doesn't seem to be trivial, but the PR description is short on information: one line per major change. I think we tend to describe at least how it works at a high level in the PR description, so that the commit message itself becomes informative. We may want that once the PR is ready to merge.

Back to the comment: your plan looks promising to me, but I'm not an expert on DSv2 (what I originally pointed out was the ambiguous parser syntax, not DSv2), so I don't feel qualified to judge.

Speaking of well-formed, I think it might be reasonable to ensure that either a provider or serde info is present, but not both. That would raise an exception if both USING and STORED AS or SERDE clauses were used. I think that's probably reasonable. In that case, we could skip the checks above. What do you think?

Would we want to disallow STORED AS/SERDE clauses if the provider is specified as Hive via the USING clause? That would still be OK if we provide a good error message and guide users to alternatives (in the docs or the error message).

@SparkQA commented Mar 31, 2020

Test build #120620 has finished for PR 28026 at commit 12a17fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Hive allows EXTERNAL with a custom LOCATION, so I don't think that Spark should prevent this from being passed to catalogs.

This is similar to the table option/property discussion. Hive has more features (a more flexible parser) than Spark, but that doesn't mean Spark needs to accept them all.

When we introduced the Spark-native CREATE TABLE syntax, we evaluated the Hive syntax and decided to forbid the EXTERNAL keyword. Hive requires LOCATION to be specified if EXTERNAL is present, but also allows LOCATION to be specified even when there is no EXTERNAL. So the missing feature is creating a managed table with a custom location. This feature is weird, and the Hive metastore actually gives a warning message if a client creates a managed table with a custom location, which implies that Hive doesn't recommend this feature either.

I think we shouldn't put EXTERNAL into the v2 API, and we should probably forbid EXTERNAL for Hive tables as well eventually, to be consistent with Spark-native tables.

@rdblue (Contributor, Author) commented Mar 31, 2020

Hive has more features (a more flexible parser) than Spark, but that doesn't mean Spark needs to accept them all.

I don't think Spark needs to make a choice on this for the v2 API. It is up to the catalog which features of a table format are useful. Because Spark supports parsing EXTERNAL, it makes no sense to arbitrarily suppress it. That would require more code in Spark specific to what is now a responsibility delegated to a catalog. It breaks the abstraction.

@cloud-fan (Contributor)

EXTERNAL was never supported for Spark-native tables, and that's intentional. At least at the time, we thought EXTERNAL was a bad feature and that Spark shouldn't have it. Currently Spark supports it only for Hive compatibility, and one option is to make the v2 API simpler by dropping EXTERNAL completely and sacrificing Hive compatibility.

@rdblue (Contributor, Author) commented Mar 31, 2020

What you're suggesting is to maintain compatibility with Hive SQL, but arbitrarily limit what a source connector can do.

How about this: users shouldn't be allowed to use the | character as the delimiter for Hive's delimited format in sequence files. Let's add special code to Spark to check for that case and fail it before passing everything to the catalog.

Hopefully you see my point: making decisions about what catalogs should support is not a reasonable thing for Spark to do. Spark should reject what is not well-formed, but should not decide details for specific sources.

And to be clear, if you want to do this in Spark's built-in Hive catalog/connector then that's fine with me. But this kind of decision should be made by the catalog, not in the parser or in logical plans.

@cloud-fan (Contributor)

You made the point that the same problem occurs for STORED AS/ROW FORMAT DELIMITED. How are we going to support them in the v2 catalog?

IIRC, the v2 catalog API was designed based on the native CREATE TABLE syntax, i.e. provider, table properties, etc. These Hive-specific syntaxes were not considered.

@rdblue (Contributor, Author) commented Mar 31, 2020

Hive syntax was considered. Anything specific to a source will be passed as table properties.

In this PR, these are passed as follows:

  • STORED AS formatName -> hive.stored-as=formatName
  • INPUTFORMAT formatClass -> hive.input-format=formatClass
  • OUTPUTFORMAT formatClass -> hive.output-format=formatClass
  • SERDE serdeClass -> hive.serde=serdeClass
  • SERDEPROPERTIES ('key'='value') -> option.key=value

Again, the only thing that Spark should be doing here is ensuring the data is well-formed and then passing it directly to the source.
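A hedged sketch of that flattening, using the property keys listed above; these case classes are simplified stand-ins, not the PR's exact SerdeInfo:

```scala
// Simplified stand-ins for illustration only.
case class FormatClasses(input: String, output: String)
case class SerdeInfo(
    storedAs: Option[String] = None,
    formatClasses: Option[FormatClasses] = None,
    serde: Option[String] = None,
    serdeProperties: Map[String, String] = Map.empty)

// Flatten parsed Hive clauses into v2 table properties.
def toTableProperties(info: SerdeInfo): Map[String, String] = {
  Map.empty[String, String] ++
    info.storedAs.map("hive.stored-as" -> _) ++
    info.formatClasses.toSeq.flatMap { f =>
      Seq("hive.input-format" -> f.input, "hive.output-format" -> f.output)
    } ++
    info.serde.map("hive.serde" -> _) ++
    info.serdeProperties.map { case (k, v) => s"option.$k" -> v }
}
```

For example, STORED AS ORC with SERDEPROPERTIES ('compression'='zlib') would flatten to hive.stored-as=ORC and option.compression=zlib.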

@rdblue force-pushed the unify-create-table branch from 12a17fc to 8bf400d on April 1, 2020 01:01
@rdblue (Contributor, Author) commented Apr 1, 2020

@cloud-fan, I've implemented the conversions to v1 plans and updated the tests like the other commands that we moved to use statements. Tests from DDLParserSuite are now in PlanResolutionSuite. I also needed to move a create test from SparkSqlParserSuite into PlanResolutionSuite for the same reason.

The create test in SparkSqlParserSuite highlighted an existing problem with spark.sql.legacy.createHiveTableByDefault.enabled. With that config disabled (defaulting to USING), Hive was still used to create a table because Hive's PARTITIONED BY syntax was used. I think this is very confusing behavior! (@HeartSaVioR probably agrees.)

Because of this problem, I think we should either roll back the patch that made USING the default in 3.0, since it is not applied everywhere, or get this syntax unification into 3.0.
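To make the confusing case concrete (a hedged reconstruction; the table name is illustrative, and the test's actual context appears in the diff below): there is no USING clause and no ROW FORMAT or STORED AS, yet the Hive-style typed partition column alone picked the Hive path.

```scala
// "PARTITIONED BY (b STRING)" declares the partition column with a type,
// which is Hive-only syntax, so this created a Hive table even with
// spark.sql.legacy.createHiveTableByDefault.enabled=false.
spark.sql("CREATE TABLE t (a STRING) PARTITIONED BY (b STRING)")
```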

.add("b", StringType)
)
)
withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "true") {
@rdblue (Contributor, Author) commented on the diff:

@cloud-fan, this is the SQL that uses Hive without this patch, because of the PARTITIONED BY clause (and the statement below that uses this config option).

@SparkQA commented Apr 1, 2020

Test build #120658 has finished for PR 28026 at commit 8bf400d.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 1, 2020

@HeartSaVioR, here's some additional info to give you context.

When we convert plans to v2, we alter the parser to produce a Statement plan that contains what was parsed, without modification. The reason is that the v1 plans put a lot of logic about what was allowed into the parser itself, which required the parser to be split across the spark-catalyst and spark-sql modules and was difficult to maintain. Now the logic about what is allowed lives in the analyzer.

Statement plans are converted to either the exact same v1 logical plan as before, or a v2 plan. To validate that we produce the exact same logical plans, we move tests from the spark-sql DDLParserSuite (which only runs the parser) into PlanResolutionSuite, which runs the parser and the conversion rules. These tests should be nearly identical to the originals -- as close to a copy & paste as possible -- so they are easy to verify. I just completed this step, which is why tests should be passing.

In addition, we add tests to spark-catalyst's DDLParserSuite that test SQL-to-Statement conversion. This is where we will also add new tests for the unified behavior, like rejecting PARTITIONED BY clauses with both expressions and column declarations. These are the last tests I need to add for this PR, and I'll work on them tomorrow.
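As a sketch of the kind of parser test this means (the test name, helpers, and error text here are assumptions, not the PR's exact code):

```scala
// Hypothetical catalyst DDLParserSuite test: mixing a typed column
// declaration with a transform expression in one PARTITIONED BY clause
// should fail at parse time.
test("create table: disallow mixed PARTITIONED BY") {
  val sql = "CREATE TABLE t (a INT) PARTITIONED BY (b STRING, bucket(4, a))"
  val e = intercept[ParseException] { parsePlan(sql) }
  assert(e.getMessage.contains("PARTITION"))
}
```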

@rdblue force-pushed the unify-create-table branch from 8bf400d to 50019e6 on April 1, 2020 01:29
@SparkQA commented Apr 1, 2020

Test build #120659 has finished for PR 28026 at commit 50019e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) commented Apr 1, 2020

The create test in SparkSqlParserSuite highlighted an existing problem with spark.sql.legacy.createHiveTableByDefault.enabled. With that config disabled (defaulting to USING), Hive was still used to create a table because Hive's PARTITIONED BY syntax was used. I think this is very confusing behavior! (@HeartSaVioR probably agrees.)

Sure, I totally agree. That's one of the issues I mentioned in the mail thread; see the link below:

https://lists.apache.org/thread.html/ra2aa31ed7cbd4e34d3504adc97cae1301cc249bfb8f95565b808b0cb%40%3Cdev.spark.apache.org%3E

It's wrong to simply assume that if certain words (ROW FORMAT, STORED AS) are present then the Hive create table path will be used, and otherwise the native create table path will be used. There are more differences in the details, including the point @rdblue made here. The issue didn't exist in Spark 2.x, where the USING clause served as a marker to distinguish the two syntaxes.
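For reference, these are the two Spark 2.x forms the marker distinguishes (tables are illustrative):

```scala
// Native syntax: USING marks the data source path.
spark.sql("CREATE TABLE t1 (a INT, b STRING) USING parquet")

// Hive syntax: no USING; Hive clauses like ROW FORMAT / STORED AS apply.
spark.sql("CREATE TABLE t2 (a INT) ROW FORMAT DELIMITED " +
  "FIELDS TERMINATED BY ',' STORED AS TEXTFILE")
```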

Because of this problem, I think we should either roll back the patch that made USING the default in 3.0, since it is not applied everywhere, or get this syntax unification into 3.0.

That's in line with my proposal as well. The ideal approach would be to revert SPARK-30098 in Spark 3.0 and make this correct with syntax unification in a later version. I've also proposed some alternatives (adding a marker, turning the legacy config on by default) but haven't heard any feedback.

@SparkQA commented Nov 24, 2020

Test build #131608 has finished for PR 28026 at commit 1558478.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2020

Test build #131611 has finished for PR 28026 at commit fd60b4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2020

Test build #131612 has finished for PR 28026 at commit a471f33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 24, 2020

Test build #131625 has finished for PR 28026 at commit a471f33.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 24, 2020

Test build #131639 has finished for PR 28026 at commit a471f33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

@rdblue, there are several mistakes in my previous PR that caused test failures; I'm fixing them in rdblue#9.

@SparkQA commented Nov 24, 2020

Test build #131683 has finished for PR 28026 at commit 25ec746.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 25, 2020

Test build #131713 has finished for PR 28026 at commit 25ec746.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Seems the Hive tests are very flaky.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 25, 2020

Test build #131736 has finished for PR 28026 at commit 25ec746.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 25, 2020

Test build #131762 has finished for PR 28026 at commit 25ec746.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@cloud-fan (Contributor)

This PR is definitely not related to SparkR. Since Jenkins is flaky now, I'm merging it first. I will keep watching the Jenkins status.

@cloud-fan changed the title [SPARK-31257][SQL] Unify create table syntax → [SPARK-31257]SPARK-33561[SQL] Unify create table syntax on Nov 25, 2020
@cloud-fan changed the title [SPARK-31257]SPARK-33561[SQL] Unify create table syntax → [SPARK-31257][SPARK-33561][SQL] Unify create table syntax on Nov 25, 2020
@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 6f68ccf on Nov 25, 2020
@rdblue (Contributor, Author) commented Nov 25, 2020

Thanks for merging this, @cloud-fan! And thanks for the PRs as well.

@dongjoon-hyun (Member)

Hi, @cloud-fan and @rdblue. It's good to have this, but it seems to break the Scala 2.13 compilation.

cc @srowen and @HyukjinKwon

@SparkQA commented Nov 25, 2020

Test build #131784 has finished for PR 28026 at commit 25ec746.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
