
[SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax #28517


Closed
wants to merge 5 commits

Conversation

@HeartSaVioR (Contributor)

What changes were proposed in this pull request?

This patch effectively reverts SPARK-30098 via the changes below:

  • Removed the config
  • Removed the changes done in the parser rule
  • Removed the usage of the config in tests
    • Removed tests which depend on the config
    • Rolled back tests affected by SPARK-30098 to their pre-SPARK-30098 state
  • Reflected the change in the docs (migration doc, CREATE TABLE syntax)
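For context, the behavior being reverted can be sketched with the DDL below (table names are illustrative):

```sql
-- No USING / STORED AS clause: the resulting provider depends on the
-- rule this PR reverts.
--   With SPARK-30098:  native syntax, provider = spark.sql.sources.default (parquet)
--   After this revert: Hive syntax, stored as TEXTFILE (Spark 2.4.x behavior)
CREATE TABLE t1 (id INT, name STRING);

-- Unambiguous in both cases: state the provider explicitly.
CREATE TABLE t2 (id INT, name STRING) USING parquet;
CREATE TABLE t3 (id INT, name STRING) STORED AS ORC;
```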

Why are the changes needed?

SPARK-30098 brought confusion and frustration around the CREATE TABLE DDL, and we agreed that the change had a negative effect.

Please go through the discussion thread to see the details.

Does this PR introduce any user-facing change?

No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior go back to that of Spark 2.4.x, but I believe we don't guarantee compatibility in preview releases.

How was this patch tested?

Existing UTs.

@@ -2195,22 +2195,4 @@ class DDLParserSuite extends AnalysisTest {
parsePlan("COMMENT ON TABLE a.b.c IS 'xYz'"),
CommentOnTable(UnresolvedTable(Seq("a", "b", "c")), "xYz"))
}

@HeartSaVioR (author):

It didn't exist before SPARK-30098, and this test depends on SPARK-30098.

@@ -256,24 +256,6 @@ class DataSourceV2SQLSuite
checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty)
}

test("CreateTable: without USING clause") {
@HeartSaVioR (author):

It didn't exist before SPARK-30098, and this test depends on SPARK-30098.

@@ -681,24 +663,6 @@ class DataSourceV2SQLSuite
}
}

test("CreateTableAsSelect: without USING clause") {
@HeartSaVioR (author):

It didn't exist before SPARK-30098, and this test depends on SPARK-30098.

@@ -76,12 +75,6 @@ class DDLParserSuite extends AnalysisTest with SharedSparkSession {
}.head
}

private def withCreateTableStatement(sql: String)(prediction: CreateTableStatement => Unit)
@HeartSaVioR (author):

This is to check with DSv2, but the Hive CREATE TABLE syntax doesn't follow that path. That said, it didn't exist before SPARK-30098.

@@ -40,8 +40,7 @@ import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
@HeartSaVioR (author):

All of the changes in this test suite simply revert it to its pre-SPARK-30098 state.

@HeartSaVioR commented May 13, 2020:

Ref: https://github.com/apache/spark/blame/c1a5f94973213b1cad15388f3ef8a488424c34a7/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala

I had to remove the line assert(desc.viewDefaultDatabase.isEmpty), as desc.viewDefaultDatabase no longer exists. I hope that is expected.

@@ -42,7 +42,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
private val originalInMemoryPartitionPruning = TestHive.conf.inMemoryPartitionPruning
private val originalCrossJoinEnabled = TestHive.conf.crossJoinEnabled
private val originalSessionLocalTimeZone = TestHive.conf.sessionLocalTimeZone
private val originalCreateHiveTable = TestHive.conf.createHiveTableByDefaultEnabled
@HeartSaVioR (author):

Effectively no-op.

@@ -25,22 +25,6 @@ import org.apache.spark.sql.internal.{HiveSerDe, SQLConf}

class HiveShowCreateTableSuite extends ShowCreateTableSuite with TestHiveSingleton {

private var origCreateHiveTableConfig = false
@HeartSaVioR (author):

Effectively no-op.

@@ -87,8 +87,7 @@ class HiveSerDeSuite extends HiveComparisonTest with PlanTest with BeforeAndAfte
SQLConf.withExistingConf(TestHive.conf)(super.withSQLConf(pairs: _*)(f))

test("Test the default fileformat for Hive-serde tables") {
withSQLConf("hive.default.fileformat" -> "orc",
@HeartSaVioR (author):

Effectively no-op.

@@ -97,8 +96,7 @@ class HiveSerDeSuite extends HiveComparisonTest with PlanTest with BeforeAndAfte
assert(desc.storage.serde == Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
}

withSQLConf("hive.default.fileformat" -> "parquet",
@HeartSaVioR (author):

Effectively no-op.

@@ -2706,33 +2706,6 @@ class HiveDDLSuite
}
}

test("SPARK-30098: create table without provider should " +
@HeartSaVioR (author):

It didn't exist before SPARK-30098, and this test depends on SPARK-30098.

@HeartSaVioR (author):

These two tests depend on the config, hence I'd rather not restore them.

@HeartSaVioR

I'd like to see this applied to branch-3.0, and even the master branch, so that we don't leave a bad state in any branch.

This would require the "unified create table syntax" PR (#28026) to be rebased, but I don't expect we will merge that PR as-is given it contains +1,661 / -1,146 lines of changes, so I don't think it matters much. Once there's a consensus on the direction of #28026, we'd probably break it down into multiple PRs.

@@ -944,7 +944,7 @@ class AdaptiveQueryExecSuite
withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY.key -> "true") {
withTable("t1") {
val plan = sql("CREATE TABLE t1 AS SELECT 1 col").queryExecution.executedPlan
val plan = sql("CREATE TABLE t1 USING parquet AS SELECT 1 col").queryExecution.executedPlan
@HeartSaVioR commented May 12, 2020:

This test fails with the error message below:

org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);; 'CreateTable `t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists +- Project [1 AS col#545614]    +- OneRowRelation 

This test originally assumed that CREATE TABLE without USING follows the native (datasource) syntax, which is no longer valid. Added USING parquet explicitly.
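The failure mode can be sketched as follows (assuming a session without Hive support enabled):

```sql
-- After the revert, this resolves to a Hive CTAS and fails when Hive
-- support is not enabled:
--   AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT)
CREATE TABLE t1 AS SELECT 1 col;

-- An explicit provider keeps it a datasource table and works either way:
CREATE TABLE t1 USING parquet AS SELECT 1 col;
```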

@@ -1005,7 +1005,7 @@ class AdaptiveQueryExecSuite
}
spark.sparkContext.addSparkListener(listener)
try {
sql("CREATE TABLE t1 AS SELECT 1 col").collect()
sql("CREATE TABLE t1 USING parquet AS SELECT 1 col").collect()
@HeartSaVioR (author):

Same here.


SparkQA commented May 13, 2020

Test build #122567 has finished for PR 28517 at commit 41d88ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


HeartSaVioR commented May 13, 2020

The build failure in #122567 is expected: the two failed tests are the same as in the latest build failure from #28500.

I think #122568 won't fail those two tests, as I fixed them. Let's see.

@cloud-fan

Just a reminder: we need to add back the removed tests when we unify the CREATE TABLE syntax.


SparkQA commented May 13, 2020

Test build #122568 has finished for PR 28517 at commit 081866e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR

Also cc @dongjoon-hyun, as he tried this first (though with a different approach) and I'm borrowing the JIRA issue here. (I'd like to reopen the JIRA issue if we agree to merge this.)

@dongjoon-hyun

dongjoon-hyun commented May 13, 2020

Thank you for pinging me. Unfortunately, I thought the decision was permanent at that time and closed it, so Apache Spark JIRA policy technically doesn't allow us to reopen SPARK-31136. It's the same for all members (contributors, committers, and PMC). Please use a new JIRA issue.

@HeartSaVioR

Ah, OK. It was closed and cannot be reopened. I also wondered why there was no reopen button and assumed it required privileges; apparently that's not the case. Thanks for letting me know.

@HeartSaVioR changed the title from "[SPARK-31136][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" to "[SPARK-XXXXX][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" on May 13, 2020
@HeartSaVioR

I'll file a new issue once we decide to merge this, as the discussion is still ongoing and someone might not agree with the resolution.

@cloud-fan

can you create a new ticket? I'm fine with this PR.

@HeartSaVioR changed the title from "[SPARK-XXXXX][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" to "[SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" on May 13, 2020
@HeartSaVioR

Thanks! Filed a new issue SPARK-31707 and updated the title.

@@ -2195,22 +2195,4 @@ class DDLParserSuite extends AnalysisTest {
parsePlan("COMMENT ON TABLE a.b.c IS 'xYz'"),
CommentOnTable(UnresolvedTable(Seq("a", "b", "c")), "xYz"))
}

test("create table - without using") {
@cloud-fan:

On second thought, can we keep the tests but ignore them? We will still need these tests after the syntax unification is done. We can add a comment referencing the JIRA ticket.

@cloud-fan:

Same for the other removed tests.

@dongjoon-hyun commented May 14, 2020:

Yes, +1 for @cloud-fan's suggestion (ignoring tests instead of removing them).

@HeartSaVioR (author):

It's straightforward for the removed tests; I'll do that.

What about the tests being reverted because Hive CREATE TABLE is not integrated with DSv2?
(I'm referring to sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala.)
I guess they still need to be reverted rather than ignored.

A contributor:

+1. Maybe ignoring the tests and adding TODO comments to them is better.

@HeartSaVioR (author):

They cannot simply be changed from "test" to "ignore", as we're removing the config. I'll comment them out instead.

@dongjoon-hyun:

So, just remove the conf after revert?

@HeartSaVioR (author):

Sorry, could you please elaborate? Removing the config is what this patch proposes; it's part of the revert.

@cloud-fan:

I think he meant to remove withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "false"). We don't need the config anyway, if we unify the syntax.

@HeartSaVioR (author):

Yeah, I see. That's a valid approach and a good idea. We may then need to remove some tests which depend on the config (two tests, covering on and off).

@@ -40,7 +40,7 @@ CREATE TABLE [ IF NOT EXISTS ] table_identifier
[ AS select_statement ]
```

Note that, the clauses between the USING clause and the AS SELECT clause can come in
Note that, the clauses between the OPTIONS clause and the AS SELECT clause can come in
@dongjoon-hyun:

Shouldn't this still be USING, since the OPTIONS clause itself can also come in any order?

@HeartSaVioR (author):

"Between" is more likely inclusive; otherwise the previous statement was also wrong.

@cloud-fan commented May 15, 2020:

between the USING clause and the AS SELECT clause

USING and AS SELECT are both excluded, so this is still correct. We shouldn't change it.

@HeartSaVioR (author):

Ah, OK, my bad. I see that USING cannot come in any order, so the previous statement was correct. Sorry about that.
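The doc sentence under discussion can be illustrated with a sketch (table names and the chosen clauses are illustrative):

```sql
-- The clauses between USING and AS SELECT (OPTIONS, PARTITIONED BY,
-- COMMENT, TBLPROPERTIES, ...) may come in any order; these two
-- statements are equivalent:
CREATE TABLE t1 USING parquet PARTITIONED BY (p) COMMENT 'demo'
AS SELECT 1 AS id, 'a' AS p;

CREATE TABLE t2 USING parquet COMMENT 'demo' PARTITIONED BY (p)
AS SELECT 1 AS id, 'a' AS p;

-- USING itself has a fixed position (right after the table definition),
-- which is why the sentence is anchored on USING rather than OPTIONS.
```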


@HeartSaVioR force-pushed the revert-SPARK-30098 branch from 081866e to 84c172f on May 15, 2020 05:20

SparkQA commented May 15, 2020

Test build #122649 has finished for PR 28517 at commit 84c172f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR

retest this, please


SparkQA commented May 15, 2020

Test build #122658 has finished for PR 28517 at commit 84c172f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR

The test failure comes from HiveExternalCatalogVersionsSuite: unable to download Spark 2.3.4.

@HeartSaVioR

retest this, please


SparkQA commented May 15, 2020

Test build #5040 has started for PR 28517 at commit 84c172f.


SparkQA commented May 15, 2020

Test build #122676 has finished for PR 28517 at commit 84c172f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 15, 2020

Test build #5041 has finished for PR 28517 at commit 84c172f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 15, 2020

Test build #5039 has finished for PR 28517 at commit 84c172f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR

retest this, please


SparkQA commented May 16, 2020

Test build #122696 has finished for PR 28517 at commit 84c172f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR

retest this, please


SparkQA commented May 16, 2020

Test build #122721 has finished for PR 28517 at commit 84c172f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request May 16, 2020
### What changes were proposed in this pull request?

It's quite annoying to be blocked by flaky tests in several PRs. This PR disables them. The tests come from 3 PRs I'm recently watching:
#28526
#28463
#28517

### Why are the changes needed?

To make PR builder more stable

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28547 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request May 16, 2020
(cherry picked from commit 2012d58)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@cloud-fan closed this in d2bec5e on May 17, 2020
cloud-fan pushed a commit that referenced this pull request May 17, 2020
…der for CREATE TABLE syntax

Closes #28517 from HeartSaVioR/revert-SPARK-30098.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d2bec5e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan

thanks, merging to master/3.0!

@HeartSaVioR

Thanks all for reviewing and merging!

@HeartSaVioR deleted the revert-SPARK-30098 branch on May 17, 2020 12:52
dongjoon-hyun added a commit that referenced this pull request Apr 30, 2024
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0` while the legacy functionality will be preserved during Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged at `Apache Spark 3.0.0` activity in SPARK-30098 and reverted officially during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of Apache Spark 4.0.0 idea

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
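The Spark 4.0 default switch described above leaves a legacy escape hatch; a sketch of restoring the old behavior during the 4.x line:

```sql
-- Spark 4.0+: CREATE TABLE without USING / STORED AS creates a
-- datasource table (spark.sql.sources.default). To restore the legacy
-- Hive behavior for the session:
SET spark.sql.legacy.createHiveTableByDefault=true;
```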
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024