
[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax #26736


Closed

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Dec 2, 2019

What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax, instead of `hive`, in Spark 3.0.

To help with migration, we introduce a legacy conf, `spark.sql.legacy.respectHiveDefaultProvider.enabled`, which defaults to `false`.

Why are the changes needed?

  1. Currently, the `CREATE TABLE` syntax uses the `hive` provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default`. It would be better to make them consistent.

  2. Users may get confused in some cases. For example:

CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);

In these two DDLs, users may think that `t2` should also use parquet as its provider, since Spark always advertises parquet as the default format. However, it is `hive` in this case.

On the other hand, if we omit the `USING` clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:

CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;

And these two cases together can be really confusing.

  3. Spark SQL is now very independent and popular. We do not need to be fully consistent with Hive's behavior.
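The provider-resolution rule this PR describes can be sketched as plain Python (a minimal illustration with names of our own choosing, not Spark's actual implementation):

```python
# Minimal sketch of the provider-resolution rule described in this PR.
# Illustrative only -- not Spark's actual implementation.

DEFAULT_DATASOURCE = "parquet"  # stands in for the value of spark.sql.sources.default

def resolve_provider(using_clause, legacy_hive_default=False):
    """Pick the table provider for a CREATE TABLE statement.

    using_clause: the provider named in the USING clause, or None if omitted.
    legacy_hive_default: the legacy conf restoring the pre-3.0 behavior.
    """
    if using_clause is not None:
        return using_clause        # an explicit USING clause always wins
    if legacy_hive_default:
        return "hive"              # pre-3.0 behavior: fall back to hive
    return DEFAULT_DATASOURCE      # new behavior: fall back to the default datasource

# CREATE TABLE t1 (c1 INT) USING PARQUET  -> parquet, unchanged
# CREATE TABLE t2 (c1 INT)                -> parquet after this PR, hive before
```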

Does this PR introduce any user-facing change?

Yes. Before this PR, the `CREATE TABLE` syntax used the `hive` provider; now it uses the value of `spark.sql.sources.default`.
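Users who need the old behavior could enable the legacy conf, e.g. in `spark-defaults.conf` (shown here with the name proposed in this description; per the review below it was renamed to `spark.sql.legacy.createHiveTableByDefault.enabled`):

```
spark.sql.legacy.respectHiveDefaultProvider.enabled  true
```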

How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDLSuite`.

@Ngone51
Member Author

Ngone51 commented Dec 2, 2019

cc @cloud-fan @gatorsmile

@@ -1960,6 +1960,15 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val LEGACY_RESPECT_HIVE_DEFAULT_PROVIDER_ENABLED =
buildConf("spark.sql.legacy.respectHiveDefaultProvider.enabled")
Contributor


how about spark.sql.legacy.createHiveTableByDefault.enabled

@SparkQA

SparkQA commented Dec 2, 2019

Test build #114727 has finished for PR 26736 at commit 2451d2f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@xy2953396112
Contributor

Please fix the conflicting files.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114905 has finished for PR 26736 at commit d46d6db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114911 has finished for PR 26736 at commit cae3571.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114910 has finished for PR 26736 at commit fb4a186.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114912 has finished for PR 26736 at commit fc9b910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -333,7 +333,7 @@ class DataSourceWithHiveMetastoreCatalogSuite
|SORTED BY (value)
|INTO 2 BUCKETS
|AS SELECT key, value, cast(key % 3 as string) as p FROM src
""".stripMargin)
""".stripMargin)
Member Author


nit: indent.

@@ -48,6 +49,28 @@ class DDLParserSuite extends AnalysisTest {
comparePlans(parsePlan(sql), expected, checkAnalysis = false)
}

test("SPARK-30098: create table without provider should " +
"use default data source under non-legacy mode") {
withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "false") {
Contributor


let's remove this withSQLConf to show that it's the default behavior.

try {
TestHive.setConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED, true)
withTable("t1") {
val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>)"
Contributor


shall we just add using hive?

@SparkQA

SparkQA commented Dec 6, 2019

Test build #114931 has finished for PR 26736 at commit a201307.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 6, 2019

Test build #114933 has finished for PR 26736 at commit a201307.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 58be82a Dec 6, 2019
@Ngone51
Member Author

Ngone51 commented Dec 7, 2019

thanks a lot! @cloud-fan

dongjoon-hyun added a commit that referenced this pull request Apr 30, 2024
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0` while the legacy functionality will be preserved during Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged during `Apache Spark 3.0.0` development as SPARK-30098 and officially reverted during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice, and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as part of the Apache Spark 4.0.0 effort.

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
(same commit message as above)