[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax #26736
Conversation
```diff
@@ -1960,6 +1960,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_RESPECT_HIVE_DEFAULT_PROVIDER_ENABLED =
+    buildConf("spark.sql.legacy.respectHiveDefaultProvider.enabled")
```
how about `spark.sql.legacy.createHiveTableByDefault.enabled`?
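Filled out, the new conf in the hunk above might look like the following in `object SQLConf` (a sketch: the doc string is an assumption, and the review above suggests renaming the conf):

```scala
// Sketch of the legacy conf added by this PR. Name per the diff above;
// the doc text below is an assumption, not the PR's exact wording.
val LEGACY_RESPECT_HIVE_DEFAULT_PROVIDER_ENABLED =
  buildConf("spark.sql.legacy.respectHiveDefaultProvider.enabled")
    .internal()
    .doc("When true, CREATE TABLE without a USING clause creates Hive serde " +
      "tables as before, instead of tables using the default datasource provider.")
    .booleanConf
    .createWithDefault(false)
```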
Test build #114727 has finished for PR 26736 at commit
fix Conflicting files.
Test build #114905 has finished for PR 26736 at commit
Test build #114911 has finished for PR 26736 at commit
Test build #114910 has finished for PR 26736 at commit
Test build #114912 has finished for PR 26736 at commit
```diff
@@ -333,7 +333,7 @@ class DataSourceWithHiveMetastoreCatalogSuite
         |SORTED BY (value)
         |INTO 2 BUCKETS
         |AS SELECT key, value, cast(key % 3 as string) as p FROM src
       """.stripMargin)
     """.stripMargin)
```
nit: indent.
```diff
@@ -48,6 +49,28 @@ class DDLParserSuite extends AnalysisTest {
     comparePlans(parsePlan(sql), expected, checkAnalysis = false)
   }

+  test("SPARK-30098: create table without provider should " +
+    "use default data source under non-legacy mode") {
+    withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "false") {
```
let's remove this `withSQLConf` to show that it's the default behavior.
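With the `withSQLConf` wrapper dropped as suggested, the test might read like this (a sketch only: `expectedNonHiveCreateTable` is a hypothetical name for the expected plan, and the assertion shape mirrors the `comparePlans` usage shown above rather than the PR's exact code):

```scala
// Sketch: exercise the default behavior directly, with no conf override.
test("SPARK-30098: create table without provider should " +
  "use default data source under non-legacy mode") {
  val plan = parsePlan("CREATE TABLE my_tab(a INT, b STRING)")
  // Expect a datasource (non-hive) CREATE TABLE node; construction of the
  // expected plan is omitted here and the name below is hypothetical.
  comparePlans(plan, expectedNonHiveCreateTable, checkAnalysis = false)
}
```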
```scala
try {
  TestHive.setConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED, true)
  withTable("t1") {
    val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>)"
```
shall we just add `USING hive`?
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSerDeSuite.scala
Test build #114931 has finished for PR 26736 at commit
retest this please
Test build #114933 has finished for PR 26736 at commit
thanks, merging to master!
thanks a lot! @cloud-fan
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
… `false` by default

### What changes were proposed in this pull request?
This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0`, while the legacy functionality will be preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?
Historically, this behavior change was merged as `Apache Spark 3.0.0` activity in SPARK-30098 and reverted officially during the `3.0.0 RC` period.
- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice, and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of the Apache Spark 4.0.0 idea.

### Does this PR introduce _any_ user-facing change?
Yes, the migration document is updated.

### How was this patch tested?
Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax, instead of `hive`, in Spark 3.0. To help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

Why are the changes needed?
Currently, the `CREATE TABLE` syntax uses the hive provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

Users may get confused in some cases. For example, given two DDLs where the first declares `USING parquet` and the second omits the USING clause, users may think the second table should also use parquet as the default provider, since Spark always advertises parquet as the default format; however, it is hive in this case. On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`. These two cases together can be really confusing.
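The two confusing cases described above might look like this (a reconstruction with assumed table names, not the PR's exact snippet):

```scala
// Case 1: explicit vs. omitted USING clause (pre-PR behavior).
spark.sql("CREATE TABLE t1 (i INT) USING parquet")  // parquet table
spark.sql("CREATE TABLE t2 (i INT)")                // hive serde table, surprisingly

// Case 2: CTAS without USING. With spark.sql.hive.convertCTAS=true this
// produces a parquet table, unlike the plain CREATE TABLE above.
spark.sql("CREATE TABLE t3 AS SELECT 1 AS i")
```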
Does this PR introduce any user-facing change?
Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider; now it uses the value of `spark.sql.sources.default` as its provider.

How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDLSuite`.
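A `HiveDDLSuite`-style check of the new behavior might be sketched as follows (an illustration only; the assertion details are assumptions, not the PR's actual test code):

```scala
// Sketch: verify that CREATE TABLE without USING records the default
// datasource provider in the catalog instead of the hive serde.
withTable("t") {
  spark.sql("CREATE TABLE t(i INT)")
  val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
  // defaultDataSourceName is spark.sql.sources.default, e.g. "parquet".
  assert(table.provider === Some(spark.sessionState.conf.defaultDataSourceName))
}
```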