[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog #16487
Test build #70983 has finished for PR 16487 at commit
How about `Seq(1 -> "a").toDF("i", "j").write.format("hive").save()`? We will get the following error:
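For contrast, a minimal sketch of the two call shapes (assuming a Hive-enabled `SparkSession` named `spark`; the exact error text is elided above, so this only illustrates the distinction):

```scala
import spark.implicits._

// Table-based write: the "hive" provider has a table to create, which is
// the case this PR supports.
Seq(1 -> "a").toDF("i", "j").write.format("hive").saveAsTable("t")

// Path-based write: no table is involved, so the "hive" provider does not
// apply and this call is expected to fail with the error mentioned above.
// Seq(1 -> "a").toDF("i", "j").write.format("hive").save()
```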
"t", | ||
"hive", | ||
new StructType().add("i", "int"), | ||
Map("path" -> dir.getCanonicalPath, "fileFormat" -> "parquet")) |
If `path` is not provided, it still works. However, based on our latest design decision, users must provide `path` when creating an external table.
In the design decision, we want to hide the managed/external concept from users. I'm not sure if we want to rename this API...
Maybe just issue an exception when users do not provide a `path`? Otherwise, we have to add new APIs.
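A minimal sketch of the suggested guard (hypothetical code, not from this PR; `options` stands for the table's storage options inside Spark's create-table path):

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical fail-fast check: reject an external table definition that
// carries no explicit location.
if (!options.contains("path")) {
  throw new AnalysisException("Creating an external table requires a 'path' option.")
}
```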
I'll address this problem in a follow-up PR. Other data sources also have this problem, e.g. users can create an external parquet table without a path, so this PR doesn't introduce new problems.
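To illustrate the pre-existing gap mentioned here, a hypothetical example (table name is a placeholder):

```scala
import org.apache.spark.sql.types.StructType

// Creating an "external" parquet table with no path currently succeeds,
// which is the broader problem deferred to the follow-up PR.
spark.catalog.createExternalTable(
  "ext_no_path",
  "parquet",
  new StructType().add("i", "int"),
  Map.empty[String, String])
```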
@@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    }
    EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
      // Only do the check if the table is a data source table (the relation is a BaseRelation).
      // TODO(cloud-fan): also check hive table relation here when we support overwrite mode
      // for creating hive tables.
+1
Are we also facing the same issue in the `insertInto(tableIdent: TableIdentifier)` API?
`insertInto` is different: it generates an `InsertIntoTable` plan instead of a `CreateTable` plan.
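A short sketch of the difference (assuming a DataFrame `df` and an existing table `t`; the plan names in the comments summarize this thread):

```scala
import org.apache.spark.sql.SaveMode

// saveAsTable with Overwrite redefines the table and is planned as
// CreateTable, the case this PR does not yet allow for Hive tables.
df.write.mode(SaveMode.Overwrite).saveAsTable("t")

// insertInto keeps the table definition and is planned as InsertIntoTable,
// i.e. INSERT OVERWRITE semantics, which Hive tables already support.
df.write.mode(SaveMode.Overwrite).insertInto("t")
```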
Seq((1, 2)).toDF("i", "j").write.format("parquet").saveAsTable(tableName)
table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
We capture the exception when the format is `parquet`. Now, when the format is `hive`, should we do the same thing?
`DataFrameWriter.insertInto` will ignore the specified provider, won't it?
Although we ignore the specified provider, we still respect the actual format of the table. For example, below is a Hive table case that we are not blocking. Should we block it to make them consistent?
sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value")
val df = sql(s"SELECT key, value FROM $tableName")
df.write.mode("overwrite").insertInto(tableName)
We should not block it. This generates `InsertIntoTable`, and it supports Hive tables. What we should block is `saveAsTable` with `Overwrite` mode, which generates `CreateTable`. Insert overwrite is different from creating a table with overwrite mode.
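In SQL terms, the distinction is roughly the following (illustrative only; table names are placeholders):

```scala
// insertInto with overwrite maps to INSERT OVERWRITE: data is replaced but
// the table definition stays, and Hive tables support this.
spark.sql("INSERT OVERWRITE TABLE t SELECT * FROM src")

// saveAsTable with Overwrite instead maps to CreateTable: the table is
// redefined, roughly DROP TABLE followed by CREATE TABLE ... AS SELECT.
spark.sql("DROP TABLE IF EXISTS t")
spark.sql("CREATE TABLE t AS SELECT * FROM src")
```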
Test build #71110 has finished for PR 16487 at commit
@@ -1169,26 +1169,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
    }
  }

  test("save API - format hive") {
Dataset.ofRows(sparkSession,
  sparkSession.sessionState.catalog.lookupRelation(
    sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)))
sparkSession.table(tableName)
+1
LGTM pending test
LGTM
Test build #71112 has finished for PR 16487 at commit
Test build #71120 has finished for PR 16487 at commit
thanks for the review, merging to master!
What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in #16296, it's pretty easy to support creating hive tables with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and adds tests.

How was this patch tested?

New tests in `HiveDDLSuite`.
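Putting it together, a usage sketch of what this PR enables (table names and the path are placeholders; `spark` is assumed to be a Hive-enabled session):

```scala
import org.apache.spark.sql.types.StructType
import spark.implicits._

// saveAsTable with the "hive" provider now creates a Hive-format table.
Seq(1 -> "a").toDF("i", "j").write.format("hive").saveAsTable("hive_t")

// createExternalTable with the "hive" provider is likewise allowed;
// "fileFormat" selects the underlying Hive file format.
spark.catalog.createExternalTable(
  "hive_ext",
  "hive",
  new StructType().add("i", "int"),
  Map("path" -> "/tmp/hive_ext", "fileFormat" -> "parquet"))
```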