[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog #16487


Closed
cloud-fan wants to merge 3 commits into apache:master from cloud-fan:hive-table

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in #16296, it's now straightforward to support creating Hive tables with DataFrameWriter and Catalog.

This PR basically just removes the Hive provider check in DataFrameWriter.saveAsTable and Catalog.createExternalTable, and adds tests.

How was this patch tested?

new tests in HiveDDLSuite
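
For example, after this change both of the following work (a minimal sketch; the table names, schema, and path are illustrative):

    import org.apache.spark.sql.types.StructType
    import spark.implicits._

    // DataFrameWriter: create a Hive serde table directly.
    Seq(1 -> "a").toDF("i", "j").write.format("hive").saveAsTable("hive_tbl")

    // Catalog: create an external Hive table backed by an explicit path.
    spark.catalog.createExternalTable(
      "ext_tbl",
      "hive",
      new StructType().add("i", "int"),
      Map("path" -> "/tmp/ext_tbl", "fileFormat" -> "parquet"))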

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70983 has finished for PR 16487 at commit c83f663.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

How about the save() API?

    Seq(1 -> "a").toDF("i", "j").write.format("hive").save()

We will get the following error:

    Failed to find data source: hive. Please find packages at http://spark.apache.org/third-party-projects.html
    java.lang.ClassNotFoundException: Failed to find data source: hive. Please find packages at http://spark.apache.org/third-party-projects.html

"t",
"hive",
new StructType().add("i", "int"),
Map("path" -> dir.getCanonicalPath, "fileFormat" -> "parquet"))
Member

If path is not provided, it still works. However, based on our latest design decision, users must provide a path when creating an external table.

Contributor Author

In the design decision, we want to hide the managed/external concept from users. I'm not sure if we want to rename this API...

Member

Maybe just issue an exception when users do not provide a path? Otherwise, we have to add new APIs.
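
A hypothetical sketch of that check (the field names, exception, and message here are assumptions for illustration, not the actual implementation):

    // Hypothetical: inside DataFrameWriter.save(), fail fast when the hive
    // provider is used without an explicit path option.
    if (source.equalsIgnoreCase("hive") && !extraOptions.contains("path")) {
      throw new AnalysisException(
        "Saving data with format 'hive' requires a 'path' option.")
    }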

Contributor Author

I'll address this problem in a follow-up PR. Other data sources also have this problem, e.g. users can create an external Parquet table without a path, so this PR doesn't introduce new problems.
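
For instance, mirroring the behavior described above (a sketch; the table name and schema are illustrative):

    // This currently succeeds even though no path is given, so the
    // "external" table ends up at some default location; the same gap
    // exists for hive.
    spark.catalog.createExternalTable(
      "no_path_tbl",
      "parquet",
      new StructType().add("i", "int"),
      Map.empty[String, String])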

@@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
}
EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
// Only do the check if the table is a data source table (the relation is a BaseRelation).
// TODO(cloud-fan): also check hive table relation here when we support overwrite mode
// for creating hive tables.
Member

+1

Member

Are we also facing the same issue in the insertInto(tableIdent: TableIdentifier) API?

Contributor Author

insertInto is different: it generates an InsertIntoTable plan instead of a CreateTable plan.
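
To illustrate the difference (a sketch; df and the table name are arbitrary):

    // saveAsTable plans a CreateTable node (the table may not exist yet),
    // while insertInto plans an InsertIntoTable node against an existing table.
    df.write.format("hive").saveAsTable("t")  // -> CreateTable
    df.write.insertInto("t")                  // -> InsertIntoTable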

Member

      Seq((1, 2)).toDF("i", "j").write.format("parquet").saveAsTable(tableName)
      table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)

We capture the exception when the format is parquet. Now, when the format is hive, should we do the same thing?

Contributor Author

DataFrameWriter.insertInto will ignore the specified provider, won't it?

Member

Although we ignore the specified provider, we still respect the actual format of the table. For example, below is a Hive table, and we are not blocking it. Should we block it to make them consistent?

      sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value")
      val df = sql(s"SELECT key, value FROM $tableName")
      df.write.mode("overwrite").insertInto(tableName)

Contributor Author

We should not block it. This generates InsertIntoTable, which supports Hive tables. What we should block is saveAsTable with Overwrite mode, which generates CreateTable.

INSERT OVERWRITE is different from CREATE TABLE with overwrite mode.
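
A sketch of that distinction (the table name t is illustrative):

    // INSERT OVERWRITE keeps the table definition and replaces only the data.
    df.write.mode("overwrite").insertInto("t")

    // saveAsTable with Overwrite replaces the table definition itself,
    // which goes through the CreateTable path discussed above.
    df.write.format("hive").mode("overwrite").saveAsTable("t")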

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71110 has finished for PR 16487 at commit 6209d04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1169,26 +1169,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
}
}

test("save API - format hive") {
Contributor Author

    Dataset.ofRows(sparkSession,
      sparkSession.sessionState.catalog.lookupRelation(
        sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)))

can be simplified to:

    sparkSession.table(tableName)
Member

@gatorsmile Jan 10, 2017

+1

@gatorsmile
Member

LGTM pending test

@yhuai
Contributor

yhuai commented Jan 10, 2017

LGTM

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71112 has finished for PR 16487 at commit 27d97e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71120 has finished for PR 16487 at commit 9c48f93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

asfgit closed this in b0319c2 on Jan 10, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017, and cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017; both carry the merged change "[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog" (Author: Wenchen Fan <wenchen@databricks.com>; closes apache#16487 from cloud-fan/hive-table).