[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog #16487


Closed
cloud-fan wants to merge 3 commits into apache:master from cloud-fan:hive-table

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in #16296, it's now straightforward to support creating Hive tables with DataFrameWriter and Catalog.

This PR basically just removes the Hive provider check in DataFrameWriter.saveAsTable and Catalog.createExternalTable, and adds tests.

How was this patch tested?

new tests in HiveDDLSuite
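
For example, after this change both of the following work (a minimal sketch; the table names, schema, and path are illustrative):

    import org.apache.spark.sql.types.StructType
    import spark.implicits._

    // DataFrameWriter: create a Hive serde table directly.
    Seq(1 -> "a").toDF("i", "j").write.format("hive").saveAsTable("hive_tbl")

    // Catalog: create an external Hive table backed by an explicit path.
    spark.catalog.createExternalTable(
      "ext_tbl",
      "hive",
      new StructType().add("i", "int"),
      Map("path" -> "/tmp/ext_tbl", "fileFormat" -> "parquet"))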

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70983 has finished for PR 16487 at commit c83f663.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

How about the save() API?

    Seq(1 -> "a").toDF("i", "j").write.format("hive").save()

We will get the following error:

    Failed to find data source: hive. Please find packages at http://spark.apache.org/third-party-projects.html
    java.lang.ClassNotFoundException: Failed to find data source: hive. Please find packages at http://spark.apache.org/third-party-projects.html

"t",
"hive",
new StructType().add("i", "int"),
Map("path" -> dir.getCanonicalPath, "fileFormat" -> "parquet"))
Member

If path is not provided, it still works. However, based on our latest design decision, users must provide a path when creating an external table.

Contributor Author

In the design decision, we want to hide the managed/external concept from users. I'm not sure if we want to rename this API...

Member

Maybe just issue an exception when users do not provide a path? Otherwise, we have to add new APIs.
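
A hypothetical sketch of that check (the field names, exception, and message here are assumptions for illustration, not the actual implementation):

    // Hypothetical: inside DataFrameWriter.save(), fail fast when the hive
    // provider is used without an explicit path option.
    if (source.equalsIgnoreCase("hive") && !extraOptions.contains("path")) {
      throw new AnalysisException(
        "Saving data with format 'hive' requires a 'path' option.")
    }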

Contributor Author

I'll address this problem in a follow-up PR. Other data sources also have this problem, e.g. users can create an external Parquet table without a path, so this PR doesn't introduce new problems.
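
For instance, mirroring the behavior described above (a sketch; the table name and schema are illustrative):

    // This currently succeeds even though no path is given, so the
    // "external" table ends up at some default location; the same gap
    // exists for hive.
    spark.catalog.createExternalTable(
      "no_path_tbl",
      "parquet",
      new StructType().add("i", "int"),
      Map.empty[String, String])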

@@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
}
EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
// Only do the check if the table is a data source table (the relation is a BaseRelation).
// TODO(cloud-fan): also check hive table relation here when we support overwrite mode
// for creating hive tables.
Member

+1

Member

Are we also facing the same issue in the insertInto(tableIdent: TableIdentifier) API?

Contributor Author

insertInto is different: it generates an InsertIntoTable plan instead of a CreateTable plan.
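
To illustrate the difference (a sketch; df and the table name are arbitrary):

    // saveAsTable plans a CreateTable node (the table may not exist yet),
    // while insertInto plans an InsertIntoTable node against an existing table.
    df.write.format("hive").saveAsTable("t")  // -> CreateTable
    df.write.insertInto("t")                  // -> InsertIntoTable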

Member

      Seq((1, 2)).toDF("i", "j").write.format("parquet").saveAsTable(tableName)
      table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)

We capture the exception when the format is parquet. Now, when the format is hive, should we do the same thing?

Contributor Author

DataFrameWriter.insertInto will ignore the specified provider, won't it?

Member

Although we ignore the specified provider, we still respect the actual format of the table. For example, below is a Hive table, and we are not blocking it. Should we block it to make them consistent?

      sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value")
      val df = sql(s"SELECT key, value FROM $tableName")
      df.write.mode("overwrite").insertInto(tableName)

Contributor Author

We should not block it. This generates InsertIntoTable, which supports Hive tables. What we should block is saveAsTable with Overwrite mode, which generates CreateTable.

INSERT OVERWRITE is different from CREATE TABLE with overwrite mode.
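
A sketch of that distinction (the table name t is illustrative):

    // INSERT OVERWRITE keeps the table definition and replaces only the data.
    df.write.mode("overwrite").insertInto("t")

    // saveAsTable with Overwrite replaces the table definition itself,
    // which goes through the CreateTable path discussed above.
    df.write.format("hive").mode("overwrite").saveAsTable("t")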

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71110 has finished for PR 16487 at commit 6209d04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1169,26 +1169,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
}
}

test("save API - format hive") {
Contributor Author

    Dataset.ofRows(sparkSession,
      sparkSession.sessionState.catalog.lookupRelation(
        sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)))

can be simplified to:

    sparkSession.table(tableName)
Member

@gatorsmile Jan 10, 2017

+1

@gatorsmile
Member

LGTM pending test

@yhuai
Contributor

yhuai commented Jan 10, 2017

LGTM

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71112 has finished for PR 16487 at commit 27d97e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71120 has finished for PR 16487 at commit 9c48f93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

asfgit closed this in b0319c2 on Jan 10, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017, and cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017; both carry the merged change "[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog" (Author: Wenchen Fan <wenchen@databricks.com>; closes apache#16487 from cloud-fan/hive-table).