[SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append #16552

Closed
windpiger wants to merge 20 commits into apache:master from windpiger:saveAsTableWithHiveAppend

Conversation

windpiger
Contributor

@windpiger windpiger commented Jan 11, 2017

What changes were proposed in this pull request?

After SPARK-19107, we can now treat Hive as a data source and create Hive tables with DataFrameWriter and Catalog. However, the support is not complete; there are still some cases we do not support.

This PR implements:
DataFrameWriter.saveAsTable works with the Hive format in append mode
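For illustration, a minimal usage sketch of the behavior this enables (the table name is made up, and a SparkSession built with .enableHiveSupport() is assumed):

```scala
// Sketch only: assumes `spark` has Hive support enabled; the table name is hypothetical.
val df = spark.range(10).toDF("id")

// First call creates the Hive-serde table.
df.write.format("hive").saveAsTable("hive_append_tbl")

// With this PR, appending to the existing Hive-serde table also works
// instead of failing analysis.
df.write.format("hive").mode("append").saveAsTable("hive_append_tbl")
```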

How was this patch tested?

unit test added

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71220 has finished for PR 16552 at commit 25b39fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71239 has finished for PR 16552 at commit b463ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title [WIP][SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append Jan 16, 2017
@windpiger
Contributor Author

retest this please

@windpiger windpiger changed the title [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append [WIP][SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append Jan 16, 2017
@SparkQA

SparkQA commented Jan 16, 2017

Test build #71443 has finished for PR 16552 at commit 29e1ee2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71444 has finished for PR 16552 at commit 429a0ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title [WIP][SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append Jan 18, 2017
@SparkQA

SparkQA commented Jan 18, 2017

Test build #71599 has finished for PR 16552 at commit 21c5e3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The overall idea is to use InsertIntoTable to implement appending to a Hive table, but this approach is too hacky. We should follow the way we handle data source tables, e.g. DataFrameWriter.saveAsTable just builds a CreateTable plan, the rule AnalyzeCreateTable does some checking and normalization, and another rule turns CreateTable into CreateDataSourceTableAsSelectCommand.
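A rough sketch of that flow, assuming a Hive-specific analyzer rule does the final rewrite (constructor arguments are simplified for illustration and are not the exact Spark signatures):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.command.DDLUtils
import org.apache.spark.sql.execution.datasources.CreateTable
import org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand

// Sketch only: match the generic CreateTable plan built by
// DataFrameWriter.saveAsTable (after AnalyzeCreateTable has normalized it)
// and rewrite Hive-serde CTAS into the dedicated command, mirroring how
// data source tables become CreateDataSourceTableAsSelectCommand.
object HiveCtasSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
      CreateHiveTableAsSelectCommand(tableDesc, query, mode)
  }
}
```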

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71639 has started for PR 16552 at commit 2bf67c7.

@windpiger windpiger changed the title [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append [WIP][SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append Jan 19, 2017
@SparkQA

SparkQA commented Jan 19, 2017

Test build #71654 has finished for PR 16552 at commit 0b9dc3a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71652 has finished for PR 16552 at commit 6b8f625.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71653 has finished for PR 16552 at commit 1145e52.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title [WIP][SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hive append Jan 19, 2017
@SparkQA

SparkQA commented Jan 19, 2017

Test build #71659 has finished for PR 16552 at commit 2f542ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Check if the specified data source match the data source of the existing table.
val existingProvider = DataSource.lookupDataSource(existingTable.provider.get)
Contributor

We have HiveFileFormat, and we can make it implement DataSourceRegister so that DataSource.lookupDataSource("hive") can work.
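A minimal sketch of the registration half of this suggestion (the real HiveFileFormat also extends FileFormat, takes a FileSinkDesc constructor argument, and must be listed in the META-INF/services registry for the ServiceLoader-based lookup to find it):

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Sketch only: registering the short name "hive" is what lets
// DataSource.lookupDataSource("hive") resolve to this class.
class HiveFileFormat extends DataSourceRegister {
  override def shortName(): String = "hive"
}
```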

@@ -69,7 +68,7 @@ case class CreateHiveTableAsSelectCommand(
withFormat
}

sparkSession.sessionState.catalog.createTable(withSchema, ignoreIfExists = false)
sparkSession.sessionState.catalog.createTable(withSchema, ignoreIfExists = true)
Contributor

Looks like we don't need to build withSchema anymore; the schema will be set in AnalyzeCreateTable.

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71886 has finished for PR 16552 at commit 6c09477.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveFileFormat(fileSinkDesc: FileSinkDesc)

@gatorsmile
Member

You need to fetch the upstream and merge it into your local branch. Some changes were merged to upstream/master; although they did not introduce conflicts, they caused the compilation errors in your PR.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71888 has finished for PR 16552 at commit 98ec55a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveFileFormat(fileSinkConf: FileSinkDesc)

|USING hive
""".stripMargin)
val tempView = spark.sessionState.catalog.getTempView(tableName)
assert(tempView.isDefined, "create a temp view using hive should succeed")
Contributor

Hmmm, that's not expected. Let's add a check in CreateTempViewUsing and throw an exception for the hive provider, e.g. if (DDLUtils.isHiveTable(t)) throw ...
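A hypothetical ScalaTest-style sketch of what the check should enforce (the view name and message fragment are illustrative, not the exact wording):

```scala
import org.apache.spark.sql.AnalysisException

// Sketch only: once CreateTempViewUsing rejects the hive provider,
// CREATE TEMPORARY VIEW ... USING hive should fail analysis.
val e = intercept[AnalysisException] {
  spark.sql("CREATE TEMPORARY VIEW tmp_v USING hive")
}
assert(e.getMessage.contains("Hive data source"))
```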

"not supported yet. Please use the insertInto() API as an alternative.")
}

// Check if the specified data source match the data source of the existing table.
Contributor

why remove this line?

@@ -65,6 +65,10 @@ case class CreateTempViewUsing(
}

def run(sparkSession: SparkSession): Seq[Row] = {
if (provider.toLowerCase == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Currently Hive data source can not be created as a view")
Contributor

Hive data source can only be used with tables, you cannot use it with CREATE TEMP VIEW USING

@@ -1461,6 +1461,25 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
})
}

test("run sql directly on files - hive") {
withTable("t") {
Contributor

you don't need to create a table

withTempPath { path =>
  spark.range(100).toDF.write.parquet(path.getAbsolutePath)
  ...
  sql(s"select id from hive.`${path.getAbsolutePath}`")
}

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71918 has started for PR 16552 at commit 7bf5b50.

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71920 has started for PR 16552 at commit 7bf5b50.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71910 has finished for PR 16552 at commit f34ab6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@@ -65,6 +65,11 @@ case class CreateTempViewUsing(
}

def run(sparkSession: SparkSession): Seq[Row] = {
if (provider.toLowerCase == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Hive data source can not be used with tables," +
Contributor

can only be used

Contributor

and please add a space after ,

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71923 has finished for PR 16552 at commit 7bf5b50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71925 has finished for PR 16552 at commit 59db8e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 3c86fdd Jan 24, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

After [SPARK-19107](https://issues.apache.org/jira/browse/SPARK-19107), we can now treat Hive as a data source and create Hive tables with DataFrameWriter and Catalog. However, the support is not complete; there are still some cases we do not support.

This PR implements:
DataFrameWriter.saveAsTable works with the Hive format in append mode

## How was this patch tested?
unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16552 from windpiger/saveAsTableWithHiveAppend.
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 29, 2017
## What changes were proposed in this pull request?

After apache#16552, `CreateHiveTableAsSelectCommand` becomes very similar to `CreateDataSourceTableAsSelectCommand`, and we can further simplify it by only creating the table in the table-not-exist branch.

This PR also adds hive provider checking in the DataStream reader/writer, which was missed in apache#16552

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16693 from cloud-fan/minor.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017