Skip to content

[SPARK-19152][SQL][followup] simplify CreateHiveTableAsSelectCommand #16693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

After #16552 , CreateHiveTableAsSelectCommand becomes very similar to CreateDataSourceTableAsSelectCommand, and we can further simplify it by only creating table in the table-not-exist branch.

This PR also adds hive provider checking in DataStream reader/writer, which is missed in #16552

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

cc @gatorsmile @windpiger

@SparkQA
Copy link

SparkQA commented Jan 24, 2017

Test build #71933 has finished for PR 16693 at commit db00cf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -221,6 +222,11 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
* @since 2.0.0
*/
def start(): StreamingQuery = {
if (source.toLowerCase == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Hive data source can only be used with tables, you can not " +
"read files of Hive data source directly.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not to read but write the results to Hive tables, right?

@@ -116,6 +117,11 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
* @since 2.0.0
*/
def load(): DataFrame = {
if (source.toLowerCase == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Hive data source can only be used with tables, you can not " +
"write files of Hive data source directly.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to read the streaming data from Hive tables, right? I think we need to fix the error message.

@@ -89,12 +55,30 @@ case class CreateHiveTableAsSelectCommand(
// Since the table already exists and the save mode is Ignore, we will just return.
return Seq.empty
}
sparkSession.sessionState.executePlan(InsertIntoTable(
metastoreRelation, Map(), query, overwrite = false, ifNotExists = false)).toRdd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh... Previously, we try to create the table even if the table still exists. A good change!

val withSchema = if (withFormat.schema.isEmpty) {
tableDesc.copy(schema = query.schema)
} else {
withFormat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the other reviewers, this is not needed, because the schema is always empty when we need to create a table. See the assert here..

tableDesc.storage.outputFormat
.orElse(Some(classOf[HiveIgnoreKeyTextOutputFormat[Text, Text]].getName)),
serde = tableDesc.storage.serde.orElse(Some(classOf[LazySimpleSerDe].getName)),
compressed = tableDesc.storage.compressed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, after the code refactoring, this is always ensured in the rule DetermineHiveSerde.

@gatorsmile
Copy link
Member

LGTM except two minor comments in the error messages.

@cloud-fan
Copy link
Contributor Author

@gatorsmile thanks for adding comments about why the cleanup is safe!

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72022 has finished for PR 16693 at commit f4a9342.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

1 similar comment
@gatorsmile
Copy link
Member

retest this please

@gatorsmile
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jan 28, 2017

Test build #72109 has finished for PR 16693 at commit f4a9342.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in f7c07db Jan 29, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

After apache#16552 , `CreateHiveTableAsSelectCommand` becomes very similar to `CreateDataSourceTableAsSelectCommand`, and we can further simplify it by only creating table in the table-not-exist branch.

This PR also adds hive provider checking in DataStream reader/writer, which is missed in apache#16552

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16693 from cloud-fan/minor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants