-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-19152][SQL][followup] simplify CreateHiveTableAsSelectCommand #16693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #71933 has finished for PR 16693 at commit
|
@@ -221,6 +222,11 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) { | |||
* @since 2.0.0 | |||
*/ | |||
def start(): StreamingQuery = { | |||
if (source.toLowerCase == DDLUtils.HIVE_PROVIDER) { | |||
throw new AnalysisException("Hive data source can only be used with tables, you can not " + | |||
"read files of Hive data source directly.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not to read but write the results to Hive tables, right?
@@ -116,6 +117,11 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo | |||
* @since 2.0.0 | |||
*/ | |||
def load(): DataFrame = { | |||
if (source.toLowerCase == DDLUtils.HIVE_PROVIDER) { | |||
throw new AnalysisException("Hive data source can only be used with tables, you can not " + | |||
"write files of Hive data source directly.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to read the streaming data from Hive tables, right? I think we need to fix the error message.
@@ -89,12 +55,30 @@ case class CreateHiveTableAsSelectCommand( | |||
// Since the table already exists and the save mode is Ignore, we will just return. | |||
return Seq.empty | |||
} | |||
sparkSession.sessionState.executePlan(InsertIntoTable( | |||
metastoreRelation, Map(), query, overwrite = false, ifNotExists = false)).toRdd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh... Previously, we try to create the table even if the table still exists. A good change!
val withSchema = if (withFormat.schema.isEmpty) { | ||
tableDesc.copy(schema = query.schema) | ||
} else { | ||
withFormat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the other reviewers, this is not needed, because the schema is always empty when we need to create a table. See the assert here..
tableDesc.storage.outputFormat | ||
.orElse(Some(classOf[HiveIgnoreKeyTextOutputFormat[Text, Text]].getName)), | ||
serde = tableDesc.storage.serde.orElse(Some(classOf[LazySimpleSerDe].getName)), | ||
compressed = tableDesc.storage.compressed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, after the code refactoring, this is always ensured in the rule DetermineHiveSerde
.
LGTM except two minor comments in the error messages. |
@gatorsmile thanks for adding comments about why the cleanup is safe! |
Test build #72022 has finished for PR 16693 at commit
|
retest this please |
1 similar comment
retest this please |
ok to test |
Test build #72109 has finished for PR 16693 at commit
|
Thanks! Merging to master. |
## What changes were proposed in this pull request? After apache#16552 , `CreateHiveTableAsSelectCommand` becomes very similar to `CreateDataSourceTableAsSelectCommand`, and we can further simplify it by only creating table in the table-not-exist branch. This PR also adds hive provider checking in DataStream reader/writer, which is missed in apache#16552 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes apache#16693 from cloud-fan/minor.
What changes were proposed in this pull request?
After #16552 ,
CreateHiveTableAsSelectCommand
becomes very similar toCreateDataSourceTableAsSelectCommand
, and we can further simplify it by only creating table in the table-not-exist branch.This PR also adds hive provider checking in DataStream reader/writer, which is missed in #16552
How was this patch tested?
N/A