[SPARK-6322][SQL] CTAS should consider the case where no file format or storage handler is given #5014
Conversation
Test build #28567 has finished for PR 5014 at commit
CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
if (ignores.exists(_ != None)) {
  CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
That's my bad, I didn't drop a comment here. Actually we are not using `ignores`; previously we were trying to parse the storage format etc. information within HiveQl.scala, but the logic is quite complicated and error-prone, so we decided to reuse the Hive code for that purpose. That's why we always pass down the `node` (ASTNode) for further analysis. The `node` is required to be passed down even when no storage format is specified; Hive will provide a default storage format during the analysis. So I don't think the change here is reasonable.
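To make that concrete, here is a minimal sketch of the shape being described, using stand-in types rather than the actual Spark/Hive classes: the parser always attaches the raw parse node to the CTAS plan, and no storage information is interpreted at parse time.

```scala
// Minimal sketch with stand-in types (not the real HiveQl / Catalyst classes).
// The raw parse node is always carried along, even when the statement has no
// storage clauses, so that Hive's own analysis can supply the defaults later.
sealed trait AstNodeSketch                              // stand-in for Hive's ASTNode
case class TokenSketch(name: String, children: Seq[AstNodeSketch]) extends AstNodeSketch

case class QuerySketch(sql: String)                     // stand-in for the SELECT plan

case class CtasSketch(
    database: Option[String],
    tableName: String,
    query: QuerySketch,
    allowExisting: Boolean,
    node: AstNodeSketch)                                // always carried, even with no storage clauses

def nodeToCtas(node: TokenSketch, db: Option[String], table: String,
               query: QuerySketch, ifNotExists: Boolean): CtasSketch =
  // No attempt to decode ROW FORMAT / STORED AS / STORED BY here; Hive does that later.
  CtasSketch(db, table, query, ifNotExists, node)
```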
No. Actually, `ignores` could be `None` here when no file format or storage handler is given.
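For illustration only (hypothetical names, not the actual parser code), the `ignores` check in the diff above amounts to asking whether any storage-related clause was present:

```scala
// Hypothetical sketch: each storage-related clause is captured as an Option
// that is defined only when the corresponding token appeared in the statement.
val rowFormat: Option[String]      = None   // e.g. a ROW FORMAT clause
val fileFormat: Option[String]     = None   // e.g. a STORED AS clause
val storageHandler: Option[String] = None   // e.g. a STORED BY clause

val ignores: Seq[Option[String]] = Seq(rowFormat, fileFormat, storageHandler)

// True only when at least one storage clause was given explicitly;
// for a bare "CREATE TABLE t AS SELECT ..." all three stay None.
val hasExplicitStorage: Boolean = ignores.exists(_.isDefined)
```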
Even without a storage handler specified, Hive still needs the `node` to get the default format. See https://github.com/apache/spark/pull/5014/files#diff-ee66e11b56c21364760a5ed2b783f863L539
When it is `None` and `hive.convertCTAS` is true, the Hive CTAS statement will become a data source table with the default data source. Please refer to `CreateTables` in HiveMetastoreCatalog.scala.
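Sketched very roughly (hypothetical helper names, not the actual `CreateTables` rule), the branch being described is:

```scala
// Hypothetical simplification of the decision path (not the real HiveMetastoreCatalog code).
case class CtasRequest(hasStorageSpec: Boolean)   // did the statement give a format / handler?

def planCtas(req: CtasRequest, convertCTAS: Boolean, defaultDataSource: String): String =
  if (!req.hasStorageSpec && convertCTAS)
    // No explicit storage and hive.convertCTAS=true: use the default data source.
    s"data source table using $defaultDataSource"
  else
    // Otherwise keep Hive semantics; Hive fills in its own default storage format.
    "Hive CTAS via execution.CreateTableAsSelect"

// Example: with Parquet as the default source (see the referenced PR #4639).
println(planCtas(CtasRequest(hasStorageSpec = false), convertCTAS = true, "parquet"))
```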
Otherwise it will fall into the default format specified by `hive.conf.defaultDataSourceName`, which is probably not the default Hive behavior.
In `CreateTables`, the `node` is parsed by Hive's code to see if a file format or storage handler is given. If not, and `hive.convertCTAS` is true, the Hive CTAS statement will become a data source table with the default data source, too.
@chenghao-intel Please refer to #4639, which uses Parquet as the default file format for CTAS statements.
Oh, ok, I see, thanks for the explanation. But what if `hive.convertCTAS` is false and the user didn't specify the storage format in `HiveQl`? How can we get the default Hive storage format?
Yes. In that case, the parsed `CreateTableDesc` will be passed to `execution.CreateTableAsSelect`.
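A rough sketch of that fallback path, with a stand-in for Hive's `CreateTableDesc` (the default format class names below are assumptions based on Hive's usual textfile defaults):

```scala
// Stand-in for Hive's CreateTableDesc; Hive's semantic analysis fills in default
// input/output formats when the CTAS statement did not specify any.
case class TableDescSketch(inputFormat: Option[String], outputFormat: Option[String])

def withHiveDefaults(desc: TableDescSketch): TableDescSketch =
  desc.copy(
    inputFormat = desc.inputFormat.orElse(
      Some("org.apache.hadoop.mapred.TextInputFormat")),                     // assumed default
    outputFormat = desc.outputFormat.orElse(
      Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")))   // assumed default

// With hive.convertCTAS=false and no storage clause, something like this desc
// (defaults applied) is what the Hive-side CTAS execution receives.
println(withHiveDefaults(TableDescSketch(None, None)))
```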
@yhuai @chenghao-intel I updated it. Please take a look when you have time. Thanks!
Test build #28769 has finished for PR 5014 at commit
@yhuai Any other comments?
/cc @yhuai @liancheng Can you take a look at this if you have time? Thanks!
Not super familiar with this part of the code, but according to the context and discussion, I think this change makes sense. @yhuai Could you help confirm this?
LGTM
@@ -557,7 +557,6 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
      "TOK_TABLEPROPERTIES"),
    children)
  val (db, tableName) = extractDbNameTableName(tableNameParts)

  CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
Is it the only place where we create a `CreateTableAsSelect`? If so, can we get rid of the `Option` and just pass the `node`?
Currently, it is. If we are sure that `CreateTableAsSelect` is only used by the Hive dialect, we can remove the `Option`.
Test build #28999 has finished for PR 5014 at commit
@@ -142,7 +142,7 @@ case class CreateTableAsSelect[T](
     tableName: String,
     child: LogicalPlan,
     allowExisting: Boolean,
-    desc: Option[T] = None) extends UnaryNode {
+    desc: T) extends UnaryNode {
Let's get rid of the type parameter and rename it to `CreateHiveTableAsSelect` (be a little bit more specific on what this one does).
The `CreateTableAsSelect` is designed as a common logical plan node; that's why I made `desc` a `T`, and also an optional parameter. Otherwise, every SQL dialect would have to implement its own CTAS node (logical plan). Or is `CreateTableUsingAsSelect` a more generic interface for the same purpose?
I think `CreateTableUsingAsSelect` is just for the data source API?
I added a specific `CreateHiveTableAsSelect` but still keep `CreateTableAsSelect` as a common logical plan for the use of other SQL dialects.
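A minimal sketch of that split, with simplified stand-in types rather than the real Catalyst classes: a generic CTAS logical plan any dialect can use, plus a Hive-specific variant carrying the extra Hive table description.

```scala
// Simplified stand-ins, not the actual Catalyst classes.
trait PlanSketch { def children: Seq[PlanSketch] }
case class SelectSketch(sql: String) extends PlanSketch { def children: Seq[PlanSketch] = Nil }

// Generic CTAS shape shared by dialects.
trait CtasLike extends PlanSketch {
  def tableName: String
  def child: PlanSketch
  def allowExisting: Boolean
  def children: Seq[PlanSketch] = child :: Nil
}

// Hive-specific variant that also carries the Hive table description / parse node.
case class HiveCtasSketch(
    tableName: String,
    child: PlanSketch,
    allowExisting: Boolean,
    desc: String) extends CtasLike
```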
…f CreateTableAsSelect.
Test build #29045 has finished for PR 5014 at commit
@yhuai Is this good to go now?
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
Test build #29164 has finished for PR 5014 at commit
@yhuai please take a look, thanks.
@yhuai Can you check if this is ready to merge?
@liancheng @yhuai this has been waiting to be merged for quite a while, could you take a look?
def tableName: String
def child: LogicalPlan
def allowExisting: Boolean
Instead of having this trait here, can we just make the implementation below a `Command`?
I guess we cannot use `Command` because a `Command` is a `LeafNode`. If we make it a `Command`, the child logical plan will not be analyzed and optimized.
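The `LeafNode` point can be seen with a tiny sketch (stand-in types, not the real Catalyst classes): tree-based rules only recurse into `children`, so a leaf-style command keeps the SELECT query out of analysis and optimization.

```scala
// Stand-in types, not the real Catalyst classes.
trait NodeSketch { def children: Seq[NodeSketch] }
case class QueryNode(sql: String) extends NodeSketch { def children: Seq[NodeSketch] = Nil }

// Command modeled as a leaf: the query is only a field, so rules that walk
// `children` (analysis, optimization) never visit it.
case class LeafCtasCommand(query: NodeSketch) extends NodeSketch {
  def children: Seq[NodeSketch] = Nil
}

// Unary node: the query is exposed as a child, so analyzer/optimizer rules reach it.
case class UnaryCtasNode(query: NodeSketch) extends NodeSketch {
  def children: Seq[NodeSketch] = query :: Nil
}
```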
I do think it will be good to have a different type of `Command` that is not a `LeafNode`.
One minor comment, otherwise this LGTM. ping @yhuai
Test build #30106 has finished for PR 5014 at commit
ping @yhuai
Test build #31053 has started for PR 5014 at commit |
When creating `CreateTableAsSelect` in `HiveQl`, it doesn't consider the case where no file format or storage handler is given. So later in `CreateTables`, one of the `CreateTableAsSelect` cases will never be run. The two `CreateTableAsSelect` cases are basically the same code except for the check on `CreateTableDesc`. This PR fixes that issue.
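To illustrate the kind of dead branch being fixed (a hypothetical simplification, not the actual `CreateTables` rule): if the parser always supplies `Some(node)`, a case written for `None` can never match.

```scala
// Hypothetical simplification of the problem described above (not the real rule).
def convertCtas(table: String, desc: Option[String]): String = desc match {
  case Some(d) => s"convert $table after checking desc $d"   // always taken today
  case None    => s"convert $table without a desc"           // unreachable if desc is always Some
}

// With the parser always passing Some(node), only the first case ever runs.
println(convertCtas("t", Some("parsed CreateTableDesc")))
```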