Skip to content

[SPARK-6322][SQL] CTAS should consider the case where no file format or storage handler is given #5014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 13, 2015

When creating CreateTableAsSelect in HiveQl, it doesn't consider the case where no file format or storage handler is given. So later in CreateTables, one of CreateTableAsSelect cases will never be run. The two CreateTableAsSelect cases are basically the same codes except for checking CreateTableDesc. This pr fixes this issue.

@SparkQA
Copy link

SparkQA commented Mar 13, 2015

Test build #28567 has finished for PR 5014 at commit d3c9f6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
if (ignores.exists(_ != None)) {
CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my bad I didn't drop a comment here.
Actually we are not using the ignores, previously we were trying the parse the storage format etc. information within HiveQl.scala, however, the logic is quite complicated and error-prone, and then we decided to reuse the Hive code for that purpose, that's why we always pass down the node (ASTNode) for further analysis. The node is required to be passed down even without storage format specified, Hive will provide a default storage format for that during the analysis. So I don't think the change here is reasonable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Actually, ignores could be None here when no file format or storage handler is given.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without the storage handler specified, Hive still need the node to get the default format.
See https://github.com/apache/spark/pull/5014/files#diff-ee66e11b56c21364760a5ed2b783f863L539

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it is None and hive.convertCTAS is true, Hive CTAS statement will become a data source table with default data source. Please refer to CreateTables in HiveMetastoreCatalog.scala.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise it will fall into the default format specified by hive.conf.defaultDataSourceName, which probably not the Hive default behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CreateTables, the node is parsed by Hive's codes and see if file format or storage handler is given. If no, and hive.convertCTAS is true, Hive CTAS statement will become a data source table with default data source, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenghao-intel Please refer to #4639, which uses Parquet as the default file format for CTAS statements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok , I see, thanks for the explanation, but what if hive.convertCTAS is false and user didn't specify the storage format in HiveQl? How can we get the default Hive storage format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In that case, the parsed CreateTableDesc will be passed to execution.CreateTableAsSelect.

@viirya
Copy link
Member Author

viirya commented Mar 18, 2015

@yhuai @chenghao-intel I updated it. Please take a look when you have time. Thanks!

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28769 has finished for PR 5014 at commit 5b45999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 20, 2015

@yhuai Any other comments?

@viirya
Copy link
Member Author

viirya commented Mar 23, 2015

/cc @yhuai @liancheng Can you take a look of this if you have time? Thanks!

@liancheng
Copy link
Contributor

Not super familiar with this part of code. But according to the context and discussion, I think this change makes sense. @yhuai Could you help confirming this?

@chenghao-intel
Copy link
Contributor

LGTM

@@ -557,7 +557,6 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
"TOK_TABLEPROPERTIES"),
children)
val (db, tableName) = extractDbNameTableName(tableNameParts)

CreateTableAsSelect(db, tableName, nodeToPlan(query), allowExisting != None, Some(node))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the only place where we create a CreateTableAsSelect? If so, can we get rid of the Option and just pass the node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it is. If we are sure that CreateTableAsSelect is only used by Hive dialect, we can remove the Option.

@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #28999 has finished for PR 5014 at commit 5b611cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -142,7 +142,7 @@ case class CreateTableAsSelect[T](
tableName: String,
child: LogicalPlan,
allowExisting: Boolean,
desc: Option[T] = None) extends UnaryNode {
desc: T) extends UnaryNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of the type parameter and rename it to CreateHiveTableAsSelect (be a little bit more specific on what this one does).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CreateTableAsSelect is designed as a common logical plan node, that's why I made the desc as T, and also the optional parameter. Otherwise, every SQL dialect will implements it's own CTAS node(logical plan).
Or is the CreateTableUsingAsSelect a more generic interface for the same purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CreateTableUsingAsSelect is just for data source API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add a specific CreateHiveTableAsSelect but still keepCreateTableAsSelect as a common logical plan for the use of other SQL dialects.

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29045 has finished for PR 5014 at commit 4389607.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CreateTableAsSelect extends UnaryNode

@liancheng
Copy link
Contributor

@yhuai Is this good to go now?

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29164 has finished for PR 5014 at commit d5fcf67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CreateTableAsSelect extends UnaryNode

@viirya
Copy link
Member Author

viirya commented Mar 26, 2015

@yhuai please take a look, thanks.

@viirya
Copy link
Member Author

viirya commented Mar 31, 2015

@yhuai Can you check if this is ready to merge?

@viirya
Copy link
Member Author

viirya commented Apr 7, 2015

@liancheng @yhuai this is quite a while for waiting merging, may you take a look?

def tableName: String
def child: LogicalPlan
def allowExisting: Boolean

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this trait here, can we just make the implementation below a Command?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we cannot use Command because a Command is a LeafNode. If we make it a Command, the child logical plan will not be analyzed and optimized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it will be good to have a different type of Command that is not a LeafNode.

@marmbrus
Copy link
Contributor

One minor comment otherwise this LGTM. ping @yhuai

@SparkQA
Copy link

SparkQA commented Apr 12, 2015

Test build #30106 has finished for PR 5014 at commit 7d82ede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CreateTableAsSelect extends UnaryCommand
    • abstract class UnaryCommand extends UnaryNode
  • This patch does not change any dependencies.

@viirya
Copy link
Member Author

viirya commented Apr 13, 2015

ping @yhuai

@SparkQA
Copy link

SparkQA commented Apr 27, 2015

Test build #31053 has started for PR 5014 at commit 7d82ede.

@viirya viirya closed this May 21, 2015
@viirya viirya deleted the fix_hive_ctas branch December 27, 2023 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants