@gatorsmile
Member

@gatorsmile gatorsmile commented Jul 10, 2016

What changes were proposed in this pull request?

CreateHiveTableAsSelectLogicalPlan is a Hive-specific logical node. This is not a good design. We need to consolidate it with CreateTableUsingAsSelect to build a unified logical node CreateTableAsSelect.

The first step is to generalize the signature of CreateTableUsingAsSelect by taking a CatalogTable as the table metadata input. The logical node is renamed to CreateTableAsSelect. The new interface will look like this:

case class CreateTableAsSelect(
    tableDesc: CatalogTable,
    provider: String,
    mode: SaveMode,
    child: LogicalPlan) extends logical.UnaryNode 

The second step is to convert CreateHiveTableAsSelectLogicalPlan into CreateTableAsSelect.
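
A minimal sketch of what that conversion could look like, using simplified stand-in types rather than Spark's real classes. The `"hive"` provider string and the choice of `SaveMode.Ignore` for `IF NOT EXISTS` (with `SaveMode.ErrorIfExists` otherwise) are assumptions made for illustration, not taken from this patch.

```scala
// Stand-in types: simplified, hypothetical versions of the Spark classes
// named in this PR, so the sketch compiles without Spark on the classpath.
sealed trait SaveMode
object SaveMode {
  case object ErrorIfExists extends SaveMode
  case object Ignore extends SaveMode
}

final case class CatalogTable(
    identifier: String,
    properties: Map[String, String] = Map.empty)

final case class LogicalPlan(description: String)

// The Hive-specific node being retired.
final case class CreateHiveTableAsSelectLogicalPlan(
    tableDesc: CatalogTable,
    child: LogicalPlan,
    allowExisting: Boolean)

// The unified node proposed in this PR.
final case class CreateTableAsSelect(
    tableDesc: CatalogTable,
    provider: String,
    mode: SaveMode,
    child: LogicalPlan)

object HiveCtasConversion {
  // Assumption: the provider name for Hive tables is "hive".
  val HiveProvider = "hive"

  // Assumption: IF NOT EXISTS (allowExisting = true) becomes Ignore,
  // and its absence becomes ErrorIfExists.
  def toUnified(old: CreateHiveTableAsSelectLogicalPlan): CreateTableAsSelect =
    CreateTableAsSelect(
      tableDesc = old.tableDesc,
      provider = HiveProvider,
      mode = if (old.allowExisting) SaveMode.Ignore else SaveMode.ErrorIfExists,
      child = old.child)
}
```

The point of the sketch is that `allowExisting` is the only field without a direct counterpart, and it can be folded into `mode`.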

This PR is based on a comparison of the two interfaces. The details are described below.

Currently, the SQL interface is the only entrance to CreateHiveTableAsSelectLogicalPlan. The correspondence between the SQL interface and CreateHiveTableAsSelectLogicalPlan is described below:

case class CreateHiveTableAsSelectLogicalPlan(
    tableDesc: CatalogTable,
    child: LogicalPlan,
    allowExisting: Boolean)
    extends UnaryNode with Command

SQL:

When conf.convertCTAS == false || either [ROW FORMAT row_format] or [STORED AS file_format] is specified

  CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col1[:] data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION path]
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement];

  -->

  [TEMPORARY] is not allowed.

  allowExisting: Boolean = [IF NOT EXISTS]
  child: LogicalPlan = select_statement
  tableDesc: CatalogTable = CatalogTable(
    identifier = [db_name.]table_name,
    tableType = [EXTERNAL],
    storage = [ROW FORMAT row_format] +
              [STORED AS file_format] +
              [LOCATION path],
    schema = Seq.empty,
    partitionColumnNames = Seq.empty,
    properties = [TBLPROPERTIES (property_name=property_value, ...)],
    comment = [COMMENT table_comment])
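
The condition stated for this path can be read as a small routing rule: the Hive-specific node is used unless conf.convertCTAS is on and neither ROW FORMAT nor STORED AS is present, in which case the statement goes to CreateTableUsingAsSelect. A minimal sketch of that rule follows. The object, method, and parameter names are hypothetical and are not the parser's actual code.

```scala
object CtasRouting {
  sealed trait Target
  case object HiveCtasNode extends Target       // CreateHiveTableAsSelectLogicalPlan
  case object DataSourceCtasNode extends Target // CreateTableUsingAsSelect

  // Mirrors the condition in the description:
  // convertCTAS == false || ROW FORMAT specified || STORED AS specified
  // selects the Hive node; everything else selects the data source node.
  def choose(
      convertCTAS: Boolean,
      hasRowFormat: Boolean,
      hasStoredAs: Boolean): Target =
    if (!convertCTAS || hasRowFormat || hasStoredAs) HiveCtasNode
    else DataSourceCtasNode
}
```

With a single CreateTableAsSelect node, this choice would presumably reduce to picking the provider rather than picking the node type.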

CreateTableUsingAsSelect has three entrances. Below is the correspondence:

case class CreateTableUsingAsSelect(
    tableIdent: TableIdentifier,
    provider: String,
    partitionColumns: Array[String],
    bucketSpec: Option[BucketSpec],
    mode: SaveMode,
    options: Map[String, String],
    child: LogicalPlan) extends logical.UnaryNode

SQL Interface I:

When conf.convertCTAS == true && [ROW FORMAT row_format] and [STORED AS file_format] are not specified

  CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col1[:] data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION path]
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement];

  --> 

  tableIdent: TableIdentifier = [db_name.]table_name,
  provider: String = conf.defaultDataSourceName,
  partitionColumns: Array[String] = Seq.empty,
  bucketSpec: Option[BucketSpec] = None,
  mode: SaveMode = [IF NOT EXISTS],
  options: Map[String, String] = [LOCATION path],
  child: LogicalPlan = [AS select_statement]

SQL Interface II:

  CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col1[:] data_type [COMMENT col_comment], ...)]
  USING qualifiedName
  [OPTIONS tablePropertyList]
  [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col3, ...) (SORTED BY orderedIdentifierList)? INTO INTEGER_VALUE BUCKETS]
  [AS select_statement];

  -->

  [EXTERNAL] is not allowed.
  [TEMPORARY] is not allowed.

  tableIdent: TableIdentifier = [db_name.]table_name,
  provider: String = USING qualifiedName,
  partitionColumns: Array[String] = [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)],
  bucketSpec: Option[BucketSpec] = [CLUSTERED BY (col3, ...) (SORTED BY orderedIdentifierList)? INTO INTEGER_VALUE BUCKETS],
  mode: SaveMode = [IF NOT EXISTS],
  options: Map[String, String] = [OPTIONS tablePropertyList],
  child: LogicalPlan = [AS select_statement]

DataFrameWriter Interface:

  tableIdent: TableIdentifier = tableIdent (from saveAsTable API),
  provider: String = source (from format API),
  partitionColumns: Array[String] = partitioningColumns (from partitionBy API),
  bucketSpec: Option[BucketSpec] = getBucketSpec function (from bucketBy API and sortBy API),
  mode: SaveMode = mode (from mode API),
  options: Map[String, String] = extraOptions (from option and options API),
  child: LogicalPlan = df.logicalPlan (from DataFrameWriter)
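
For reference, a sketch of a DataFrameWriter call chain that touches each of the fields listed above. It assumes Spark 2.x on the classpath and a local session. The table name, column names, option, and sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveAsTableSketch {
  // Hypothetical table name used only in this sketch.
  val TableName = "ctas_sketch_tbl"

  // Writes a tiny DataFrame as a table and returns the row count read back.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    val df = Seq((1, "a", 2016), (2, "b", 2016), (3, "c", 2015))
      .toDF("id", "name", "year")

    df.write
      .format("parquet")               // provider
      .partitionBy("year")             // partitionColumns
      .bucketBy(4, "id")               // bucketSpec: bucket count and columns
      .sortBy("id")                    // bucketSpec: sort columns
      .mode(SaveMode.Overwrite)        // mode
      .option("compression", "snappy") // options
      .saveAsTable(TableName)          // tableIdent; df.logicalPlan is the child

    spark.table(TableName).count()
  }
}
```

Each call only records a setting on the writer; the logical node is built when saveAsTable is invoked.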

How was this patch tested?

The existing test cases cover the code refactoring.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62047 has finished for PR 14123 at commit 5bef1e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

should we wait for #14071?

@SparkQA

SparkQA commented Jul 11, 2016

Test build #62068 has finished for PR 14123 at commit 082040f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@cloud-fan Yeah! Will be in [WIP] until #14071 is merged.

@gatorsmile
Member Author

This is now part of #14482. Closing it.

@gatorsmile gatorsmile closed this Aug 4, 2016