Skip to content

[SPARK-25006][SQL] Add CatalogTableIdentifier. #21978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Aug 2, 2018

What changes were proposed in this pull request?

This adds CatalogTableIdentifier, which is an identifier that consists of a triple: catalog, database, and table. Catalog and database are optional.

The existing TableIdentifier class extends CatalogTableIdentifier and is guarateed to have no defined catalog component. Classes that expect a TableIdentifier will continue to use TableIdentifier to ensure that catalogs are not leaked into code paths that do not support them.

This adds a parser rule, catalogTableIdentifier, that can parse identifiers with a catalog. An identifier with only two components will match database and table, leaving the catalog undefined. Only identifiers with three components will have a defined catalog. In addition, rules must be re-written to support catalogTableIdentifier. Existing rules will continue to use tableIdentifier with no catalog.

How was this patch tested?

Existing tests. This should not change any behavior.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 2, 2018

@gatorsmile and @cloud-fan, this adds catalog to TableIdentifier in preparation for multi-catalog support. TableIdentifier continues to work as-is to ensure that there are no behavior changes in code paths that do not have catalog support. I've updated UnresolvedRelation to demonstrate how migration to CatalogTableIdentifier will work.

@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-25006-add-catalog-to-table-identifier branch from d61e0f1 to c0683a8 Compare August 2, 2018 21:51
@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-25006-add-catalog-to-table-identifier branch 2 times, most recently from 7abc8b8 to f34e23b Compare August 2, 2018 22:25
@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-25006-add-catalog-to-table-identifier branch from f34e23b to 6fe2d07 Compare August 2, 2018 23:00
@SparkQA

This comment has been minimized.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 3, 2018

Retest this please.

@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-25006-add-catalog-to-table-identifier branch from 6fe2d07 to 00295ee Compare August 4, 2018 00:06
@rdblue
Copy link
Contributor Author

rdblue commented Aug 4, 2018

FYI @jzhuge

@SparkQA

This comment has been minimized.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 5, 2018

Retest this please.

@SparkQA

This comment has been minimized.

@cloud-fan
Copy link
Contributor

I'd like to wait for #17185

#17185 allows users to do db1.table1.col1, and we can later extend it to catalog1.db1.table1.col1.

We should also update the column resolution logic to consider catalog name.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 7, 2018

@cloud-fan, that's fine with me since #17185 is already merged. Would this conflict with #17185? We can just add a case that detects whether the first identifier in the seq is a catalog when updating expressions.

This PR is just the start for adding catalog to table identifiers. None of the SQL statements are modified in this PR on purpose: code paths will need to be updated to support CatalogTableIdentifier. This introduces the class so we can use type safety to ensure that CatalogTableIdentifier doens't leak into the code paths that don't support it.

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94375 has finished for PR 21978 at commit 00295ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRelation(table: CatalogTableIdentifier) extends LeafNode
  • sealed trait IdentifierWithOptionalDatabaseAndCatalog
  • case class CatalogTableIdentifier(table: String, database: Option[String], catalog: Option[String])
  • class TableIdentifier(table: String, db: Option[String])

@rdblue
Copy link
Contributor Author

rdblue commented Aug 8, 2018

@cloud-fan, when do you think we can get this in? It doesn't need to go in 2.4 because it doesn't change any read or write paths -- nothing uses CatalogTableIdentifier yet -- but it would be great to get it into master so we can start building paths that do support CatalogTableIdentifier.

@rdblue rdblue changed the title SPARK-25006: Add CatalogTableIdentifier. [SPARK-25006][SQL] Add CatalogTableIdentifier. Aug 30, 2018
@mccheah
Copy link
Contributor

mccheah commented Nov 22, 2018

Wanted to follow up here - are we planning on merging this or are there more things we need to discuss?

Copy link
Contributor

@mccheah mccheah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, just some minor comments.

@@ -18,48 +18,106 @@
package org.apache.spark.sql.catalyst

/**
* An identifier that optionally specifies a database.
* An identifier that optionally specifies a database and catalog.
*
* Format (unquoted): "name" or "db.name"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update formats in these scaladocs.

* unquotedString as the function name.
* Identifies a table in a database and catalog.
* If `database` is not defined, the current catalog's default database is used.
* If `catalog` is not defined, the current catalog is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"current" meaning "global"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we want to move away from a special global catalog. I think that Spark should have a current catalog, like a current database, which is used to resolve references that don't have an explicit catalog. That would have a default, just like the current database has a default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. When we add the logical side of leveraging catalogs we can revisit the API of how to set the current catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This introduces the ability to expose a catalog to Spark. It doesn't actually add any user-facing operations.

This adds CatalogTableIdentifier, which is an identifier that consists
of a triple: catalog, database, and table. Catalog and database are
optional.

The existing TableIdentifier class extends CatalogTableIdentifier and is
guarateed to have no defined catalog component. Classes that expect a
TableIdentifier should continue to use TableIdentifier to ensure that
catalogs are not leaked into code paths that do not support them.
@rdblue rdblue force-pushed the SPARK-25006-add-catalog-to-table-identifier branch from 00295ee to beebccf Compare November 29, 2018 23:13
@rdblue
Copy link
Contributor Author

rdblue commented Nov 29, 2018

Rebased on master.

@SparkQA
Copy link

SparkQA commented Nov 29, 2018

Test build #99480 has finished for PR 21978 at commit beebccf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRelation(table: CatalogTableIdentifier) extends LeafNode
  • sealed trait IdentifierWithOptionalDatabaseAndCatalog
  • case class CatalogTableIdentifier(table: String, database: Option[String], catalog: Option[String])
  • class TableIdentifier(table: String, db: Option[String])

val identifier: String

def database: Option[String]

def catalog: Option[String]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default to None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an abstract method definition and catalog is always implemented by a val.

@mccheah
Copy link
Contributor

mccheah commented Feb 28, 2019

Think there's a failing build, also do we still need this or have the underlying ideas changed in our discussion? My understanding is that we still need this and that catalog identifiers are important to start with to build the follow up table catalog APIs on.

Also should this include multi-part identifier?

@jzhuge
Copy link
Member

jzhuge commented Feb 28, 2019 via email

@rdblue
Copy link
Contributor Author

rdblue commented Mar 29, 2019

Identifiers for multi-catalog support were added in #23848. I'm closing this.

@rdblue rdblue closed this Mar 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants