[SPARK-25006][SQL] Add CatalogTableIdentifier. #21978

rdblue · 2018-08-02T21:38:16Z

What changes were proposed in this pull request?

This adds CatalogTableIdentifier, which is an identifier that consists of a triple: catalog, database, and table. Catalog and database are optional.

The existing TableIdentifier class extends CatalogTableIdentifier and is guaranteed to have no defined catalog component. Classes that expect a TableIdentifier will continue to use TableIdentifier to ensure that catalogs are not leaked into code paths that do not support them.
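The relationship described above can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: the class names and constructor shapes follow the class list reported for this patch, while the subclass parameter name `name`, the body of `unquotedString`, and the `LegacyLookup` object are assumptions made for the example.

```scala
// Simplified sketch; the real classes carry more members than shown here.
case class CatalogTableIdentifier(
    table: String,
    database: Option[String],
    catalog: Option[String]) {
  // Assumed rendering: the defined components joined with dots.
  def unquotedString: String =
    (catalog.toSeq ++ database.toSeq ++ Seq(table)).mkString(".")
}

// Always passes None for the catalog, so a TableIdentifier can never carry one.
class TableIdentifier(name: String, db: Option[String])
  extends CatalogTableIdentifier(name, db, None)

// Hypothetical legacy code path: the parameter type keeps catalogs out.
object LegacyLookup {
  def describe(ident: TableIdentifier): String = ident.unquotedString
}
```

Because `LegacyLookup.describe` only accepts a `TableIdentifier`, passing a `CatalogTableIdentifier` with a defined catalog is a compile error rather than a runtime surprise.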

This adds a parser rule, catalogTableIdentifier, that can parse identifiers with a catalog. An identifier with only two components will match database and table, leaving the catalog undefined. Only identifiers with three components will have a defined catalog. A rule must be rewritten to use catalogTableIdentifier before it can accept a catalog; rules that have not been migrated will continue to use tableIdentifier with no catalog.
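The component mapping described above could look something like the following. This models only the outcome of the rule on an already-split sequence of name parts; it is not the ANTLR grammar from the PR, and the `CatalogTableIdentifierRule` object and `fromParts` function are hypothetical names.

```scala
object CatalogTableIdentifierRule {
  // Stand-in for the parsed result: (catalog, database, table).
  type Parsed = (Option[String], Option[String], String)

  // Maps one, two, or three name parts onto the triple.
  def fromParts(parts: Seq[String]): Parsed = parts match {
    case Seq(table) => (None, None, table)
    case Seq(db, table) => (None, Some(db), table)
    case Seq(catalog, db, table) => (Some(catalog), Some(db), table)
    case _ =>
      throw new IllegalArgumentException(
        s"Expected 1 to 3 name parts, got ${parts.length}")
  }
}
```

Under this sketch, `Seq("db", "t")` leaves the catalog undefined, and only a three-part name such as `Seq("cat", "db", "t")` yields a defined catalog.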

How was this patch tested?

Existing tests. This should not change any behavior.

rdblue · 2018-08-02T21:39:47Z

@gatorsmile and @cloud-fan, this adds catalog to TableIdentifier in preparation for multi-catalog support. TableIdentifier continues to work as-is to ensure that there are no behavior changes in code paths that do not have catalog support. I've updated UnresolvedRelation to demonstrate how migration to CatalogTableIdentifier will work.

rdblue · 2018-08-03T18:16:06Z

Retest this please.

rdblue · 2018-08-04T00:06:33Z

FYI @jzhuge

rdblue · 2018-08-05T00:30:20Z

Retest this please.

cloud-fan · 2018-08-07T02:05:01Z

I'd like to wait for #17185

#17185 allows users to do db1.table1.col1, and we can later extend it to catalog1.db1.table1.col1.

We should also update the column resolution logic to consider catalog name.

rdblue · 2018-08-07T16:26:44Z

@cloud-fan, that's fine with me since #17185 is already merged. Would this conflict with #17185? We can just add a case that detects whether the first identifier in the seq is a catalog when updating expressions.
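As a rough sketch of that extra case, assuming some way to ask whether a name is a registered catalog (the `isCatalog` predicate, the `ColumnNameParts` object, and `splitCatalog` are all hypothetical and not part of Spark):

```scala
object ColumnNameParts {
  // Splits off the leading name part when it names a known catalog.
  // `isCatalog` is an assumed lookup; Spark's resolution logic is not shown.
  def splitCatalog(
      nameParts: Seq[String],
      isCatalog: String => Boolean): (Option[String], Seq[String]) =
    nameParts match {
      case head +: tail if tail.nonEmpty && isCatalog(head) => (Some(head), tail)
      case _ => (None, nameParts)
    }
}
```

With `Set("catalog1")` as the predicate, `catalog1.db1.table1.col1` would split into the catalog and the remaining `db1.table1.col1` parts, while `db1.table1.col1` would pass through unchanged.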

This PR is just the start for adding catalog to table identifiers. None of the SQL statements are modified in this PR on purpose: code paths will need to be updated to support CatalogTableIdentifier. This introduces the class so we can use type safety to ensure that CatalogTableIdentifier doesn't leak into the code paths that don't support it.

SparkQA · 2018-08-07T17:17:14Z

Test build #94375 has finished for PR 21978 at commit 00295ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UnresolvedRelation(table: CatalogTableIdentifier) extends LeafNode
sealed trait IdentifierWithOptionalDatabaseAndCatalog
case class CatalogTableIdentifier(table: String, database: Option[String], catalog: Option[String])
class TableIdentifier(table: String, db: Option[String])

rdblue · 2018-08-08T16:01:43Z

@cloud-fan, when do you think we can get this in? It doesn't need to go in 2.4 because it doesn't change any read or write paths -- nothing uses CatalogTableIdentifier yet -- but it would be great to get it into master so we can start building paths that do support CatalogTableIdentifier.

mccheah · 2018-11-22T01:09:24Z

Wanted to follow up here - are we planning on merging this or are there more things we need to discuss?

mccheah

I think this is fine, just some minor comments.

mccheah · 2018-11-29T17:11:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala

@@ -18,48 +18,106 @@
 package org.apache.spark.sql.catalyst

 /**
- * An identifier that optionally specifies a database.
+ * An identifier that optionally specifies a database and catalog.
 *
 * Format (unquoted): "name" or "db.name"


Update formats in these scaladocs.

mccheah · 2018-11-29T17:12:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala

- * unquotedString as the function name.
+ * Identifies a table in a database and catalog.
+ * If `database` is not defined, the current catalog's default database is used.
+ * If `catalog` is not defined, the current catalog is used.


"current" meaning "global"?

No, we want to move away from a special global catalog. I think that Spark should have a current catalog, like a current database, which is used to resolve references that don't have an explicit catalog. That would have a default, just like the current database has a default.

Sounds good. When we add the logical side of leveraging catalogs we can revisit the API of how to set the current catalog.

Agreed. This introduces the ability to expose a catalog to Spark. It doesn't actually add any user-facing operations.
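The fallback behavior discussed in this thread might be sketched as follows. Nothing here is an existing Spark API: `SessionState`, its fields, and `resolve` are hypothetical, and the choice to take the default database from the resolved catalog is an assumption.

```scala
object CurrentCatalogSketch {
  // Hypothetical session state: a current catalog plus a default database per catalog.
  final case class SessionState(
      currentCatalog: String,
      defaultDatabases: Map[String, String])

  // Fills in missing components; returns (catalog, database, table).
  def resolve(
      catalog: Option[String],
      database: Option[String],
      table: String,
      state: SessionState): (String, String, String) = {
    val resolvedCatalog = catalog.getOrElse(state.currentCatalog)
    // Throws NoSuchElementException if the catalog has no default database registered.
    val resolvedDatabase =
      database.getOrElse(state.defaultDatabases(resolvedCatalog))
    (resolvedCatalog, resolvedDatabase, table)
  }
}
```

The point of the sketch is that the current catalog is ordinary session state with a default, like the current database, rather than a special global catalog.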


rdblue · 2018-11-29T23:15:02Z

Rebased on master.

SparkQA · 2018-11-29T23:21:47Z

Test build #99480 has finished for PR 21978 at commit beebccf.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UnresolvedRelation(table: CatalogTableIdentifier) extends LeafNode
sealed trait IdentifierWithOptionalDatabaseAndCatalog
case class CatalogTableIdentifier(table: String, database: Option[String], catalog: Option[String])
class TableIdentifier(table: String, db: Option[String])

jzhuge · 2019-01-10T06:17:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala

  val identifier: String

  def database: Option[String]

+  def catalog: Option[String]


Default to None?

This is an abstract method definition and catalog is always implemented by a val.
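A minimal illustration of that point, using the trait members from the diff above. `ExampleIdentifier` is a hypothetical class for this sketch, not one from the PR.

```scala
sealed trait IdentifierWithOptionalDatabaseAndCatalog {
  val identifier: String
  def database: Option[String]
  // Abstract: there is no body, so every implementation must supply a value.
  def catalog: Option[String]
}

// Case class parameters are vals, so `catalog` here implements the abstract def.
final case class ExampleIdentifier(
    identifier: String,
    database: Option[String],
    catalog: Option[String]) extends IdentifierWithOptionalDatabaseAndCatalog
```

Leaving the method abstract forces each implementation to state its catalog explicitly, which a `None` default in the trait would not.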

mccheah · 2019-02-28T03:26:26Z

I think there's a failing build. Also, do we still need this, or have the underlying ideas changed in our discussion? My understanding is that we still need this and that catalog identifiers are an important starting point for building the follow-up table catalog APIs.

Also should this include multi-part identifier?

jzhuge · 2019-02-28T05:14:47Z

Hi Matt, agreed, we still need it. My PR for SPARK-26946 to implement multi-part identifiers will be built on top of this, because CatalogTableIdentifier at least provides a good way to incrementally migrate code paths. Please do not review that PR yet; I will update it in a few days, and then we will have a better picture. Thanks,


-- John Zhuge

rdblue · 2019-03-29T15:42:16Z

Identifiers for multi-catalog support were added in #23848. I'm closing this.