
[SPARK-27322][SQL] DataSourceV2 table relation #24741


Closed
wants to merge 3 commits

Conversation

@jzhuge (Member) commented May 30, 2019

What changes were proposed in this pull request?

Support multi-catalog in the following SELECT code paths:

  • SELECT * FROM catalog.db.tbl
  • TABLE catalog.db.tbl
  • JOIN or UNION tables from different catalogs
  • SparkSession.table("catalog.db.tbl")
  • CTE relation
  • View text
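The catalog routing behind the first two paths can be sketched as follows (a minimal Scala sketch with a hypothetical catalog registry; not Spark's actual resolution code):

```scala
// Minimal sketch, not Spark's resolution code: route a multipart name like
// Seq("catalog", "db", "tbl") to a named catalog when its head matches a
// registered catalog, otherwise fall back to the session catalog (None).
object MultipartDemo {
  val knownCatalogs = Set("testcat")  // hypothetical registry of catalog plugins

  // Returns (catalogName, remainingNameParts).
  def resolve(parts: Seq[String]): (Option[String], Seq[String]) = parts match {
    case head +: rest if knownCatalogs.contains(head) => (Some(head), rest)
    case _                                            => (None, parts)
  }
}
```

Under this sketch, `SELECT * FROM testcat.db.tbl` resolves against the `testcat` plugin, while `SELECT * FROM db.tbl` stays with the session catalog.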

How was this patch tested?

New unit tests.
All existing unit tests in catalyst and sql core.

@SparkQA commented May 30, 2019

Test build #105947 has finished for PR 24741 at commit c504b2f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TableIdentifierHelper extends LookupCatalog
  • implicit class TableIdentifierHelper(parts: Seq[String])
  • case class UnresolvedRelation(table: TableIdentifierLike)
  • sealed trait TableIdentifierLike
  • case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
  • class AstBuilder(conf: SQLConf)

@jzhuge (Member, Author) commented May 30, 2019

There are 2 major design points in this PR:

  1. Create TableIdentifierLike interface to ease the migration from legacy TableIdentifier to CatalogTableIdentifier
  2. Resolve multipart table identifiers to CatalogTableIdentifier/TableIdentifier in AstBuilder. I don't quite like this choice, but it seems to minimize the changes in this PR compared to the alternatives listed below.

So far I have found these sore spots in the transition from TableIdentifier to CatalogTableIdentifier:

  • CTE relation
  • View text
  • Hints

Alternatives to resolving multipart table identifier in AstBuilder

  • Parser.parsePlan
  • Analyzer. We can switch to this choice after taking care of the sore spots.
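Design point 1 can be sketched roughly like this (stand-in types for TableCatalog and Identifier; a simplified sketch, not the actual commit's classes):

```scala
// Rough sketch of the TableIdentifierLike design (stand-in types, not the
// actual Spark classes): a sealed super-type lets plans carry either a legacy
// session-catalog identifier or a catalog-qualified v2 identifier.
object IdentifierSketch {
  case class TableCatalog(name: String)                       // stand-in for v2 TableCatalog
  case class Identifier(namespace: Seq[String], name: String) // stand-in for v2 Identifier

  sealed trait TableIdentifierLike
  case class TableIdentifier(table: String, database: Option[String] = None)
    extends TableIdentifierLike
  case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
    extends TableIdentifierLike

  // Code migrating from TableIdentifier can match on the common type.
  def describe(id: TableIdentifierLike): String = id match {
    case TableIdentifier(t, db) => (db.toSeq :+ t).mkString(".")
    case CatalogTableIdentifier(cat, ident) =>
      ((cat.name +: ident.namespace) :+ ident.name).mkString(".")
  }
}
```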

@SparkQA commented May 30, 2019

Test build #105973 has finished for PR 24741 at commit d259020.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TableIdentifierHelper extends LookupCatalog
  • implicit class TableIdentifierHelper(parts: Seq[String])
  • case class UnresolvedRelation(table: TableIdentifierLike)
  • sealed trait TableIdentifierLike
  • case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
  • class AstBuilder(conf: SQLConf)

@jzhuge (Member, Author) commented May 31, 2019

Please hold off on review. I am testing changes suggested by Ryan to move resolution to the Analyzer.

@rdblue (Contributor) commented May 31, 2019

@jzhuge and I have been working on a version that does the table resolution in the analyzer instead of in AstBuilder, which should be cleaner because it keeps the parser code separate from the implementation.

@SparkQA commented Jun 1, 2019

Test build #106049 has finished for PR 24741 at commit b6eccd0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 1, 2019

Test build #106050 has finished for PR 24741 at commit daffc52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 2, 2019

Test build #106060 has finished for PR 24741 at commit c04d3b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jun 3, 2019

@cloud-fan @dongjoon-hyun @HyukjinKwon This PR is ready for review.

@rdblue and I have switched Relation Resolution from AstBuilder (as in the original commit) to DataSourceResolution in the Analyzer. The core logic is much simpler and cleaner. We were able to overcome the obstacles I described in the first comment, either with mostly minor fixes in this PR or in a separate PR #24763 for more thorough review.

@jzhuge (Member, Author) commented Jun 4, 2019

Rebased and squashed.

@SparkQA commented Jun 4, 2019

Test build #106159 has finished for PR 24741 at commit b1e04cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolveJoinStrategyHints(
  • case class UnresolvedRelation(multipartIdentifier: Seq[String]) extends LeafNode

@@ -153,7 +153,8 @@ object HiveAnalysis extends Rule[LogicalPlan] {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isHiveTable(tableDesc) =>
       CreateTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)

-    case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
+    case CreateTable(tableDesc, mode, Some(query))
+      if DDLUtils.isHiveTable(tableDesc) && query.resolved =>
Member:

This sounds like a separate issue. Could we submit a separate PR?

Member Author:

This PR prevents lookupTableFromCatalog from throwing NoSuchTableException right away. Instead, it relies on checkAnalysis to throw an exception for UnresolvedRelation.

The test hive.SQLQuerySuite."double nested data" would fail on the following SQL without this change:

CREATE TABLE test_ctas_1234 AS SELECT * from notexists

HiveAnalysis runs before checkAnalysis, which exposes this bug: query.output is used before query is resolved.

So I wouldn't say it is a totally separate issue. In addition, outside of this PR it would be hard to write a unit test.

Contributor:

I agree with John. This is needed as a consequence of fixing the ResolveRelations rule so that it no longer throws AnalysisException when it can't resolve a name and doesn't think that ResolveSQLOnFile would resolve it either.
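The effect of the query.resolved guard in the diff above can be illustrated with a toy model (not Spark code): a rule that reads the child's output must not fire before the child is resolved.

```scala
// Toy illustration of why the `query.resolved` guard matters: reading
// query.output on an unresolved plan would crash inside the rule, instead of
// letting checkAnalysis report the missing table with a friendly message.
object GuardDemo {
  sealed trait Plan { def resolved: Boolean; def output: Seq[String] }
  case class Unresolved(name: String) extends Plan {
    val resolved = false
    def output: Seq[String] = throw new IllegalStateException(s"$name is not resolved")
  }
  case class Resolved(cols: Seq[String]) extends Plan {
    val resolved = true
    def output: Seq[String] = cols
  }

  // Guarded rule: skips unresolved queries so a later check can report them.
  def planCtas(query: Plan): Option[Seq[String]] =
    if (query.resolved) Some(query.output) else None
}
```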

@@ -185,6 +186,8 @@ abstract class BaseSessionStateBuilder(
V2WriteSupportCheck +:
V2StreamingScanSupportCheck +:
customCheckRules

+  override protected def lookupCatalog(name: String): CatalogPlugin = session.catalog(name)
Member:

Why not catalog(name)? Any difference between catalog(name) and session.catalog(name)?

Member Author:

SparkSession.catalog returns CatalogPlugin for DSv2 while BaseSessionStateBuilder.catalog is a SessionCatalog.
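The distinction can be sketched with stand-in types (not Spark's API): the session state builder holds a single v1 SessionCatalog, while the session resolves v2 CatalogPlugin instances by name.

```scala
// Toy sketch of the two kinds of catalogs under discussion (stand-in types,
// not Spark's API): BaseSessionStateBuilder.catalog is the v1 SessionCatalog,
// while SparkSession.catalog(name) looks up a v2 CatalogPlugin by name.
object CatalogKinds {
  trait CatalogPlugin { def name: String }      // v2 plugin interface (stand-in)
  case class NamedPlugin(name: String) extends CatalogPlugin

  class Session(plugins: Map[String, CatalogPlugin]) {
    // Analogous to session.catalog(name): resolves a catalog plugin by name.
    def catalog(name: String): CatalogPlugin =
      plugins.getOrElse(name, throw new RuntimeException(s"Catalog not found: $name"))
  }
}
```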

Member:

I might not be getting the difference. How do we avoid misusing them?

  • Is one our global catalog and the other the local catalogs?
  • What are the semantics of lookupCatalog for SessionCatalog?
  • What are the semantics of lookupCatalog for a DSv2 CatalogPlugin?

Member Author:

Good questions. Please check out @rdblue's #24768 first.

Contributor:

#24768 passes a single LookupCatalog (the analyzer), as you suggested in the other comment. I like using the same lookup everywhere instead of having multiple classes implement LookupCatalog.

// Note that if the database is not defined, it is possible we are looking up a temp view.
case e: NoSuchDatabaseException =>
u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " +
s"database ${e.db} doesn't exist.", e)
Member:

The original error messages are still helpful. Let us keep them.

Member Author:

Unfortunately that is not possible, since the AnalysisException is now thrown by checkAnalysis:

      case u: UnresolvedRelation =>
        u.failAnalysis(s"Table or view not found: ${u.multipartIdentifier.quoted}")

Contributor:

@gatorsmile, we plan to update checkAnalysis to produce more friendly error messages, but not until #24560 is merged. Without that, we can't check whether the namespace exists to produce the right error message.

I should also note that checkAnalysis is the right place for the exception to be thrown. Individual rules should not fail analysis. In this case, a different rule for looking up tables in v2 catalogs is used. And later, an UnresolvedRelation could be resolved by an independent ResolveViews rule. Allowing these rules to be separate makes them smaller and doesn't mix view handling and table handling, as we see in this current rule.
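The division of labor described here can be sketched with a toy model (not Spark code): resolution rules are best-effort and never throw, while checkAnalysis is the one place that fails on nodes left unresolved.

```scala
// Toy sketch of the rule-vs-check separation: a resolution rule resolves what
// it can and returns everything else unchanged; only the final analysis check
// throws on leftover unresolved relations.
object AnalysisSketch {
  sealed trait Rel
  case class UnresolvedRelation(parts: Seq[String]) extends Rel
  case class ResolvedTable(name: String) extends Rel

  // A resolution rule: best-effort, never fails analysis itself.
  def resolveTables(known: Set[String])(rel: Rel): Rel = rel match {
    case UnresolvedRelation(parts) if known.contains(parts.mkString(".")) =>
      ResolvedTable(parts.mkString("."))
    case other => other
  }

  // checkAnalysis: the single place that reports unresolved relations.
  def checkAnalysis(rel: Rel): Unit = rel match {
    case UnresolvedRelation(parts) =>
      throw new RuntimeException(s"Table or view not found: ${parts.mkString(".")}")
    case _ => ()
  }
}
```

Because the rule leaves unknown names untouched, another rule (for example a future ResolveViews) still gets a chance to resolve them before the check runs.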

@gatorsmile (Member):

I did a quick pass and left a few comments. @jzhuge Thank you for your work!

@jiangxb1987 Please take a look, especially at the test case coverage.

@jzhuge (Member, Author) commented Jun 5, 2019

Rebased and addressed review comments.

@SparkQA commented Jun 5, 2019

Test build #106190 has finished for PR 24741 at commit 62811ff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRelation(multipartIdentifier: Seq[String]) extends LeafNode

@SparkQA commented Jun 11, 2019

Test build #106392 has finished for PR 24741 at commit 01b720e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jun 12, 2019

Rebased and squashed.

@SparkQA commented Jun 12, 2019

Test build #106397 has finished for PR 24741 at commit 62b76e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 12, 2019

Test build #106399 has finished for PR 24741 at commit 2568288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

try {
  Option(catalog.asTableCatalog.loadTable(ident))
} catch {
  case _: NoSuchTableException => None
}
Member:

So, we return None for NoSuchTableException only and propagate exceptions for all catalog errors like CatalogNotFoundException from loadTable and AnalysisException from asTableCatalog?

@jzhuge (Member, Author) commented Jun 12, 2019:

Yes.

BTW, I don't think TableCatalog.loadTable throws CatalogNotFoundException, because the catalog plugin has already been found.
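The pattern in the snippet under discussion can be sketched with stand-in types (not Spark's API): only NoSuchTableException maps to None, and any other failure propagates to the caller.

```scala
// Sketch of the loadTable pattern: a missing table becomes None, while other
// catalog errors are deliberately not swallowed and propagate unchanged.
object LoadTableSketch {
  class NoSuchTableException(name: String) extends Exception(name)
  case class Table(name: String)

  class TableCatalog(tables: Map[String, Table]) {
    def loadTable(name: String): Table =
      tables.getOrElse(name, throw new NoSuchTableException(name))
  }

  def tryLoad(catalog: TableCatalog, name: String): Option[Table] =
    try Option(catalog.loadTable(name))
    catch { case _: NoSuchTableException => None }
}
```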

*
* [[ResolveRelations]] still resolves v1 tables.
*/
object ResolveTables extends Rule[LogicalPlan] {
@dongjoon-hyun (Member) commented Jun 12, 2019:

Can we use ResolveV2Relations instead, in order to avoid this confusion?

-   * Resolve table relations with concrete relations from v2 catalog.
-   *
-   * [[ResolveRelations]] still resolves v1 tables.
+   * Replaces [[UnresolvedRelation]]s with concrete relations from the v2 catalog.
    */
-  object ResolveTables extends Rule[LogicalPlan] {
+  object ResolveV2Relations extends Rule[LogicalPlan] {

Member:

Please ignore the above comment.

Member Author:

I named it ResolveTables because there may be a new ResolveViews rule down the road as part of the ViewCatalog effort. More details to come.

@SparkQA commented Jun 12, 2019

Test build #106439 has finished for PR 24741 at commit 2720d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1694,8 +1694,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       e = intercept[AnalysisException] {
         sql(s"select id from `org.apache.spark.sql.sources.HadoopFsRelationProvider`.`file_path`")
       }
-      assert(e.message.contains("Table or view not found: " +
-        "`org.apache.spark.sql.sources.HadoopFsRelationProvider`.`file_path`"))
+      assert(e.message.contains("Table or view not found"))
Member:

Nit. Shall we keep the original form because only backticks are gone?

    assert(e.message.contains("Table or view not found: " +
      "`org.apache.spark.sql.sources.HadoopFsRelationProvider`.file_path"))

@dongjoon-hyun (Member):

Mostly looks correct. We need to fix the INSERT OVERWRITE DIR case; the others are minor for now. Thank you, @jzhuge.

@SparkQA commented Jun 13, 2019

Test build #106449 has finished for PR 24741 at commit b8cdf6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment:

LGTM

        u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " +
          s"database ${e.db} doesn't exist.", e)
      case _: NoSuchTableException | _: NoSuchDatabaseException =>
        u
Contributor:

We should add some comments to explain why we need to delay the exception here. To me it's because we still have a chance to resolve the table relation with v2 rules.

@cloud-fan (Contributor) commented Jun 13, 2019

I have only one comment, about adding more code comments, which can be addressed later. I'm merging this to unblock the DS v2 project. Thanks for your hard work @jzhuge @rdblue!

@cloud-fan cloud-fan closed this in abe370f Jun 13, 2019
@jzhuge (Member, Author) commented Jun 13, 2019

Thanks @cloud-fan @dongjoon-hyun @gatorsmile @rdblue for the excellent reviews! Thanks @rdblue for the great help!

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
Closes apache#24741 from jzhuge/SPARK-27322-pr.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>