
[SPARK-27322][SQL] DataSourceV2 table relation #24741


Closed
wants to merge 3 commits

Conversation

@jzhuge (Member) commented May 30, 2019

What changes were proposed in this pull request?

Support multi-catalog in the following SELECT code paths:

  • SELECT * FROM catalog.db.tbl
  • TABLE catalog.db.tbl
  • JOIN or UNION tables from different catalogs
  • SparkSession.table("catalog.db.tbl")
  • CTE relation
  • View text
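The catalog routing behind the first two paths can be sketched as follows (a minimal Scala sketch with a hypothetical catalog registry; not Spark's actual resolution code):

```scala
// Minimal sketch, not Spark's resolution code: route a multipart name like
// Seq("catalog", "db", "tbl") to a named catalog when its head matches a
// registered catalog, otherwise fall back to the session catalog (None).
object MultipartDemo {
  val knownCatalogs = Set("testcat")  // hypothetical registry of catalog plugins

  // Returns (catalogName, remainingNameParts).
  def resolve(parts: Seq[String]): (Option[String], Seq[String]) = parts match {
    case head +: rest if knownCatalogs.contains(head) => (Some(head), rest)
    case _                                            => (None, parts)
  }
}
```

Under this sketch, `SELECT * FROM testcat.db.tbl` resolves against the `testcat` plugin, while `SELECT * FROM db.tbl` stays with the session catalog.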

How was this patch tested?

New unit tests.
All existing unit tests in catalyst and sql core.

@SparkQA commented May 30, 2019

Test build #105947 has finished for PR 24741 at commit c504b2f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TableIdentifierHelper extends LookupCatalog
  • implicit class TableIdentifierHelper(parts: Seq[String])
  • case class UnresolvedRelation(table: TableIdentifierLike)
  • sealed trait TableIdentifierLike
  • case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
  • class AstBuilder(conf: SQLConf)

@jzhuge (Member, Author) commented May 30, 2019

There are 2 major design points in this PR:

  1. Create TableIdentifierLike interface to ease the migration from legacy TableIdentifier to CatalogTableIdentifier
  2. Resolve multipart table identifiers to CatalogTableIdentifier/TableIdentifier in AstBuilder. I don't quite like this choice, but it seems to minimize the changes in this PR compared to the alternatives listed below.

So far I have found these sore spots in the transition from TableIdentifier to CatalogTableIdentifier:

  • CTE relation
  • View text
  • Hints

Alternatives to resolving multipart table identifier in AstBuilder

  • Parser.parsePlan
  • Analyzer. We can switch to this choice after taking care of the sore spots.
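Design point 1 can be sketched roughly like this (stand-in types for TableCatalog and Identifier; a simplified sketch, not the actual commit's classes):

```scala
// Rough sketch of the TableIdentifierLike design (stand-in types, not the
// actual Spark classes): a sealed super-type lets plans carry either a legacy
// session-catalog identifier or a catalog-qualified v2 identifier.
object IdentifierSketch {
  case class TableCatalog(name: String)                       // stand-in for v2 TableCatalog
  case class Identifier(namespace: Seq[String], name: String) // stand-in for v2 Identifier

  sealed trait TableIdentifierLike
  case class TableIdentifier(table: String, database: Option[String] = None)
    extends TableIdentifierLike
  case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
    extends TableIdentifierLike

  // Code migrating from TableIdentifier can match on the common type.
  def describe(id: TableIdentifierLike): String = id match {
    case TableIdentifier(t, db) => (db.toSeq :+ t).mkString(".")
    case CatalogTableIdentifier(cat, ident) =>
      ((cat.name +: ident.namespace) :+ ident.name).mkString(".")
  }
}
```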

@SparkQA commented May 30, 2019

Test build #105973 has finished for PR 24741 at commit d259020.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TableIdentifierHelper extends LookupCatalog
  • implicit class TableIdentifierHelper(parts: Seq[String])
  • case class UnresolvedRelation(table: TableIdentifierLike)
  • sealed trait TableIdentifierLike
  • case class CatalogTableIdentifier(catalog: TableCatalog, ident: Identifier)
  • class AstBuilder(conf: SQLConf)

@jzhuge (Member, Author) commented May 31, 2019

Please hold off on review. I am testing changes suggested by Ryan to move resolution to the Analyzer.

@rdblue (Contributor) commented May 31, 2019

@jzhuge and I have been working on a version that does the table resolution in the analyzer instead of in AstBuilder, which should be cleaner because it keeps the parser code separate from the implementation.

@SparkQA commented Jun 1, 2019

Test build #106049 has finished for PR 24741 at commit b6eccd0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 1, 2019

Test build #106050 has finished for PR 24741 at commit daffc52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 2, 2019

Test build #106060 has finished for PR 24741 at commit c04d3b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jun 3, 2019

@cloud-fan @dongjoon-hyun @HyukjinKwon This PR is ready for review.

@rdblue and I have switched Relation Resolution from AstBuilder (as in the original commit) to DataSourceResolution in the Analyzer. The core logic is much simpler and cleaner. We were able to overcome the obstacles I described in the first comment, either with mostly minor fixes in this PR or in a separate PR #24763 for more thorough review.

@jzhuge (Member, Author) commented Jun 4, 2019

Rebased and squashed.

@SparkQA commented Jun 4, 2019

Test build #106159 has finished for PR 24741 at commit b1e04cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolveJoinStrategyHints(
  • case class UnresolvedRelation(multipartIdentifier: Seq[String]) extends LeafNode

@@ -153,7 +153,8 @@ object HiveAnalysis extends Rule[LogicalPlan] {
     case CreateTable(tableDesc, mode, None) if DDLUtils.isHiveTable(tableDesc) =>
       CreateTableCommand(tableDesc, ignoreIfExists = mode == SaveMode.Ignore)

-    case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
+    case CreateTable(tableDesc, mode, Some(query))
+      if DDLUtils.isHiveTable(tableDesc) && query.resolved =>
Member:

This sounds like a separate issue. Could we submit a separate PR?

Member Author:

This PR prevents lookupTableFromCatalog from throwing NoSuchTableException right away. Instead, it relies on checkAnalysis to throw an exception for UnresolvedRelation.

The test hive.SQLQuerySuite."double nested data" would fail on the following SQL without this change:

CREATE TABLE test_ctas_1234 AS SELECT * from notexists

HiveAnalysis runs before checkAnalysis, which exposes this bug: query.output is used before query is resolved.

So I wouldn't say it is a totally separate issue. In addition, outside of this PR it would be hard to write a unit test.

Contributor:

I agree with John. This is needed as a consequence of fixing the ResolveRelations rule so that it no longer throws AnalysisException when it can't resolve a name and doesn't think that ResolveSQLOnFile would resolve it either.
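The effect of the query.resolved guard in the diff above can be illustrated with a toy model (not Spark code): a rule that reads the child's output must not fire before the child is resolved.

```scala
// Toy illustration of why the `query.resolved` guard matters: reading
// query.output on an unresolved plan would crash inside the rule, instead of
// letting checkAnalysis report the missing table with a friendly message.
object GuardDemo {
  sealed trait Plan { def resolved: Boolean; def output: Seq[String] }
  case class Unresolved(name: String) extends Plan {
    val resolved = false
    def output: Seq[String] = throw new IllegalStateException(s"$name is not resolved")
  }
  case class Resolved(cols: Seq[String]) extends Plan {
    val resolved = true
    def output: Seq[String] = cols
  }

  // Guarded rule: skips unresolved queries so a later check can report them.
  def planCtas(query: Plan): Option[Seq[String]] =
    if (query.resolved) Some(query.output) else None
}
```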

@@ -185,6 +186,8 @@ abstract class BaseSessionStateBuilder(
V2WriteSupportCheck +:
V2StreamingScanSupportCheck +:
customCheckRules

+  override protected def lookupCatalog(name: String): CatalogPlugin = session.catalog(name)
Member:

Why not catalog(name)? Any difference between catalog(name) and session.catalog(name)?

Member Author:

SparkSession.catalog returns CatalogPlugin for DSv2 while BaseSessionStateBuilder.catalog is a SessionCatalog.
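The distinction can be sketched with stand-in types (not Spark's API): the session state builder holds a single v1 SessionCatalog, while the session resolves v2 CatalogPlugin instances by name.

```scala
// Toy sketch of the two kinds of catalogs under discussion (stand-in types,
// not Spark's API): BaseSessionStateBuilder.catalog is the v1 SessionCatalog,
// while SparkSession.catalog(name) looks up a v2 CatalogPlugin by name.
object CatalogKinds {
  trait CatalogPlugin { def name: String }      // v2 plugin interface (stand-in)
  case class NamedPlugin(name: String) extends CatalogPlugin

  class Session(plugins: Map[String, CatalogPlugin]) {
    // Analogous to session.catalog(name): resolves a catalog plugin by name.
    def catalog(name: String): CatalogPlugin =
      plugins.getOrElse(name, throw new RuntimeException(s"Catalog not found: $name"))
  }
}
```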

Member:

I might not be getting the difference. How do we avoid misusing them?

  • Is one our global catalog and the other the local catalogs?
  • What are the semantics of lookupCatalog for SessionCatalog?
  • What are the semantics of lookupCatalog for a DSv2 CatalogPlugin?

Member Author:

Good questions. Please check out @rdblue's #24768 first.

Contributor:

#24768 passes a single LookupCatalog (the analyzer), as you suggested in the other comment. I like using the same lookup everywhere instead of having multiple classes implement LookupCatalog.

// Note that if the database is not defined, it is possible we are looking up a temp view.
case e: NoSuchDatabaseException =>
u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " +
s"database ${e.db} doesn't exist.", e)
Member:

The original error messages are still helpful. Let us keep them.

Member Author:

Unfortunately that is not possible, since the AnalysisException is now thrown by checkAnalysis:

      case u: UnresolvedRelation =>
        u.failAnalysis(s"Table or view not found: ${u.multipartIdentifier.quoted}")

Contributor:

@gatorsmile, we plan to update checkAnalysis to produce more friendly error messages, but not until #24560 is merged. Without that, we can't check whether the namespace exists to produce the right error message.

I should also note that checkAnalysis is the right place for the exception to be thrown. Individual rules should not fail analysis. In this case, a different rule for looking up tables in v2 catalogs is used. And later, an UnresolvedRelation could be resolved by an independent ResolveViews rule. Allowing these rules to be separate makes them smaller and doesn't mix view handling and table handling, as we see in this current rule.
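The division of labor described here can be sketched with a toy model (not Spark code): resolution rules are best-effort and never throw, while checkAnalysis is the one place that fails on nodes left unresolved.

```scala
// Toy sketch of the rule-vs-check separation: a resolution rule resolves what
// it can and returns everything else unchanged; only the final analysis check
// throws on leftover unresolved relations.
object AnalysisSketch {
  sealed trait Rel
  case class UnresolvedRelation(parts: Seq[String]) extends Rel
  case class ResolvedTable(name: String) extends Rel

  // A resolution rule: best-effort, never fails analysis itself.
  def resolveTables(known: Set[String])(rel: Rel): Rel = rel match {
    case UnresolvedRelation(parts) if known.contains(parts.mkString(".")) =>
      ResolvedTable(parts.mkString("."))
    case other => other
  }

  // checkAnalysis: the single place that reports unresolved relations.
  def checkAnalysis(rel: Rel): Unit = rel match {
    case UnresolvedRelation(parts) =>
      throw new RuntimeException(s"Table or view not found: ${parts.mkString(".")}")
    case _ => ()
  }
}
```

Because the rule leaves unknown names untouched, another rule (for example a future ResolveViews) still gets a chance to resolve them before the check runs.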

@gatorsmile (Member):

I did a quick pass and left a few comments. @jzhuge Thank you for your work!

@jiangxb1987 Please take a look, especially at the test case coverage.

@jzhuge (Member, Author) commented Jun 5, 2019

Rebased and addressed review comments.

@SparkQA commented Jun 5, 2019

Test build #106190 has finished for PR 24741 at commit 62811ff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRelation(multipartIdentifier: Seq[String]) extends LeafNode

@SparkQA commented Jun 11, 2019

Test build #106392 has finished for PR 24741 at commit 01b720e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jzhuge (Member, Author) commented Jun 12, 2019

Rebased and squashed.

@SparkQA commented Jun 12, 2019

Test build #106397 has finished for PR 24741 at commit 62b76e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 12, 2019

Test build #106399 has finished for PR 24741 at commit 2568288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

try {
  Option(catalog.asTableCatalog.loadTable(ident))
} catch {
  case _: NoSuchTableException => None
}
Member:

So, we return None for NoSuchTableException only and propagate exceptions for all catalog errors like CatalogNotFoundException from loadTable and AnalysisException from asTableCatalog?

@jzhuge (Member, Author) commented Jun 12, 2019:

Yes.

BTW, I don't think TableCatalog.loadTable throws CatalogNotFoundException, because the catalog plugin has already been found.
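The pattern in the snippet under discussion can be sketched with stand-in types (not Spark's API): only NoSuchTableException maps to None, and any other failure propagates to the caller.

```scala
// Sketch of the loadTable pattern: a missing table becomes None, while other
// catalog errors are deliberately not swallowed and propagate unchanged.
object LoadTableSketch {
  class NoSuchTableException(name: String) extends Exception(name)
  case class Table(name: String)

  class TableCatalog(tables: Map[String, Table]) {
    def loadTable(name: String): Table =
      tables.getOrElse(name, throw new NoSuchTableException(name))
  }

  def tryLoad(catalog: TableCatalog, name: String): Option[Table] =
    try Option(catalog.loadTable(name))
    catch { case _: NoSuchTableException => None }
}
```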

*
* [[ResolveRelations]] still resolves v1 tables.
*/
object ResolveTables extends Rule[LogicalPlan] {
@dongjoon-hyun (Member) commented Jun 12, 2019:

Can we use ResolveV2Relations instead, in order to avoid this confusion?

-   * Resolve table relations with concrete relations from v2 catalog.
-   *
-   * [[ResolveRelations]] still resolves v1 tables.
+   * Replaces [[UnresolvedRelation]]s with concrete relations from the v2 catalog.
    */
-  object ResolveTables extends Rule[LogicalPlan] {
+  object ResolveV2Relations extends Rule[LogicalPlan] {

Member:

Please ignore the above comment.

Member Author:

I named it ResolveTables because there may be a new ResolveViews rule down the road as part of the ViewCatalog effort. More details to come.

@SparkQA commented Jun 12, 2019

Test build #106439 has finished for PR 24741 at commit 2720d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1694,8 +1694,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       e = intercept[AnalysisException] {
         sql(s"select id from `org.apache.spark.sql.sources.HadoopFsRelationProvider`.`file_path`")
       }
-      assert(e.message.contains("Table or view not found: " +
-        "`org.apache.spark.sql.sources.HadoopFsRelationProvider`.`file_path`"))
+      assert(e.message.contains("Table or view not found"))
Member:

Nit. Shall we keep the original form because only backticks are gone?

    assert(e.message.contains("Table or view not found: " +
      "`org.apache.spark.sql.sources.HadoopFsRelationProvider`.file_path"))

@dongjoon-hyun (Member):

Mostly looks correct. We need to fix the INSERT OVERWRITE DIR case; the others are minor for now. Thank you, @jzhuge.

@SparkQA commented Jun 13, 2019

Test build #106449 has finished for PR 24741 at commit b8cdf6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment:

LGTM

        u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " +
          s"database ${e.db} doesn't exist.", e)
      case _: NoSuchTableException | _: NoSuchDatabaseException =>
        u
Contributor:

We should add some comments to explain why we need to delay the exception here. To me it's because we still have a chance to resolve the table relation with v2 rules.

@cloud-fan (Contributor) commented Jun 13, 2019

I have only one comment, about adding more code comments, which can be addressed later. I'm merging this to unblock the DS v2 project. Thanks for your hard work @jzhuge @rdblue!

@cloud-fan cloud-fan closed this in abe370f Jun 13, 2019
@jzhuge (Member, Author) commented Jun 13, 2019

Thanks @cloud-fan @dongjoon-hyun @gatorsmile @rdblue for the excellent reviews! Thanks @rdblue for the great help!

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
Closes apache#24741 from jzhuge/SPARK-27322-pr.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>