Skip to content

[SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... #2397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

ravipesala
Copy link
Contributor

This feature allows user to add cache table from the select query.
Example : CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE
Spark takes this type of SQL as command and it does lazy caching just like SQLContext.cacheTable, CACHE TABLE <name> does.
It can be executed from both SQLContext and HiveContext.

Recreated the pull request after rebasing with master.And fixed all the comments raised in previous pull requests.
#2381
#2390

Author : ravipesala ravindra.pesala@huawei.com

This feature allows user to add cache table from the select query.
Example : ADD CACHE TABLE <tableName> AS SELECT * FROM TEST_TABLE.
Spark takes this type of SQL as command and it does eager caching.
It can be executed from SQLContext and HiveContext.

Signed-off-by: ravipesala <ravindra.pesala@huawei.com>
This feature allows user to add cache table from the select query.
Example : ADD CACHE TABLE <tableName> AS SELECT * FROM TEST_TABLE.
Spark takes this type of SQL as command and it does eager caching.
It can be executed from SQLContext and HiveContext.

Signed-off-by: ravipesala <ravindra.pesala@huawei.com>
@SparkQA
Copy link

SparkQA commented Sep 15, 2014

Can one of the admins verify this patch?

@davies
Copy link
Contributor

davies commented Sep 15, 2014

ok to test.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have started for PR 2397 at commit c18aa38.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2397 at commit c18aa38.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) extends Command
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan)

CACHE ~ TABLE ~> ident ~ opt(AS ~ select) <~ opt(";") ^^ {
case tableName ~ None =>
CacheCommand(tableName, true)
case tableName ~ Some(as ~ plan) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove as ~ if we use opt(AS ~> select) in line 186.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion. I have updated it as per your comment.

@liancheng
Copy link
Contributor

To me the only important issue here is the laziness semantics of CACHE TABLE AS SELECT. I tend to make it lazy because SQLContext.cacheTable, CACHE TABLE <name> are both lazy. Making all three eager also seems acceptable, but this kinda breaks downward compatibility (or at least breaks existing performance assumptions of caching functions/statements).

Otherwise LGTM except for some minor issues mentioned in the comments.

@SparkQA
Copy link

SparkQA commented Sep 16, 2014

QA tests have started for PR 2397 at commit d6e469d.

  • This patch merges cleanly.

@ravipesala
Copy link
Contributor Author

Making lazy cache seems to be a good idea.As SQLContext.cacheTable, CACHE TABLE <name> are both lazy so It is better to make CACHE TABLE AS SELECT .. also lazy to avoid confusion. @marmbrus can comment on this.

@SparkQA
Copy link

SparkQA commented Sep 16, 2014

QA tests have finished for PR 2397 at commit d6e469d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) extends Command
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan)

@marmbrus
Copy link
Contributor

I agree with making it lazy. Can you also update the title and the description to reflect the final syntax?

@ravipesala ravipesala changed the title [SPARK-2594][SQL] Add CACHE TABLE <name> AS SELECT ... [SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... Sep 16, 2014
@ravipesala ravipesala changed the title [SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... [SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... Sep 16, 2014
@ravipesala
Copy link
Contributor Author

Changed the behavior from eager to lazy caching. And also updated the description.

@SparkQA
Copy link

SparkQA commented Sep 16, 2014

QA tests have started for PR 2397 at commit 8059cd2.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 16, 2014

QA tests have finished for PR 2397 at commit 8059cd2.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) extends Command
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan)

extends LeafNode with Command {

override protected[sql] lazy val sideEffectResult = {
sqlContext.catalog.registerTable(None, tableName, sqlContext.executePlan(plan).analyzed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Probably my final comment on this PR :) )

As described in PR #2382, we shouldn't store analyzed logical plan when registering tables any more (see here).

To prevent duplicated code, I'd suggest to import SQLContext._ so that we can leverage the implicit conversion from LogicalPlan to SchemaRDD, and then simply do this:

sqlContext.executePlan(plan).logical.registerTempTable(tableName)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment. It is a good idea to import sqlContext._. But we can simplify as below code if we import it. Please comment on it.

import sqlContext._
plan.registerTempTable(tableName)
cacheTable(tableName) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes you're right, we can use plan directly. And instead of importing sqlContext._, I'd import SQLContext._ in the import section at the begin of this file:

import org.apache.spark.sql.SQLContext._

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we cannot use the import org.apache.spark.sql.SQLContext._ at the beginning of file to use implicit. Because there is no object defined for SQLContext and implicits are only part of class SQLContext. We can only use the import on instance like import sqlContext._ Please correct me if I am wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again, you're right, I mistook sqlContext._ for SparkContext._...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code. Please review

@liancheng
Copy link
Contributor

LGTM except for the analyzed logical plan issue as mentioned in my last comment. Thanks for working on this!

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2397 at commit a5f0beb.

  • This patch merges cleanly.

@ravipesala
Copy link
Contributor Author

Updated as per comments. Please review.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2397 at commit a5f0beb.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) extends Command
    • case class CacheTableAsSelectCommand(tableName: String, logicalPlan: LogicalPlan)

@liancheng
Copy link
Contributor

@ravipesala Thanks for working on this! @marmbrus I think this is ready to go :)

@marmbrus
Copy link
Contributor

Thanks! I've merged this to master.

@asfgit asfgit closed this in 5522151 Sep 19, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants