[SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... #2397
Conversation
This feature allows users to add a cached table from a select query. Example: ADD CACHE TABLE <tableName> AS SELECT * FROM TEST_TABLE. Spark takes this type of SQL as a command and does eager caching. It can be executed from both SQLContext and HiveContext. Signed-off-by: ravipesala <ravindra.pesala@huawei.com>
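For context, a minimal sketch of how such a command would be issued from a SQLContext (Spark 1.1-era API; the Record case class, the sample data, and the table names are illustrative, and the syntax was later changed to CACHE TABLE <name> AS SELECT ...):

case class Record(key: Int, value: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

// Register a source table to select from.
sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))
  .registerTempTable("TEST_TABLE")

// Cache the result of the SELECT under a new table name, then query it.
sqlContext.sql("CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE")
sqlContext.sql("SELECT COUNT(*) FROM testCacheTable").collect()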
Can one of the admins verify this patch?
ok to test.
QA tests have started for PR 2397 at commit
QA tests have finished for PR 2397 at commit
CACHE ~ TABLE ~> ident ~ opt(AS ~ select) <~ opt(";") ^^ {
  case tableName ~ None =>
    CacheCommand(tableName, true)
  case tableName ~ Some(as ~ plan) =>
We can remove as ~ if we use opt(AS ~> select) in line 186.
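To illustrate the suggestion with a standalone sketch (using scala-parser-combinators; the KeywordParser object and its parsers are made-up names): a ~ b keeps both results and must be destructured, while a ~> b discards the left-hand one, so the AS keyword no longer appears in the case pattern.

import scala.util.parsing.combinator.RegexParsers

object KeywordParser extends RegexParsers {
  val ident: Parser[String] = """[a-zA-Z_]\w*""".r

  // With ~ the keyword is part of the result and must be matched away:
  val withKeyword: Parser[String] = "AS" ~ ident ^^ { case as ~ name => name }

  // With ~> only the identifier survives, no destructuring needed:
  val identOnly: Parser[String] = "AS" ~> ident
}

// KeywordParser.parse(KeywordParser.identOnly, "AS t1").get == "t1"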
Thank you for your suggestion. I have updated it as per your comment.
To me the only important issue here is the laziness semantics of CACHE TABLE <name> AS SELECT .... Otherwise LGTM except for some minor issues mentioned in the comments.
QA tests have started for PR 2397 at commit
Making the cache lazy seems to be a good idea.
QA tests have finished for PR 2397 at commit
I agree with making it lazy. Can you also update the title and the description to reflect the final syntax, CACHE TABLE <name> AS SELECT ...?
Changed the behavior from eager to lazy caching, and also updated the description.
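Illustratively, the difference between the two semantics discussed above (assuming a registered TEST_TABLE; table names are taken from the examples in this thread):

// Lazy caching (the merged behavior): the command only registers the table
// and marks it for caching; it returns immediately without running a job.
sqlContext.sql("CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE")

// Materialization happens on the first scan of the cached table ...
sqlContext.sql("SELECT COUNT(*) FROM testCacheTable").collect()

// ... after which subsequent queries read from the in-memory cache.
// Eager caching (the original proposal) would instead have run the SELECT
// and populated the cache inside the CACHE TABLE command itself.
sqlContext.sql("SELECT * FROM testCacheTable WHERE key > 10").collect()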
QA tests have started for PR 2397 at commit
QA tests have finished for PR 2397 at commit
extends LeafNode with Command {

  override protected[sql] lazy val sideEffectResult = {
    sqlContext.catalog.registerTable(None, tableName, sqlContext.executePlan(plan).analyzed)
(Probably my final comment on this PR :) )
As described in PR #2382, we shouldn't store the analyzed logical plan when registering tables any more (see here). To prevent duplicated code, I'd suggest importing SQLContext._ so that we can leverage the implicit conversion from LogicalPlan to SchemaRDD, and then simply do this:
sqlContext.executePlan(plan).logical.registerTempTable(tableName)
Thank you for your comment. It is a good idea to import sqlContext._. But we can simplify to the code below if we import it. Please comment on it:
import sqlContext._
plan.registerTempTable(tableName)
cacheTable(tableName)
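Put together, a hedged sketch of what the command body would look like after this change (the class name and the run-method scaffolding are illustrative, assembled from the snippets in this thread; importing sqlContext._ supplies both cacheTable and the LogicalPlan-to-SchemaRDD implicit):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative shape only; the real command extends LeafNode with Command.
case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) {
  def run(sqlContext: SQLContext): Unit = {
    import sqlContext._                // LogicalPlan -> SchemaRDD implicit, cacheTable
    plan.registerTempTable(tableName)  // register the unanalyzed plan
    cacheTable(tableName)              // mark the table for lazy caching
  }
}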
Ah, yes, you're right, we can use plan directly. And instead of importing sqlContext._, I'd import SQLContext._ in the import section at the beginning of this file:
import org.apache.spark.sql.SQLContext._
It seems we cannot use import org.apache.spark.sql.SQLContext._ at the beginning of the file to use the implicits, because there is no object defined for SQLContext and the implicits are only part of the class SQLContext. We can only use the import on an instance, like import sqlContext._. Please correct me if I am wrong.
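A small standalone sketch of this Scala point (Ctx, intToLabel, and Demo are made-up names): implicits declared inside a class belong to each instance, so they can only be imported from a value, never from a nonexistent companion object.

import scala.language.implicitConversions

class Ctx {
  implicit def intToLabel(i: Int): String = s"label_$i"
}

object Demo {
  val ctx = new Ctx
  import ctx._            // fine: imports instance members, implicits included
  // import Ctx._         // does not compile: there is no object Ctx
  val s: String = 42      // resolved via the imported implicit conversion
}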
Sorry again, you're right, I mistook sqlContext._ for SparkContext._ ...
Updated the code. Please review.
LGTM except for the analyzed logical plan issue as mentioned in my last comment. Thanks for working on this!
QA tests have started for PR 2397 at commit
Updated as per comments. Please review.
QA tests have finished for PR 2397 at commit
@ravipesala Thanks for working on this! @marmbrus I think this is ready to go :)
Thanks! I've merged this to master.
This feature allows users to cache a table from a select query.
Example:
CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE
Spark takes this type of SQL as a command and does lazy caching, just like SQLContext.cacheTable and CACHE TABLE <name> do. It can be executed from both SQLContext and HiveContext.
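For illustration, the command is roughly equivalent to the following two-step form using the existing APIs (table names are taken from the example above; testCacheTable2 is a made-up name for the comparison):

// One step, via the new command:
sqlContext.sql("CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE")

// Roughly equivalent two-step form:
sqlContext.sql("SELECT * FROM TEST_TABLE").registerTempTable("testCacheTable2")
sqlContext.cacheTable("testCacheTable2")  // lazy: materialized on first use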
Recreated the pull request after rebasing with master, and fixed all the comments raised in the previous pull requests:
#2381
#2390
Author: ravipesala ravindra.pesala@huawei.com