[SPARK-2594][SQL] Add CACHE TABLE <name> AS SELECT ... #2381

ravipesala · 2014-09-13T18:23:18Z

This feature allows users to cache a table from a SELECT query.
Example: ADD CACHE TABLE <tableName> AS SELECT * FROM TEST_TABLE.
Spark treats this type of SQL as a command and caches eagerly.
It can be executed from both SQLContext and HiveContext.

Author : ravipesala ravindra.pesala@huawei.com


SparkQA · 2014-09-13T18:27:07Z

Can one of the admins verify this patch?

marmbrus · 2014-09-13T18:33:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala

@@ -181,6 +182,12 @@ class SqlParser extends StandardTokenParsers with PackratParsers {
        val overwrite: Boolean = o.getOrElse("") == "OVERWRITE"
        InsertIntoTable(r, Map[String, Option[String]](), s, overwrite)
    }
+
+  protected lazy val addCache: Parser[LogicalPlan] =
+    ADD ~ CACHE ~ TABLE ~> ident ~ AS ~ select <~ opt(";") ^^ {


Sorry for the confusion, I had intended the syntax to be CACHE TABLE AS SELECT ... to match CREATE TABLE AS SELECT. The "Add" was just about adding support to Spark SQL.

ravipesala · 2014-09-15

Thanks for your comments. Sorry for the misunderstanding; I have updated the syntax to CACHE TABLE AS SELECT ...
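For illustration only, here is a minimal Scala sketch of the statement shape agreed on in this thread: the `CACHE TABLE` keywords, a table name, `AS`, and a `SELECT` query, with an optional trailing semicolon as in the `opt(";")` of the diff. It uses a plain regular expression rather than the parser combinators in `SqlParser`, and `CacheTableAsSelect`, `CacheTableSyntax`, and `parse` are hypothetical names that are not part of this patch or of Spark.

```scala
import scala.util.matching.Regex

// Hypothetical result type for this sketch; it is NOT the Catalyst LogicalPlan used in the PR.
final case class CacheTableAsSelect(tableName: String, query: String)

object CacheTableSyntax {
  // (?i) makes the keywords case-insensitive, (?s) lets the query span several lines.
  // Group 1 is the table name, group 2 is the SELECT text without a trailing semicolon.
  private val Pattern: Regex =
    """(?is)\s*CACHE\s+TABLE\s+([A-Za-z_][A-Za-z0-9_]*)\s+AS\s+(SELECT\b.*?)\s*;?\s*""".r

  // Returns Some(...) only when the whole statement has the CACHE TABLE <name> AS SELECT shape.
  def parse(sql: String): Option[CacheTableAsSelect] = sql match {
    case Pattern(name, query) => Some(CacheTableAsSelect(name, query))
    case _                    => None
  }
}
```

Under these assumptions, `CacheTableSyntax.parse("CACHE TABLE t AS SELECT * FROM TEST_TABLE;")` yields the table name `t` together with the query text, while the earlier `ADD CACHE TABLE ...` form is rejected.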

marmbrus · 2014-09-13T18:45:48Z

Thanks for working on this! A few minor comments.

SparkQA · 2014-09-13T18:48:08Z

QA tests have started for PR 2381 at commit 6758f80.

This patch merges cleanly.

SparkQA · 2014-09-13T18:49:06Z

QA tests have finished for PR 2381 at commit 6758f80.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JavaSparkContext(val sc: SparkContext)
- class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
- class Dummy(object):
- case class CacheTableAsSelectCommand(tableName: String, plan: LogicalPlan) extends Command
- case class CacheTableAsSelectCommand(tableName: String,plan: LogicalPlan)(
- class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

* Fixed random typo
* Added in missing description for DecimalType

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes apache#2367 from nchammas/patch-1 and squashes the following commits:

- aa528be [Nicholas Chammas] doc fix for SQL DecimalType
- 3247ac1 [Nicholas Chammas] [SQL] [Docs] typo fixes

This is a follow up of apache#2352. Now we can finally remove the evil "MINOR HACK", which covered up the eldest bug in the history of Spark SQL (see details [here](apache#2352 (comment))).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#2377 from liancheng/remove-evil-minor-hack and squashes the following commits:

- 0869c78 [Cheng Lian] Removes the evil MINOR HACK

…rage

This is a major refactoring of the in-memory columnar storage implementation, aiming to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is how to adapt all compression schemes, esp. `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before whether or not in-memory compression is enabled, maybe I'll finish that part in another PR.

**UPDATE** This PR also took the chance to optimize `HiveTableScan` by

1. leveraging `SpecificMutableRow` to avoid boxing cost, and
1. building specific `Writable` unwrapper functions ahead of time to avoid per-row pattern matching and branching costs.

TODO

- [x] Benchmark
- [ ] ~~Eliminate boxing costs in `RunLengthEncoding`~~ (left to future PRs)
- [ ] ~~Eliminate boxing costs in `DictionaryEncoding` (seems not easy to do without specializing `DictionaryEncoding` for every supported column type)~~ (left to future PRs)

## Micro benchmark

The benchmark uses a 10 million line CSV table consisting of bytes, shorts, integers, longs, floats and doubles. It measures the time to build the in-memory version of this table and the time to scan the whole in-memory table. Benchmark code can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-hivetablescanbenchmark-scala). The script used to generate the input table can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-tablegen-scala).

Speedup:

- Hive table scanning + column buffer building: **18.74%**

  The original benchmark uses 1K as the in-memory batch size; when increased to 10K, it can be 28.32% faster.

- In-memory table scanning: **7.95%**

Before:

| Run     | Building | Scanning |
| ------- | -------- | -------- |
| 1       | 16472    | 525      |
| 2       | 16168    | 530      |
| 3       | 16386    | 529      |
| 4       | 16184    | 538      |
| 5       | 16209    | 521      |
| Average | 16283.8  | 528.6    |

After:

| Run     | Building | Scanning |
| ------- | -------- | -------- |
| 1       | 13124    | 458      |
| 2       | 13260    | 529      |
| 3       | 12981    | 463      |
| 4       | 13214    | 483      |
| 5       | 13583    | 500      |
| Average | 13232.4  | 486.6    |

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#2327 from liancheng/prevent-boxing/unboxing and squashes the following commits:

- 4419fe4 [Cheng Lian] Addressing comments
- e5d2cf2 [Cheng Lian] Bug fix: should call setNullAt when field value is null to avoid NPE
- 8b8552b [Cheng Lian] Only checks for partition batch pruning flag once
- 489f97b [Cheng Lian] Bug fix: TableReader.fillObject uses wrong ordinals
- 97bbc4e [Cheng Lian] Optimizes hive.TableReader by by providing specific Writable unwrappers a head of time
- 3dc1f94 [Cheng Lian] Minor changes to eliminate row object creation
- 5b39cb9 [Cheng Lian] Lowers log level of compression scheme details
- f2a7890 [Cheng Lian] Use SpecificMutableRow in InMemoryColumnarTableScan to avoid boxing
- 9cf30b0 [Cheng Lian] Added row based ColumnType.append/extract
- 456c366 [Cheng Lian] Made compression decoder row based
- edac3cd [Cheng Lian] Makes ColumnAccessor.extractSingle row based
- 8216936 [Cheng Lian] Removes boxing cost in IntDelta and LongDelta by providing specialized implementations
- b70d519 [Cheng Lian] Made some in-memory columnar storage interfaces row-based

Author: Michael Armbrust <michael@databricks.com>

Closes apache#2164 from marmbrus/shufflePartitions and squashes the following commits:

- 0da1e8c [Michael Armbrust] test hax
- ef2d985 [Michael Armbrust] more test hacks.
- 2dabae3 [Michael Armbrust] more test fixes
- 0bdbf21 [Michael Armbrust] Make parquet tests less order dependent
- b42eeab [Michael Armbrust] increase test parallelism
- 80453d5 [Michael Armbrust] Decrease partitions when testing

Reuse the Python worker to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts for each worker, to avoid sending repeated broadcasts. This can reduce the time for a dummy task from 22ms to 13ms (-40%). It can help to reduce the latency for Spark Streaming.

For a job with a broadcast (43M after compression):

```
b = sc.broadcast(set(range(30000000)))
print(sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count())
```

It finishes in 281s without the reused worker and in 65s with the reused worker (4 CPUs). After reusing the worker, it can save about 9 seconds for transferring and deserializing the broadcast for each task.

It's enabled by default and can be disabled with `spark.python.worker.reuse = false`.

Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2259 from davies/reuse-worker and squashes the following commits:

- f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
- 3939f20 [Davies Liu] fix bug in serializer in mllib
- cf1c55e [Davies Liu] address comments
- 3133a60 [Davies Liu] fix accumulator with reused worker
- 760ab1f [Davies Liu] do not reuse worker if there are any exceptions
- 7abb224 [Davies Liu] refactor: sychronized with itself
- ac3206e [Davies Liu] renaming
- 8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
- 6325fc1 [Davies Liu] bugfix: bid >= 0
- e0131a2 [Davies Liu] fix name of config
- 583716e [Davies Liu] only reuse completed and not interrupted worker
- ace2917 [Davies Liu] kill python worker after timeout
- 6123d0f [Davies Liu] track broadcasts for each worker
- 8d2f08c [Davies Liu] reuse python worker

Aggregate the number of bytes spilled to disk during aggregation or sorting, and show them in the Web UI.

![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)

This patch is blocked by SPARK-3465. (It includes a fix for that).

Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2336 from davies/metrics and squashes the following commits:

- e37df38 [Davies Liu] remove outdated comments
- 1245eb7 [Davies Liu] remove the temporary fix
- ebd2f43 [Davies Liu] Merge branch 'master' into metrics
- 7e4ad04 [Davies Liu] Merge branch 'master' into metrics
- fbe9029 [Davies Liu] show spilled bytes in Python in web ui


davies · 2014-09-15T04:25:29Z

It seems that there are some unrelated changes in it. Could you rebase onto master?

ravipesala · 2014-09-15T10:41:38Z

Because the rebase caused some confusion, I have created a new pull request, #2397, rebased on master, which also addresses the review comments raised here.

liancheng · 2014-09-17T22:12:04Z

Would you mind closing this PR?

ravipesala · 2014-09-18T10:18:26Z

OK. Closing this PR

This feature allows users to cache a table from a SELECT query. Example: `CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE`. Spark treats this type of SQL as a command and caches lazily, just as `SQLContext.cacheTable` and `CACHE TABLE <name>` do. It can be executed from both SQLContext and HiveContext.

Recreated the pull request after rebasing with master, and fixed all the comments raised in the previous pull requests #2381 and #2390.

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2397 from ravipesala/SPARK-2594 and squashes the following commits:

- a5f0beb [ravipesala] Simplified the code as per Admin comment.
- 8059cd2 [ravipesala] Changed the behaviour from eager caching to lazy caching.
- d6e469d [ravipesala] Code review comments by Admin are handled.
- c18aa38 [ravipesala] Merge remote-tracking branch 'remotes/ravipesala/Add-Cache-table-as' into SPARK-2594
- 394d5ca [ravipesala] Changed style
- fb1759b [ravipesala] Updated as per Admin comments
- 8c9993c [ravipesala] Changed the style
- d8b37b2 [ravipesala] Updated as per the comments by Admin
- bc0bffc [ravipesala] Merge remote-tracking branch 'ravipesala/Add-Cache-table-as' into Add-Cache-table-as
- e3265d0 [ravipesala] Updated the code as per the comments by Admin in pull request.
- 724b9db [ravipesala] Changed style
- aaf5b59 [ravipesala] Added comment
- dc33895 [ravipesala] Updated parser to support add cache table command
- b5276b2 [ravipesala] Updated parser to support add cache table command
- eebc0c1 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
- 6758f80 [ravipesala] Changed style
- 7459ce3 [ravipesala] Added comment
- 13c8e27 [ravipesala] Updated parser to support add cache table command
- 4e858d8 [ravipesala] Updated parser to support add cache table command
- b803fc8 [ravipesala] Add CACHE TABLE <name> AS SELECT ...


ravipesala added 5 commits September 11, 2014 15:53

Updated parser to support add cache table command

4e858d8

Updated parser to support add cache table command

13c8e27

Added comment

7459ce3

Changed style

6758f80

marmbrus reviewed Sep 13, 2014

nchammas and others added 11 commits September 13, 2014 12:34

Updated parser to support add cache table command

b5276b2

Updated parser to support add cache table command

dc33895

Added comment

aaf5b59

Changed style

724b9db

ravipesala added 2 commits September 15, 2014 03:16

Updated the code as per the comments by Admin in pull request.

e3265d0

Merge remote-tracking branch 'ravipesala/Add-Cache-table-as' into Add-Cache-table-as

bc0bffc

ravipesala mentioned this pull request Sep 14, 2014

[SPARK-2594][SQL] Add CACHE TABLE <name> AS SELECT ... (Updated as per review comments) #2390

Closed

ravipesala added 4 commits September 15, 2014 11:32

Updated as per the comments by Admin

d8b37b2

Changed the style

8c9993c

Updated as per Admin comments

fb1759b

Changed style

394d5ca

ravipesala mentioned this pull request Sep 15, 2014

[SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... #2397

Closed

ravipesala closed this Sep 18, 2014
