Partition level pruning 2 #4

CodingCat · 2017-11-21T21:28:49Z

In the current implementation of Spark, InMemoryTableExec read all data in a cached table, filter CachedBatch according to stats and pass data to the downstream operators. This implementation makes it inefficient to reside the whole table in memory to serve various queries against different partitions of the table, which occupies a certain portion of our users' scenarios.

The following is an example of such a use case:

store_sales is a 1TB-sized table in cloud storage, which is partitioned by 'location'. The first query, Q1, wants to output several metrics A, B, C for all stores in all locations. After that, a small team of 3 data scientists wants to do some causal analysis for the sales in different locations. To avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole table in memory in Q1.

With the current implementation, even any one of the data scientists is only interested in one out of three locations, the queries they submit to Spark cluster is still reading 1TB data completely.

The reason behind the extra reading operation is that we implement CachedBatch as

case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)

where the stats is a part of every CachedBatch, so we can only filter batches for output of InMemoryTableExec operator by reading all data in in-memory table as input. The extra reading would be even more unacceptable when some of the table's data is evicted to disks.

We propose to introduce a new type of block, metadata block, for the partitions of RDD representing data in the cached table. Every metadata block contains stats info for all columns in a partition and is saved to BlockManager when executing compute() method for the partition. To minimize the number of bytes to read,

More details can be found in design doc

fix compilation of tests fix tests revise the test fix test revise the test add missing file revise the test revise the test revise the test revise the test revise the test revise the test revise the test revise the test

test for remove metadata block fix the test fix the test fix the test

…ver)QueryTestSuite ### What changes were proposed in this pull request? This PR adds 2 changes regarding exception handling in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite` - fixes an expected output sorting issue in `ThriftServerQueryTestSuite` as if there is an exception then there is no need for sort - introduces common exception handling in those 2 suites with a new `handleExceptions` method ### Why are the changes needed? Currently `ThriftServerQueryTestSuite` passes on master, but it fails on one of my PRs (apache#23531) with this error (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111651/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/sql_3/): ``` org.scalatest.exceptions.TestFailedException: Expected " [Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit org.apache.spark.SparkException] ", but got " [org.apache.spark.SparkException Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit] " Result did not match for query #4 WITH RECURSIVE r(level) AS ( VALUES (0) UNION ALL SELECT level + 1 FROM r ) SELECT * FROM r ``` The unexpected reversed order of expected output (error message comes first, then the exception class) is due to this line: https://github.com/apache/spark/pull/26028/files#diff-b3ea3021602a88056e52bf83d8782de8L146. It should not sort the expected output if there was an error during execution. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes apache#26028 from peter-toth/SPARK-29359-better-exception-handling. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>

github-actions · 2020-06-17T00:08:29Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor. ### Why are the changes needed? Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario: 1. SaslTimeoutException 2. IOException 3. SaslTimeoutException 4. IOException Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test is added, courtesy of Mridul. Closes apache#39611 from tedyu/sasl-cnt. Authored-by: Ted Yu <yuzhihonggmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> Closes apache#39710 from akpatnam25/SPARK-42090-backport-3.2. Authored-by: Ted Yu <yuzhihong@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

CodingCat added 2 commits November 23, 2017 23:50

improve the doc for "spark.memory.offHeap.size"

6a34b15

fix

3a2be8c

CodingCat force-pushed the partition_level_pruning_2 branch from b4f51ed to accd549 Compare November 24, 2017 07:50

CodingCat added 27 commits November 23, 2017 23:51

add configuration for partition_metadata

eb535d6

framework of CachedColumnarRDD

f910b27

code framework

846b032

remove cachedcolumnarbatchRDD

f511b6f

temp

50f2612

'CachedColumnarRDD'

2b945c9

change types

89f0a98

fix compilation error

5aa1808

update

a5adc56

fix storage level

00b1642

fix getOrCompute

311fe5a

evaluate with partition metadata

6411b82

fix getOrCompute

41f6ad2

add logging

c50b743

add logging for skipped partition

78f774f

try to print stats

71456bd

add logging for skipped partition

97544a6

add logging for skipped partition

c131b2d

add logging for skipped partition

d588fb0

refactor the code

1ba1f80

fix compilation issue

62f358d

refactor the code

500d4fd

test

63c5897

fix compilation issue

9031eaf

add missing filtering

1692303

test

f9f3d20

test

da5f06f

CodingCat added 18 commits November 23, 2017 23:54

fix rebundant read

2c0b6cd

compact iterators

e1d8c43

update

c900808

add first test case

2caa7fc

fix compilation of tests fix tests revise the test fix test revise the test add missing file revise the test revise the test revise the test revise the test revise the test revise the test revise the test revise the test

test for remove metadata block

a9f1256

test for remove metadata block fix the test fix the test fix the test

generate correct results when data block is removed

b5d6094

try to avoid unnecessary tasks

3b51c9a

collect data in the fly

59b0684

fix the compilation issue

3e972ce

fix the compilation issue

35c9361

fix NPE

20b72a8

fix NPE

bfa357a

fix imcompatibility of partition

0fd6b68

fix the parent inherience mechanism

dc54be5

add missing filtering for cachedBatch

1afbf13

remove CachedColumnarIterator

accd549

fix filtering logic

a853ce6

fix the failed test

9d450ad

CodingCat force-pushed the master branch from d523ca5 to 0971900 Compare December 1, 2017 21:48

CodingCat force-pushed the master branch from 0971900 to 0fd33a7 Compare December 13, 2017 00:24

CodingCat force-pushed the master branch 2 times, most recently from 62a34c1 to 9b87ba8 Compare December 25, 2017 02:34

CodingCat force-pushed the master branch from 9b87ba8 to 7cbecf9 Compare January 2, 2018 03:59

CodingCat force-pushed the master branch from 7cbecf9 to 7572505 Compare July 13, 2018 02:09

github-actions bot added the Stale label Jun 17, 2020

github-actions bot closed this Jun 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Partition level pruning 2 #4

Partition level pruning 2 #4

Uh oh!

CodingCat commented Nov 21, 2017 •

edited

Loading

Uh oh!

github-actions bot commented Jun 17, 2020

Uh oh!

Uh oh!

Partition level pruning 2 #4

Partition level pruning 2 #4

Uh oh!

Conversation

CodingCat commented Nov 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jun 17, 2020

Uh oh!

Uh oh!

CodingCat commented Nov 21, 2017 •

edited

Loading