add configuration for partition_metadata #3

CodingCat · 2017-10-27T17:39:33Z

add configuration entry for partition level metadata
CachedRDD and CachedRDDPartition framework
Implement update of partition level metadata in InMemoryRelation
Implement Filter Evaluation in InMemoryTableScanExec
Implement CachedRDD compute method
add test
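
The items above describe keeping metadata per cached partition and evaluating filters against it during the in-memory table scan, so that partitions which cannot match are skipped. The following is only a rough sketch of that idea in plain Python, not the PR's Scala code. It assumes the metadata is a min/max pair for a single integer column and that partitions are non-empty; the PR does not state which statistics it stores. All names (`PartitionStats`, `build_stats`, `scan_greater_than`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical partition-level metadata: min and max of one column."""
    lower: int
    upper: int


def build_stats(partition: List[int]) -> PartitionStats:
    # Assumes a non-empty partition; computed once when the data is cached.
    return PartitionStats(min(partition), max(partition))


def might_contain_greater_than(stats: PartitionStats, threshold: int) -> bool:
    # A partition can hold a value > threshold only if its max exceeds it.
    return stats.upper > threshold


def scan_greater_than(
    partitions: List[List[int]],
    stats: List[PartitionStats],
    threshold: int,
) -> Tuple[List[int], List[int]]:
    """Return matching rows and the indices of partitions skipped via metadata."""
    rows: List[int] = []
    skipped: List[int] = []
    for index, (partition, meta) in enumerate(zip(partitions, stats)):
        if not might_contain_greater_than(meta, threshold):
            skipped.append(index)  # never read the partition's data
            continue
        rows.extend(value for value in partition if value > threshold)
    return rows, skipped


partitions = [[1, 2, 3], [10, 11, 12], [4, 20, 6]]
stats = [build_stats(p) for p in partitions]
rows, skipped = scan_greater_than(partitions, stats, 9)
print(rows, skipped)
```

In this sketch the first partition is never read, because its maximum is below the threshold; only the metadata is consulted for it.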

fix compilation of tests fix tests revise the test fix test revise the test add missing file revise the test revise the test revise the test revise the test revise the test revise the test revise the test revise the test

test for remove metadata block fix the test fix the test fix the test

### What changes were proposed in this pull request?

This PR proposes to make `PythonFunction` hold `Seq[Byte]` instead of `Array[Byte]`, so that the cache manager can compare the serialized function by value.

### Why are the changes needed?

Currently the cache manager doesn't use the cache for a `udf` if the `udf` is created again, even if the function is the same.

```py
>>> func = lambda x: x
>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()
```

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
   +- *(1) Range (0, 1, step=1, splits=12)
```

This is because `PythonFunction` holds `Array[Byte]`, and the `equals` method of an array returns true only when both arrays are the same instance.

### Does this PR introduce _any_ user-facing change?

Yes, if the user reuses the Python function for the UDF, the cache manager will detect the same function and use the cache for it.

### How was this patch tested?

I added a test case and tested manually.

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
InMemoryTableScan [<lambda>(id)#12]
   +- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(2) Project [pythonUDF0#5 AS <lambda>(id)#3]
            +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5]
               +- *(1) Range (0, 1, step=1, splits=12)
```

Closes apache#28774 from ueshin/issues/SPARK-31945/udf_cache.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
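
The equality problem described in the commit message above can be illustrated outside Spark. The sketch below is a Python analogy only, not Spark's implementation: `IdentityPayload` stands in for a JVM `Array[Byte]` (equal only to itself) and `ValuePayload` stands in for `Seq[Byte]` (equal when the contents match). Both class names, the `lookup` helper, and the placeholder byte string are hypothetical.

```python
from dataclasses import dataclass


class IdentityPayload:
    """Analogy for Array[Byte]: default equality and hash are identity-based."""

    def __init__(self, data: bytes) -> None:
        self.data = data


@dataclass(frozen=True)
class ValuePayload:
    """Analogy for Seq[Byte]: equality and hash are derived from the contents."""
    data: bytes


def lookup(cache: dict, key: object) -> str:
    # Returns the cached entry, or "miss" when no equal key is present.
    return cache.get(key, "miss")


serialized = b"serialized-function"  # placeholder for a pickled Python function

# Key compared by identity: a fresh wrapper around the same bytes is a miss.
identity_cache = {IdentityPayload(serialized): "cached plan"}
identity_result = lookup(identity_cache, IdentityPayload(serialized))

# Key compared by value: a fresh wrapper around the same bytes is a hit.
value_cache = {ValuePayload(serialized): "cached plan"}
value_result = lookup(value_cache, ValuePayload(serialized))

print(identity_result, value_result)
```

The second lookup succeeds because the key is compared by contents, which is the behavior the commit message attributes to switching the field type.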

github-actions · 2020-06-17T00:08:30Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

… without WindowExpression

### What changes were proposed in this pull request?

Add a WindowFunction check at `CheckAnalysis`.

### Why are the changes needed?

Provide a friendly error message.

**BEFORE**

```scala
scala> sql("select rank() from values(1)").show
java.lang.UnsupportedOperationException: Cannot generate code for expression: rank()
```

**AFTER**

```scala
scala> sql("select rank() from values(1)").show
org.apache.spark.sql.AnalysisException: Window function rank() requires an OVER clause.;;
Project [rank() AS RANK()#3]
+- LocalRelation [col1#2]
```

### Does this PR introduce _any_ user-facing change?

Yes, users will be given a better error message.

### How was this patch tested?

Pass the newly added UT.

Closes apache#28808 from ulysses-you/SPARK-31975.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
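
The kind of check described in the commit message above can be sketched as a walk over an expression tree that rejects a window function not enclosed in a window expression. This is a simplified Python illustration and does not reflect how Spark's `CheckAnalysis` is written. The `Expr` class, its flags, `AnalysisError`, and `check_window_functions` are all hypothetical; only the error text is taken from the output shown above.

```python
from dataclasses import dataclass, field
from typing import List


class AnalysisError(Exception):
    """Hypothetical stand-in for an analysis-time error."""


@dataclass
class Expr:
    """Hypothetical expression node with two flags relevant to the check."""
    name: str
    children: List["Expr"] = field(default_factory=list)
    is_window_function: bool = False
    is_window_expression: bool = False


def check_window_functions(expr: Expr, inside_window: bool = False) -> None:
    # Fail early with a readable message instead of failing later in codegen.
    if expr.is_window_function and not inside_window:
        raise AnalysisError(
            f"Window function {expr.name} requires an OVER clause."
        )
    for child in expr.children:
        check_window_functions(child, inside_window or expr.is_window_expression)


rank = Expr("rank()", is_window_function=True)
bare = Expr("project", children=[rank])
wrapped = Expr(
    "project",
    children=[Expr("over", children=[rank], is_window_expression=True)],
)

check_window_functions(wrapped)  # passes: rank() sits under a window expression

try:
    check_window_functions(bare)
    message = ""
except AnalysisError as err:
    message = str(err)

print(message)
```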

CodingCat added 4 commits October 27, 2017 12:50

improve the doc for "spark.memory.offHeap.size"

c3f1a9b

fix

6a2b3ca

add configuration for partition_metadata

6e37fa2

framework of CachedColumnarRDD

aa70660

CodingCat force-pushed the partition_level_pruning branch from 64d7bd8 to aa70660 Compare October 27, 2017 19:50

CodingCat force-pushed the master branch from 85e09f1 to d523ca5 Compare October 27, 2017 19:51

CodingCat added 24 commits October 27, 2017 15:53

code framework

d138082

remove cachedcolumnarbatchRDD

a72d779

fix style error

0fe35f8

temp

9e34243

'CachedColumnarRDD'

677ca81

change types

df1d796

fix compilation error

08fd085

update

d4fc2b7

fix storage level

97a63d6

fix getOrCompute

a24b7bb

evaluate with partition metadata

0e8e639

fix getOrCompute

b89d58b

add logging

3f2eae7

add logging for skipped partition

507c1a2

try to print stats

40d441c

add logging for skipped partition

520e5aa

add logging for skipped partition

885808f

add logging for skipped partition

37b5971

refactor the code

4dbfe37

fix compilation issue

6165838

refactor the code

05f2267

test

bcafe82

fix compilation issue

5b888d3

add missing filtering

977b93f

CodingCat added 6 commits November 9, 2017 14:16

test

9c9bcad

test

56a4307

fix redundant read

7936033

compact iterators

3b6bfa2

update

963ca0a

add first test case

d4f12b1

CodingCat force-pushed the partition_level_pruning branch 2 times, most recently from 9406c48 to 46d68db Compare November 11, 2017 04:26

CodingCat added 2 commits November 10, 2017 20:31

test for remove metadata block

46d68db

generate correct results when data block is removed

77cf789

CodingCat force-pushed the master branch from d523ca5 to 0971900 Compare December 1, 2017 21:48

CodingCat force-pushed the master branch from 0971900 to 0fd33a7 Compare December 13, 2017 00:24

CodingCat force-pushed the master branch 2 times, most recently from 62a34c1 to 9b87ba8 Compare December 25, 2017 02:34

CodingCat force-pushed the master branch from 9b87ba8 to 7cbecf9 Compare January 2, 2018 03:59

CodingCat force-pushed the master branch from 7cbecf9 to 7572505 Compare July 13, 2018 02:09

github-actions bot added the Stale label Jun 17, 2020

github-actions bot closed this Jun 18, 2020
