You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method
### What changes were proposed in this pull request?
This PR aims the followings.
- Refactor `TPCDSQueryBenchmark` to use main method to improve the usability.
- Reduce the number of iteration from 5 to 2 because it takes too long. (2 is okay because we have `Stdev` field now. If there is an irregular run, we can notice easily with that).
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.)
- AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used.
This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
### Why are the changes needed?
Although the generated TPCDS data is random, we can keep the record.
### Does this PR introduce any user-facing change?
No. (This is dev-only test benchmark).
### How was this patch tested?
Manually run the benchmark. Please note that you need to have TPCDS data.
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
```
Closes#26049 from dongjoon-hyun/SPARK-25668.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
0 commit comments