[SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method #26049

dongjoon-hyun · 2019-10-08T03:30:44Z

What changes were proposed in this pull request?

This PR aims to do the following:

- Refactor TPCDSQueryBenchmark to use a main method to improve usability.
- Reduce the number of iterations from 5 to 2 because the benchmark takes too long. (2 is okay because we now have the Stdev field; an irregular run is easy to notice with it.)
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for other scale factors, too.)
- AWS EC2 r3.xlarge with ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1) is used.
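To illustrate why a Stdev column makes two iterations workable, here is a minimal sketch in plain Python. It is not the code of Spark's benchmark harness: the function names, the 20% threshold, and the choice of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def summarize(times_ms):
    """Return (best, avg, stdev) for per-iteration wall-clock times in ms."""
    best = min(times_ms)
    avg = statistics.mean(times_ms)
    # Assumption: population standard deviation. The estimator used by the
    # real benchmark output is not stated in this PR.
    stdev = statistics.pstdev(times_ms)
    return best, avg, stdev


def looks_irregular(times_ms, rel_threshold=0.2):
    """Flag a case whose stdev exceeds rel_threshold of its average time.

    The 20% default is a hypothetical cut-off, not a Spark convention.
    """
    _, avg, stdev = summarize(times_ms)
    return avg > 0 and stdev / avg > rel_threshold


steady = [1000.0, 1040.0]  # two iterations that agree closely
noisy = [1000.0, 2200.0]   # one iteration hit some disturbance

for name, times in (("steady", steady), ("noisy", noisy)):
    print(name, summarize(times), looks_irregular(times))
```

Even with only two samples, the spread between them is visible, so a disturbed run stands out and can be re-run.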

This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.

# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ cd spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  # This needs Spark 2.4

Why are the changes needed?

Although the generated TPCDS data is random, we can still keep the result as a record.

Does this PR introduce any user-facing change?

No. (This is a dev-only test benchmark.)

How was this patch tested?

Manually ran the benchmark. Please note that you need to have TPCDS data.

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
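For running against other scale factors, the command above can be assembled programmatically. The following is a small sketch only: `benchmark_command` is a hypothetical helper, while the environment variable, the sbt task, the class name, and the `--data-location` flag are taken from the command in this PR.

```python
import os
import shlex

BENCHMARK_CLASS = "org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark"


def benchmark_command(data_location, generate_files=True):
    """Return (env, argv) for running TPCDSQueryBenchmark through sbt.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    env = dict(os.environ)
    if generate_files:
        # Ask the benchmark to write its result file.
        env["SPARK_GENERATE_BENCHMARK_FILES"] = "1"
    else:
        env.pop("SPARK_GENERATE_BENCHMARK_FILES", None)
    # sbt expects the whole runMain invocation as a single argument.
    task = "sql/test:runMain %s --data-location %s" % (BENCHMARK_CLASS, data_location)
    return env, ["build/sbt", task]


env, argv = benchmark_command("/data/tpcds/s1")
print(" ".join(shlex.quote(a) for a in argv))
# To actually run it from the Spark source root:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

Passing a different directory, for example one generated for another scale factor, only changes the `--data-location` value.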


dongjoon-hyun · 2019-10-08T03:32:01Z

Could you review this, @maropu and @wangyum ?

HyukjinKwon · 2019-10-08T04:16:07Z

I haven't tested it, but given the PR description, it looks reasonable to me.

dongjoon-hyun · 2019-10-08T04:30:07Z

Thank you for the review and approval, @HyukjinKwon. This is the last one needed to resolve the umbrella issue, https://issues.apache.org/jira/browse/SPARK-25475.

HyukjinKwon · 2019-10-08T04:33:22Z

Merged to master.

We don't run this in the PR builder anyway.

dongjoon-hyun · 2019-10-08T04:33:41Z

Thank you so much, @HyukjinKwon !

SparkQA · 2019-10-08T07:05:01Z

Test build #111869 has finished for PR 26049 at commit 47227ab.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

…thod

### What changes were proposed in this pull request?

This PR aims to do the following:

- Refactor `TPCDSQueryBenchmark` to use a main method to improve usability.
- Reduce the number of iterations from 5 to 2 because the benchmark takes too long. (2 is okay because we now have the `Stdev` field; an irregular run is easy to notice with it.)
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for other scale factors, too.)
- AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used.

This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.

```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ cd spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  # This needs Spark 2.4
```

### Why are the changes needed?

Although the generated TPCDS data is random, we can still keep the result as a record.

### Does this PR introduce any user-facing change?

No. (This is a dev-only test benchmark.)

### How was this patch tested?

Manually ran the benchmark. Please note that you need to have TPCDS data.

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
```

Closes apache#26049 from dongjoon-hyun/SPARK-25668.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

…enchmarks

### What changes were proposed in this pull request?

In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource instead. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in a benchmark like `ds.noop()`, which expands to `ds.write.format("noop").mode(Overwrite).save()`.

### Why are the changes needed?

To avoid the additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of the benchmarked operations.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Re-ran all modified benchmarks using Amazon EC2.

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |

- Run `TPCDSQueryBenchmark` using instructions from the PR #26049

```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ cd spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  # This needs Spark 2.4
```

- Other benchmarks were run by the script:

```
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

# Each entry is [sbt project/configuration, benchmark main class].
benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
    ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
    ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

# Make every benchmark write its result file.
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27078 from MaxGekk/noop-in-benchmarks.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
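The motivation for the `noop` sink in the commit message above can be illustrated with a plain-Python analogy. This is not Spark code: `query`, `collect`, and `noop` below are hypothetical stand-ins that only show the difference between materializing every row on the caller's side and evaluating every row while keeping none.

```python
from collections import deque


def query(n, stats):
    """Hypothetical stand-in for a query: lazily yields n rows."""
    for i in range(n):
        stats["produced"] += 1  # count rows that were actually evaluated
        yield (i, i * 2)


def collect(rows):
    """Analogue of .collect(): every row is kept by the caller."""
    return list(rows)


def noop(rows):
    """Analogue of a no-op sink: every row is evaluated, none is kept."""
    deque(rows, maxlen=0)  # drains the iterator without storing items


stats = {"produced": 0}
noop(query(1000, stats))
print(stats["produced"])
```

Both sinks force the full computation, but only `collect` pays for holding the result, which is the kind of extra cost the commit message wants to keep out of benchmark timings.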

dongjoon-hyun added 2 commits October 7, 2019 16:43

[SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method

04cad62

Add JDK8 result

47227ab

HyukjinKwon approved these changes Oct 8, 2019


HyukjinKwon closed this in cb50177 Oct 8, 2019

dongjoon-hyun deleted the SPARK-25668 branch October 8, 2019 04:38

MaxGekk mentioned this pull request Jan 6, 2020

[SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks #27078

Closed

dongjoon-hyun added the SQL label Feb 5, 2020

c21 mentioned this pull request Dec 3, 2021

[SPARK-37455][SQL] Replace hash with sort aggregate if child is already sorted #34702

Closed
