Commit f5118f8

[SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks with the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in a benchmark like `ds.noop()`, which expands to `ds.write.format("noop").mode(Overwrite).save()`.

### Why are the changes needed?
To avoid the additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Re-ran all modified benchmarks using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |

- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  # This needs `Spark 2.4`
```
- Other benchmarks were run by the script:
```
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
    ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
    ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27078 from MaxGekk/noop-in-benchmarks.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
1 parent 1f50a58 commit f5118f8
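For readers who work in PySpark rather than Scala, the `.noop()` helper described in the commit message can be approximated as below. This is a minimal sketch, not the code from this commit: the function name `noop` mirrors the Scala method, while the Python rendering and the `df` parameter are assumptions. It relies on the `noop` format being registered, which should be the case in Spark 3.0 and later.

```python
def noop(df):
    """Execute the full plan of `df` and discard every produced row.

    Hypothetical PySpark counterpart of the Scala `ds.noop()` helper:
    the rows are written to the "noop" sink, so nothing is converted to
    external types or pulled back to the driver.
    """
    # Same call chain the commit message gives for the Scala helper.
    df.write.format("noop").mode("overwrite").save()
```

Given a `SparkSession` named `spark` (assumed here), a benchmark body could call `noop(spark.range(1000).selectExpr("sum(id)"))` where it previously called `.collect()`.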

71 files changed: +5787 -3130 lines changed
external/avro/benchmarks/AvroReadBenchmark-jdk11-results.txt

Lines changed: 32 additions & 32 deletions
@@ -2,121 +2,121 @@
 SQL Single Numeric Column Scan
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 2995 3081 121 5.3 190.4 1.0X
+Sum 2689 2694 7 5.8 170.9 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 2865 2881 23 5.5 182.2 1.0X
+Sum 2741 2759 26 5.7 174.2 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 2919 2936 23 5.4 185.6 1.0X
+Sum 2736 2748 17 5.7 173.9 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 3148 3262 161 5.0 200.1 1.0X
+Sum 3305 3317 17 4.8 210.2 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 2651 2721 99 5.9 168.5 1.0X
+Sum 2904 2952 68 5.4 184.6 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 SQL Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum 2782 2854 103 5.7 176.9 1.0X
+Sum 3090 3093 4 5.1 196.5 1.0X


 ================================================================================================
 Int and String Scan
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Int and String Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of columns 4531 4583 73 2.3 432.1 1.0X
+Sum of columns 5351 5365 20 2.0 510.3 1.0X


 ================================================================================================
 Partitioned Table Scan
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Partitioned Table: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Data column 3084 3105 30 5.1 196.1 1.0X
-Partition column 3143 3164 30 5.0 199.8 1.0X
-Both columns 3272 3339 94 4.8 208.1 0.9X
+Data column 3278 3288 14 4.8 208.4 1.0X
+Partition column 3149 3193 62 5.0 200.2 1.0X
+Both columns 3198 3204 7 4.9 203.4 1.0X


 ================================================================================================
 Repeated String Scan
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Repeated String: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of string length 3249 3318 98 3.2 309.8 1.0X
+Sum of string length 3435 3438 5 3.1 327.6 1.0X


 ================================================================================================
 String with Nulls Scan
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 String with Nulls Scan (0.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of string length 5308 5335 38 2.0 506.2 1.0X
+Sum of string length 5634 5650 23 1.9 537.3 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 String with Nulls Scan (50.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of string length 4405 4429 33 2.4 420.1 1.0X
+Sum of string length 4725 4752 39 2.2 450.6 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 String with Nulls Scan (95.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of string length 3256 3309 75 3.2 310.5 1.0X
+Sum of string length 3550 3566 23 3.0 338.6 1.0X


 ================================================================================================
 Single Column Scan From Wide Columns
 ================================================================================================

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Single Column Scan from 100 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of single column 5230 5290 85 0.2 4987.4 1.0X
+Sum of single column 5271 5279 11 0.2 5027.0 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Single Column Scan from 200 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of single column 10206 10329 174 0.1 9733.1 1.0X
+Sum of single column 10393 10516 174 0.1 9911.3 1.0X

-OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
+OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Single Column Scan from 300 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Sum of single column 15333 15365 46 0.1 14622.3 1.0X
+Sum of single column 15330 15343 19 0.1 14619.6 1.0X

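When reading the old and new result rows in this file, it can help to express each pair of `Best Time(ms)` values as a relative change. The sketch below does that arithmetic for a few rows copied from the diff; the dictionary name `BEST_TIME_MS` and the helper `pct_change` are illustrative and not part of the commit.

```python
# Best Time(ms) before and after this commit, copied from the removed and
# added rows of AvroReadBenchmark-jdk11-results.txt in the diff.
BEST_TIME_MS = {
    "SQL Single TINYINT Column Scan": (2995, 2689),
    "SQL Single DOUBLE Column Scan": (2782, 3090),
    "Int and String Scan": (4531, 5351),
    "Single Column Scan from 300 columns": (15333, 15330),
}


def pct_change(before_ms, after_ms):
    """Relative change in best time, in percent (negative means faster)."""
    return (after_ms - before_ms) / before_ms * 100.0


for case, (before_ms, after_ms) in BEST_TIME_MS.items():
    change = pct_change(before_ms, after_ms)
    print(f"{case}: {before_ms} ms -> {after_ms} ms ({change:+.1f}%)")
```

These percentages should be read with care: the header lines in the diff show that the JVM build and the Linux kernel also differ between the old and new runs, so a change in best time cannot be attributed to the switch to `noop` alone.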
