Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Add microbenchmarks #671

Merged
merged 14 commits into from
Jul 18, 2024
Merged

chore: Add microbenchmarks #671

merged 14 commits into from
Jul 18, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jul 15, 2024

Which issue does this PR close?

N/A

Rationale for this change

When running the complex TPC-* queries, it is challenging to debug why Comet is slower in some cases.

This PR adds simpler queries that focus on an individual operator or expression, such as left outer join, or hash aggregate.

What changes are included in this PR?

New queries and a PySpark app for running them

How are these changes tested?

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.88%. Comparing base (b6d868c) to head (e2e2bd3).
Report is 6 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main     #671       +/-   ##
=============================================
+ Coverage     33.65%   53.88%   +20.23%     
+ Complexity      842      811       -31     
=============================================
  Files           109      106        -3     
  Lines         42573    10238    -32335     
  Branches       9334     1916     -7418     
=============================================
- Hits          14328     5517     -8811     
+ Misses        25297     3744    -21553     
+ Partials       2948      977     -1971     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andygrove andygrove marked this pull request as ready for review July 16, 2024 12:59
@andygrove
Copy link
Member Author

@parthchandra @kazuyukitanimura @huaxingao @viirya This is ready for review now

@kazuyukitanimura
Copy link
Contributor

Hmmm we have microbenchmarks at https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark
Wondering if we should follow the existing one.
Or is this PR for submitting the jobs to clusters

@parthchandra
Copy link
Contributor

Hmmm we have microbenchmarks at https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark Wondering if we should follow the existing one.

+1 to adding this in the existing benchmarks unless running this on a cluster is a pre-requisite. It's easier to run these queries in a code profiler if we run them locally.

Also, not necessary but could we keep the sql files in a subdirectory?

Very nice selection of queries btw.

@andygrove
Copy link
Member Author

Hmmm we have microbenchmarks at https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark

Wondering if we should follow the existing one.

Or is this PR for submitting the jobs to clusters

I wasn't aware that these benchmarks existed 🤦

I will study them tomorrow. Thanks for pointing this out.

@andygrove
Copy link
Member Author

I tried integrating into the existing suite but the queries are not running correctly and I am not sure why.

Example:

TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
add_many_decimals                                    15             21           7          0.0      Infinity       1.0X
add_many_decimals                                    13             16           3          0.0      Infinity       1.2X
add_many_decimals: Comet (Scan)                      13             15           3          0.0      Infinity       1.2X
add_many_decimals: Comet (Scan, Exec)                11             12           2          0.0      Infinity       1.4X

This query should take more than 20 seconds to run

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we have microbenchmarks already, this targets tpcds data so I think it is still valuable to have.

@andygrove
Copy link
Member Author

@parthchandra @kazuyukitanimura This is ready for another review. I can now run these new microbenchmarks from the existing test framework.

@parthchandra
Copy link
Contributor

I tried integrating into the existing suite but the queries are not running correctly and I am not sure why.

Example:

TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
add_many_decimals                                    15             21           7          0.0      Infinity       1.0X
add_many_decimals                                    13             16           3          0.0      Infinity       1.2X
add_many_decimals: Comet (Scan)                      13             15           3          0.0      Infinity       1.2X
add_many_decimals: Comet (Scan, Exec)                11             12           2          0.0      Infinity       1.4X

This query should take more than 20 seconds to run

You were using the same scale factor I presume?

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also commit the benchmark results (as we do for the other microbenchmarks)?
That way we can keep an eye open for any performance regressions.

-- specific language governing permissions and limitations
-- under the License.

-- This is testing the cost of a complex expression that will create many intermediate arrays in Comet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment looks not for this query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the query includes an expression adding many columns e.g. a+b+c+d... and we will compute this by recursively computing binary expressions and creating an intermediate array for each one, something like:

temp1 = a+b
temp2 = temp1+c
temp3 = temp2+d
...

The Spark version will generate and compile code that just performs a+b+c+d.. without creating intermediate output.

We could potentially explore using jit or introduce specialized expressions that re-use intermediate buffers in some cases.

@andygrove
Copy link
Member Author

For reference, here are the results from running with sf=100gb data on my Linux workstation.

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
add_many_decimals                                 20502          20648         208         14.0          71.2       1.0X
add_many_decimals                                 20498          20544          65         14.1          71.2       1.0X
add_many_decimals: Comet (Scan)                   28143          28161          26         10.2          97.7       0.7X
add_many_decimals: Comet (Scan, Exec)             19323          19497         246         14.9          67.1       1.1X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
add_many_integers                                  2083           2116          45        138.2           7.2       1.0X
add_many_integers                                  2146           2177          43        134.2           7.5       1.0X
add_many_integers: Comet (Scan)                    1879           1929          71        153.3           6.5       1.1X
add_many_integers: Comet (Scan, Exec)              2102           2108           8        137.0           7.3       1.0X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
agg_high_cardinality                               1280           1306          37         56.3          17.8       1.0X
agg_high_cardinality                               1135           1141           8         63.4          15.8       1.1X
agg_high_cardinality: Comet (Scan)                 1612           1656          64         44.7          22.4       0.8X
agg_high_cardinality: Comet (Scan, Exec)            770            822          74         93.5          10.7       1.7X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
agg_low_cardinality                                 255            276          16        282.2           3.5       1.0X
agg_low_cardinality                                 255            274          18        282.7           3.5       1.0X
agg_low_cardinality: Comet (Scan)                   738            745           7         97.5          10.3       0.3X
agg_low_cardinality: Comet (Scan, Exec)             191            213          16        377.4           2.6       1.3X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------
agg_sum_decimals_no_grouping                              10552          10583          44         27.3          36.6       1.0X
agg_sum_decimals_no_grouping                              10406          10450          61         27.7          36.1       1.0X
agg_sum_decimals_no_grouping: Comet (Scan)                46013          46278         375          6.3         159.8       0.2X
agg_sum_decimals_no_grouping: Comet (Scan, Exec)          13840          13956         164         20.8          48.1       0.8X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------
agg_sum_integers_no_grouping                               2220           2251          45        129.8           7.7       1.0X
agg_sum_integers_no_grouping                               2167           2245         110        132.9           7.5       1.0X
agg_sum_integers_no_grouping: Comet (Scan)                 1984           2009          37        145.2           6.9       1.1X
agg_sum_integers_no_grouping: Comet (Scan, Exec)           2090           2108          26        137.8           7.3       1.1X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
case_when_column_or_null                                963            981          21        298.9           3.3       1.0X
case_when_column_or_null                                941            944           6        306.1           3.3       1.0X
case_when_column_or_null: Comet (Scan)                 2697           2699           4        106.8           9.4       0.4X
case_when_column_or_null: Comet (Scan, Exec)           1235           1242           9        233.2           4.3       0.8X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
case_when_scalar                                    201            213           7        359.1           2.8       1.0X
case_when_scalar                                    200            214          10        360.7           2.8       1.0X
case_when_scalar: Comet (Scan)                     1856           1889          46         38.8          25.8       0.1X
case_when_scalar: Comet (Scan, Exec)                390            409          12        184.6           5.4       0.5X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------
filter_highly_selective                                197            212          10        366.1           2.7       1.0X
filter_highly_selective                                193            205           5        372.5           2.7       1.0X
filter_highly_selective: Comet (Scan)                 1146           1148           3         62.8          15.9       0.2X
filter_highly_selective: Comet (Scan, Exec)            112            127          11        645.4           1.5       1.8X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
filter_less_selective                                198            207           6        363.4           2.8       1.0X
filter_less_selective                                196            211          11        367.2           2.7       1.0X
filter_less_selective: Comet (Scan)                  871            879          10         82.7          12.1       0.2X
filter_less_selective: Comet (Scan, Exec)            127            144          11        565.6           1.8       1.6X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
if_column_or_null                                   615            636          32        468.5           2.1       1.0X
if_column_or_null                                   599            620          26        481.3           2.1       1.0X
if_column_or_null: Comet (Scan)                     907            921          18        317.8           3.1       0.7X
if_column_or_null: Comet (Scan, Exec)              1022           1038          23        281.8           3.5       0.6X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_anti                                          4460           4530          98         16.1          62.0       1.0X
join_anti                                          3766           3907         200         19.1          52.3       1.2X
join_anti: Comet (Scan)                            4025           4112         123         17.9          55.9       1.1X
join_anti: Comet (Scan, Exec)                      4035           4085          70         17.8          56.0       1.1X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_condition                                     1604           1650          65        338.8           3.0       1.0X
join_condition                                     1614           1621          10        336.7           3.0       1.0X
join_condition: Comet (Scan)                       1466           1484          26        370.6           2.7       1.1X
join_condition: Comet (Scan, Exec)                 1688           1694          10        321.9           3.1       1.0X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
join_exploding_output                               1476           1512          51        368.1           2.7       1.0X
join_exploding_output                               1461           1487          37        371.9           2.7       1.0X
join_exploding_output: Comet (Scan)                 1346           1381          50        403.7           2.5       1.1X
join_exploding_output: Comet (Scan, Exec)           1459           1468          13        372.5           2.7       1.0X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_inner                                          213            221           5       1350.9           0.7       1.0X
join_inner                                          209            217           6       1378.2           0.7       1.0X
join_inner: Comet (Scan)                            204            208           3       1411.9           0.7       1.0X
join_inner: Comet (Scan, Exec)                      314            318           4        917.6           1.1       0.7X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_left_outer                                   88655          90623        2784          3.6         279.9       1.0X
join_left_outer                                   87672          88273         850          3.6         276.7       1.0X
join_left_outer: Comet (Scan)                     87751          89203        2054          3.6         277.0       1.0X
join_left_outer: Comet (Scan, Exec)               89906          92005        2969          3.5         283.8       1.0X

OpenJDK 64-Bit Server VM 11.0.23+9-post-Ubuntu-1ubuntu122.04.1 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_semi                                          9258           9871         867          7.8         128.6       1.0X
join_semi                                          9176           9503         462          7.8         127.4       1.0X
join_semi: Comet (Scan)                            9519           9701         257          7.6         132.2       1.0X
join_semi: Comet (Scan, Exec)                      8726           9192         659          8.3         121.2       1.1X

@parthchandra
Copy link
Contributor

Can we also commit the benchmark results (as we do for the other microbenchmarks)? That way we can keep an eye open for any performance regressions.

I was under the impression that we were committing the output of the benchmarking run as well. Looks like I'm mistaken.
FWIW, Spark has a Github Actions workflow to run benchmarks (documented here: https://spark.apache.org/developer-tools.html) that allows contributors to run benchmarks from their own github fork. We could do something similar down the road.

@parthchandra
Copy link
Contributor

TPCDS Micro Benchmarks: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative

add_many_decimals 20502 20648 208 14.0 71.2 1.0X
add_many_decimals 20498 20544 65 14.1 71.2 1.0X
add_many_decimals: Comet (Scan) 28143 28161 26 10.2 97.7 0.7X
add_many_decimals: Comet (Scan, Exec) 19323 19497 246 14.9 67.1 1.1X

Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader.
From a profiling run I saw that a potential bottleneck was BosonVector.getDecimal which has an expensive creation of a BigInteger followed by an expensive creation of a BigDecimal.
However, this path would be hit only for precision > 18 or if spark.comet.use.decimal128 was set to true (it is false by default).
Also, I'm not sure if there is a way to eliminate this though.

@andygrove
Copy link
Member Author

Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader.
From a profiling run I saw that a potential bottleneck was BosonVector.getDecimal which has an expensive creation of a BigInteger followed by an expensive creation of a BigDecimal.
However, this path would be hit only for precision > 18 or if spark.comet.use.decimal128 was set to true (it is false by default).
Also, I'm not sure if there is a way to eliminate this though.

I was also trying to understand why this result was slower. I have created #679 based on your comment so that we can use that issue to explore possible optimizations

@andygrove
Copy link
Member Author

However, this path would be hit only for precision > 18 or if spark.comet.use.decimal128 was set to true (it is false by default).

The fields have precision 7 and I am not aware of spark.comet.use.decimal128 being set.

| ss_wholesale_cost     | Decimal128(7, 2) | YES         |
| ss_list_price         | Decimal128(7, 2) | YES         |
| ss_sales_price        | Decimal128(7, 2) | YES         |
| ss_ext_discount_amt   | Decimal128(7, 2) | YES         |
| ss_ext_sales_price    | Decimal128(7, 2) | YES         |
| ss_ext_wholesale_cost | Decimal128(7, 2) | YES         |
| ss_ext_list_price     | Decimal128(7, 2) | YES         |
| ss_ext_tax            | Decimal128(7, 2) | YES         |
| ss_coupon_amt         | Decimal128(7, 2) | YES         |
| ss_net_paid           | Decimal128(7, 2) | YES         |
| ss_net_paid_inc_tax   | Decimal128(7, 2) | YES         |
| ss_net_profit         | Decimal128(7, 2) | YES         |

@andygrove andygrove merged commit 896096a into apache:main Jul 18, 2024
74 checks passed
@andygrove andygrove deleted the microbenchmarks branch July 18, 2024 00:44
@viirya
Copy link
Member

viirya commented Jul 18, 2024

TPCDS Micro Benchmarks: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative

add_many_decimals 20502 20648 208 14.0 71.2 1.0X
add_many_decimals 20498 20544 65 14.1 71.2 1.0X
add_many_decimals: Comet (Scan) 28143 28161 26 10.2 97.7 0.7X
add_many_decimals: Comet (Scan, Exec) 19323 19497 246 14.9 67.1 1.1X

Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader. From a profiling run I saw that a potential bottleneck was BosonVector.getDecimal which has an expensive creation of a BigInteger followed by an expensive creation of a BigDecimal. However, this path would be hit only for precision > 18 or if spark.comet.use.decimal128 was set to true (it is false by default). Also, I'm not sure if there is a way to eliminate this though.

For use_decimal128 is true, maybe we can skip a step in getDecimal: #682

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants