chore: Add microbenchmarks #671
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@             Coverage Diff             @@
##              main     #671      +/-  ##
=============================================
+ Coverage     33.65%   53.88%   +20.23%
+ Complexity      842      811       -31
=============================================
  Files           109      106        -3
  Lines         42573    10238    -32335
  Branches       9334     1916     -7418
=============================================
- Hits          14328     5517     -8811
+ Misses        25297     3744    -21553
+ Partials       2948      977     -1971

☔ View full report in Codecov by Sentry.
@parthchandra @kazuyukitanimura @huaxingao @viirya This is ready for review now
Hmmm we have microbenchmarks at https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark
+1 to adding this to the existing benchmarks, unless running this on a cluster is a prerequisite. It's easier to run these queries in a code profiler if we run them locally. Also, not necessary, but could we keep the SQL files in a subdirectory? Very nice selection of queries, btw.
I wasn't aware that these benchmarks existed 🤦 I will study them tomorrow. Thanks for pointing this out.
I tried integrating this into the existing suite, but the queries are not running correctly and I am not sure why. Example:
This query should take more than 20 seconds to run.
Although we have microbenchmarks already, this targets TPC-DS data, so I think it is still valuable to have.
@parthchandra @kazuyukitanimura This is ready for another review. I can now run these new microbenchmarks from the existing test framework.
You were using the same scale factor, I presume?
Can we also commit the benchmark results (as we do for the other microbenchmarks)?
That way we can keep an eye open for any performance regressions.
-- This is testing the cost of a complex expression that will create many intermediate arrays in Comet |
This comment doesn't seem to match this query?
The query includes an expression adding many columns, e.g. a+b+c+d..., and we compute this by recursively evaluating binary expressions, creating an intermediate array for each one, something like:

temp1 = a + b
temp2 = temp1 + c
temp3 = temp2 + d
...

The Spark version will generate and compile code that just performs a+b+c+d... without creating any intermediate output. We could potentially explore using JIT compilation or introduce specialized expressions that reuse intermediate buffers in some cases.
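To make the difference concrete, here is a toy sketch using plain Python lists as a stand-in for Arrow arrays (illustrative only, not Comet or DataFusion code), contrasting per-step intermediate allocation with a fused single-pass evaluation:

```python
# Toy sketch: plain Python lists stand in for Arrow arrays. Illustrative only.
cols = {name: list(range(1000)) for name in "abcd"}

def eval_pairwise(columns):
    """Evaluate a+b+c+d as a chain of binary adds, allocating a new
    intermediate list per step (temp1 = a+b, temp2 = temp1+c, ...)."""
    result = columns[0]
    for col in columns[1:]:
        result = [x + y for x, y in zip(result, col)]  # new allocation each step
    return result

def eval_fused(columns):
    """Evaluate a+b+c+d in a single pass into one output buffer,
    analogous to what Spark's generated code does for the whole sum."""
    out = [0] * len(columns[0])
    for i in range(len(out)):
        out[i] = sum(col[i] for col in columns)
    return out

assert eval_pairwise(list(cols.values())) == eval_fused(list(cols.values()))
```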
For reference, here are the results from running with sf=100gb data on my Linux workstation.
I was under the impression that we were committing the output of the benchmarking run as well. Looks like I'm mistaken.
Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader.
I was also trying to understand why this result was slower. I have created #679 based on your comment so that we can use that issue to explore possible optimizations.
The fields have precision 7 and I am not aware of
spark/src/test/scala/org/apache/spark/sql/benchmark/CometTPCDSMicroBenchmark.scala
When use_decimal128 is true, maybe we can skip a step in
Which issue does this PR close?
N/A
Rationale for this change
When running the complex TPC-* queries, it is challenging to debug why Comet is slower in some cases.
This PR adds simpler queries that each focus on an individual operator or expression, such as a left outer join or a hash aggregate.
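For illustration, a "single operator" query could look like the following. This is a sketch only: the table and column names are standard TPC-DS, but these are not the actual query files added by this PR.

```python
# Hypothetical microbenchmark queries, each exercising one operator.
# Illustrative sketches only, not the queries shipped in this PR.

# Isolates a single left outer join between two TPC-DS tables.
join_query = """
SELECT ss_item_sk, c_customer_id
FROM store_sales
LEFT OUTER JOIN customer ON ss_customer_sk = c_customer_sk
"""

# Isolates a single hash aggregate.
agg_query = """
SELECT ss_store_sk, SUM(ss_net_paid) AS total_paid
FROM store_sales
GROUP BY ss_store_sk
"""
```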
What changes are included in this PR?
New queries and a PySpark app for running them
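As a rough idea of what such a runner involves, here is a minimal sketch; the paths, table list, and timing approach are assumptions, not the app included in this PR.

```python
# Minimal sketch of a PySpark runner: register TPC-DS parquet tables as temp
# views, then time each .sql file in a queries directory. The paths, table
# list, and output sink here are assumptions, not the PR's actual app.
import glob
import os
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("comet-microbenchmarks").getOrCreate()

data_path = "/path/to/tpcds-parquet"  # assumed location of TPC-DS parquet data
for table in ["store_sales", "customer", "date_dim", "item"]:
    spark.read.parquet(os.path.join(data_path, table)).createOrReplaceTempView(table)

for sql_file in sorted(glob.glob("queries/*.sql")):  # assumed query directory
    with open(sql_file) as f:
        query = f.read()
    start = time.time()
    # The noop sink forces full execution without writing results anywhere.
    spark.sql(query).write.format("noop").mode("overwrite").save()
    print(f"{os.path.basename(sql_file)}: {time.time() - start:.2f}s")
```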
How are these changes tested?