chore: Prepare 0.2.0 release (apache#866)

* update version in Dockerfile
* use official DataFusion release
* Generate 0.2.0 changelog
* regenerate after squashing commits
* enable spark.comet.exec.enabled by default
* update benchmarking configs
* add new benchmark results
* delete old benchmark results
* add chart
* Update docs/source/user-guide/configs.md

Co-authored-by: Oleks V <comphead@users.noreply.github.com>

* Update common/src/main/scala/org/apache/comet/CometConf.scala

Co-authored-by: Oleks V <comphead@users.noreply.github.com>

* tpc-ds results
* fix links

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
comphead · Aug 25, 2024 · 03a248a
1 parent cff7697
Showing 26 changed files with 1,698 additions and 328 deletions.
diff --git a/README.md b/README.md
@@ -44,30 +44,32 @@ The following chart shows the time it takes to run the 22 TPC-H queries against
 using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
 for details of the environment used for these benchmarks.
 
-When using Comet, the overall run time is reduced from 649 seconds to 433 seconds, a 1.5x speedup, with some queries
-showing a 2x-3x speedup.
+When using Comet, the overall run time is reduced from 616 seconds to 379 seconds, a 1.62x speedup, with query 1
+running more than 7x faster than Spark.
 
-Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.9x 
+Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.6x 
 speedup compared to Spark.
 
 Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup 
 for a broader set of queries.
 
-![](docs/source/_static/images/benchmark-results/2024-07-19/tpch_allqueries.png)
+![](docs/source/_static/images/benchmark-results/2024-08-23/tpch_allqueries.png)
 
 Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query.
 
-![](docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_compare.png)
+![](docs/source/_static/images/benchmark-results/2024-08-23/tpch_queries_compare.png)
 
 The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization
 is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future.
 
-![](docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_speedup.png)
+![](docs/source/_static/images/benchmark-results/2024-08-23/tpch_queries_speedup_rel.png)
 
 These benchmarks can be reproduced in any environment using the documentation in the 
 [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage 
 you to run your own benchmarks.
 
+Results for our benchmark derived from TPC-DS are available in the [benchmarking guide](https://datafusion.apache.org/comet/contributor-guide/benchmark-results/tpc-ds.html).
+
 ## Use Commodity Hardware
 
 Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
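
The headline numbers in the README hunk above can be sanity-checked with a few lines of arithmetic. This is only a sketch using the rounded totals quoted in the diff (616 s and 379 s, previously 649 s and 433 s); the `speedup` helper is a hypothetical name, and the published ratios were presumably computed from unrounded timings, so the last digit may differ slightly.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Rounded totals for the 22 TPC-H queries, as quoted in the README diff.
new_ratio = speedup(616, 379)  # 2024-08-23 results (Spark vs. Comet)
old_ratio = speedup(649, 433)  # 2024-07-19 results (Spark vs. Comet)

# Share of the Spark run time that Comet removes in the new results.
time_saved_pct = 100 * (616 - 379) / 616

print(f"new speedup: {new_ratio:.3f}x")
print(f"old speedup: {old_ratio:.3f}x")
print(f"run time reduced by {time_saved_pct:.1f}%")
```

With these rounded inputs the new ratio comes out at roughly 1.625, in line with the 1.62x stated in the README, and the old ratio is just under the 1.5x stated previously.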

diff --git a/common/src/main/scala/org/apache/comet/CometConf.scala b/common/src/main/scala/org/apache/comet/CometConf.scala
@@ -99,9 +99,9 @@ object CometConf extends ShimCometConf {
         "native space. Note: each operator is associated with a separate config in the " +
         "format of 'spark.comet.exec.<operator_name>.enabled' at the moment, and both the " +
         "config and this need to be turned on, in order for the operator to be executed in " +
-        "native. By default, this config is false.")
+        "native. By default, this config is true.")
     .booleanConf
-    .createWithDefault(false)
+    .createWithDefault(true)
 
   val COMET_EXEC_PROJECT_ENABLED: ConfigEntry[Boolean] =
     createExecEnabledConfig("project", defaultValue = true)
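
The doc string in the hunk above says an operator runs natively only when both `spark.comet.exec.enabled` and the operator's own `spark.comet.exec.<operator_name>.enabled` flag are on. The sketch below models that rule with a plain dictionary. `operator_enabled` is a hypothetical helper, not a Comet API, and the per-operator default of `True` is an assumption based only on the `project` entry visible in the diff.

```python
GLOBAL_KEY = "spark.comet.exec.enabled"


def operator_enabled(conf: dict, operator: str,
                     global_default: bool = True,
                     operator_default: bool = True) -> bool:
    """Model the rule: native execution needs the global AND per-operator flag.

    `global_default=True` mirrors the new default in this commit;
    `operator_default=True` is an assumption (only `project` is shown).
    """
    operator_key = f"spark.comet.exec.{operator}.enabled"
    global_on = conf.get(GLOBAL_KEY, global_default)
    operator_on = conf.get(operator_key, operator_default)
    return bool(global_on and operator_on)


# With no explicit settings, the new defaults allow native execution.
print(operator_enabled({}, "project"))

# Restoring the pre-0.2.0 behaviour by turning the global flag off.
print(operator_enabled({GLOBAL_KEY: False}, "project"))

# Disabling a single operator while leaving the global flag on.
print(operator_enabled({"spark.comet.exec.project.enabled": False}, "project"))
```

One practical consequence of the default flip: deployments that relied on the old default of `false` would need to set `spark.comet.exec.enabled=false` explicitly to keep the previous behaviour.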

diff --git a/dev/changelog/0.2.0.md b/dev/changelog/0.2.0.md
@@ -0,0 +1,146 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataFusion Comet 0.2.0 Changelog
+
+This release consists of 87 commits from 14 contributors. See credits at the end of this changelog for more information.
+
+**Fixed bugs:**
+
+- fix: dictionary decimal vector optimization [#705](https://github.com/apache/datafusion-comet/pull/705) (kazuyukitanimura)
+- fix: Unsupported window expression should fall back to Spark [#710](https://github.com/apache/datafusion-comet/pull/710) (viirya)
+- fix: ReusedExchangeExec can be child operator of CometBroadcastExchangeExec [#713](https://github.com/apache/datafusion-comet/pull/713) (viirya)
+- fix: Fallback to Spark for window expression with range frame [#719](https://github.com/apache/datafusion-comet/pull/719) (viirya)
+- fix: Remove `skip.surefire.tests` mvn property [#739](https://github.com/apache/datafusion-comet/pull/739) (wForget)
+- fix: subquery execution under CometTakeOrderedAndProjectExec should not fail [#748](https://github.com/apache/datafusion-comet/pull/748) (viirya)
+- fix: skip negative scale checks for creating decimals [#723](https://github.com/apache/datafusion-comet/pull/723) (kazuyukitanimura)
+- fix: Fallback to Spark for unsupported partitioning [#759](https://github.com/apache/datafusion-comet/pull/759) (viirya)
+- fix: Unsupported types for SinglePartition should fallback to Spark [#765](https://github.com/apache/datafusion-comet/pull/765) (viirya)
+- fix: unwrap dictionaries in CreateNamedStruct [#754](https://github.com/apache/datafusion-comet/pull/754) (andygrove)
+- fix: Fallback to Spark for unsupported input besides ordering [#768](https://github.com/apache/datafusion-comet/pull/768) (viirya)
+- fix: Native window operator should be CometUnaryExec [#774](https://github.com/apache/datafusion-comet/pull/774) (viirya)
+- fix: Fallback to Spark when shuffling on struct with duplicate field name [#776](https://github.com/apache/datafusion-comet/pull/776) (viirya)
+- fix: withInfo was overwriting information in some cases [#780](https://github.com/apache/datafusion-comet/pull/780) (andygrove)
+- fix: Improve support for nested structs [#800](https://github.com/apache/datafusion-comet/pull/800) (eejbyfeldt)
+- fix: Sort on single struct should fallback to Spark [#811](https://github.com/apache/datafusion-comet/pull/811) (viirya)
+- fix: Check sort order of SortExec instead of child output [#821](https://github.com/apache/datafusion-comet/pull/821) (viirya)
+- fix: Fix panic in `avg` aggregate and disable `stddev` by default [#819](https://github.com/apache/datafusion-comet/pull/819) (andygrove)
+- fix: Supported nested types in HashJoin [#735](https://github.com/apache/datafusion-comet/pull/735) (eejbyfeldt)
+
+**Performance related:**
+
+- perf: Improve performance of CASE .. WHEN expressions [#703](https://github.com/apache/datafusion-comet/pull/703) (andygrove)
+- perf: Optimize IfExpr by delegating to CaseExpr [#681](https://github.com/apache/datafusion-comet/pull/681) (andygrove)
+- fix: optimize isNullAt [#732](https://github.com/apache/datafusion-comet/pull/732) (kazuyukitanimura)
+- perf: decimal decode improvements [#727](https://github.com/apache/datafusion-comet/pull/727) (parthchandra)
+- fix: Remove casting on decimals with a small precision to decimal256 [#741](https://github.com/apache/datafusion-comet/pull/741) (kazuyukitanimura)
+- fix: optimize some bit functions [#718](https://github.com/apache/datafusion-comet/pull/718) (kazuyukitanimura)
+- fix: Optimize getDecimal for small precision [#758](https://github.com/apache/datafusion-comet/pull/758) (kazuyukitanimura)
+- perf: add metrics to CopyExec and ScanExec [#778](https://github.com/apache/datafusion-comet/pull/778) (andygrove)
+- fix: Optimize decimal creation macros [#764](https://github.com/apache/datafusion-comet/pull/764) (kazuyukitanimura)
+- perf: Improve count aggregate performance [#784](https://github.com/apache/datafusion-comet/pull/784) (andygrove)
+- fix: Optimize read_side_padding [#772](https://github.com/apache/datafusion-comet/pull/772) (kazuyukitanimura)
+- perf: Remove some redundant copying of batches [#816](https://github.com/apache/datafusion-comet/pull/816) (andygrove)
+- perf: Remove redundant copying of batches after FilterExec [#835](https://github.com/apache/datafusion-comet/pull/835) (andygrove)
+- fix: Optimize CheckOverflow [#852](https://github.com/apache/datafusion-comet/pull/852) (kazuyukitanimura)
+- perf: Add benchmarks for Spark Scan + Comet Exec [#863](https://github.com/apache/datafusion-comet/pull/863) (andygrove)
+
+**Implemented enhancements:**
+
+- feat: Add support for time-zone, 3 & 5 digit years: Cast from string to timestamp. [#704](https://github.com/apache/datafusion-comet/pull/704) (akhilss99)
+- feat: Support count AggregateUDF for window function [#736](https://github.com/apache/datafusion-comet/pull/736) (huaxingao)
+- feat: Implement basic version of RLIKE [#734](https://github.com/apache/datafusion-comet/pull/734) (andygrove)
+- feat: show executed native plan with metrics when in debug mode [#746](https://github.com/apache/datafusion-comet/pull/746) (andygrove)
+- feat: Add GetStructField expression [#731](https://github.com/apache/datafusion-comet/pull/731) (Kimahriman)
+- feat: Add config to enable native upper and lower string conversion [#767](https://github.com/apache/datafusion-comet/pull/767) (andygrove)
+- feat: Improve native explain [#795](https://github.com/apache/datafusion-comet/pull/795) (andygrove)
+- feat: Add support for null literal with struct type [#797](https://github.com/apache/datafusion-comet/pull/797) (eejbyfeldt)
+- feat: Optimize CreateNamedStruct preserve dictionaries [#789](https://github.com/apache/datafusion-comet/pull/789) (eejbyfeldt)
+- feat: `CreateArray` support [#793](https://github.com/apache/datafusion-comet/pull/793) (Kimahriman)
+- feat: Add native thread configs [#828](https://github.com/apache/datafusion-comet/pull/828) (viirya)
+- feat: Add specific configs for converting Spark Parquet and JSON data to Arrow [#832](https://github.com/apache/datafusion-comet/pull/832) (andygrove)
+- feat: Support sum in window function [#802](https://github.com/apache/datafusion-comet/pull/802) (huaxingao)
+- feat: Simplify configs for enabling/disabling operators [#855](https://github.com/apache/datafusion-comet/pull/855) (andygrove)
+- feat: Enable `clippy::clone_on_ref_ptr` on `proto` and `spark_exprs` crates [#859](https://github.com/apache/datafusion-comet/pull/859) (comphead)
+- feat: Enable `clippy::clone_on_ref_ptr` on `core` crate [#860](https://github.com/apache/datafusion-comet/pull/860) (comphead)
+- feat: Use CometPlugin as main entrypoint [#853](https://github.com/apache/datafusion-comet/pull/853) (andygrove)
+
+**Documentation updates:**
+
+- doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc [#738](https://github.com/apache/datafusion-comet/pull/738) (wForget)
+- docs: Add explicit configs for enabling operators [#801](https://github.com/apache/datafusion-comet/pull/801) (andygrove)
+- doc: Document CometPlugin to start Comet in cluster mode [#836](https://github.com/apache/datafusion-comet/pull/836) (comphead)
+
+**Other:**
+
+- chore: Make rust clippy happy [#701](https://github.com/apache/datafusion-comet/pull/701) (Xuanwo)
+- chore: Update version to 0.2.0 and add 0.1.0 changelog [#696](https://github.com/apache/datafusion-comet/pull/696) (andygrove)
+- chore: Use rust-toolchain.toml for better toolchain support [#699](https://github.com/apache/datafusion-comet/pull/699) (Xuanwo)
+- chore(native): Make sure all targets in workspace been covered by clippy [#702](https://github.com/apache/datafusion-comet/pull/702) (Xuanwo)
+- Apache DataFusion Comet Logo [#697](https://github.com/apache/datafusion-comet/pull/697) (aocsa)
+- chore: Add logo to rat exclude list [#709](https://github.com/apache/datafusion-comet/pull/709) (andygrove)
+- chore: Use new logo in README and website [#724](https://github.com/apache/datafusion-comet/pull/724) (andygrove)
+- build: Add Comet logo files into exclude list [#726](https://github.com/apache/datafusion-comet/pull/726) (viirya)
+- chore: Remove TPC-DS benchmark results [#728](https://github.com/apache/datafusion-comet/pull/728) (andygrove)
+- chore: make Cast's logic reusable for other projects [#716](https://github.com/apache/datafusion-comet/pull/716) (Blizzara)
+- chore: move scalar_funcs into spark-expr [#712](https://github.com/apache/datafusion-comet/pull/712) (Blizzara)
+- chore: Bump DataFusion to rev 35c2e7e [#740](https://github.com/apache/datafusion-comet/pull/740) (andygrove)
+- chore: add more aggregate functions to benchmark test [#706](https://github.com/apache/datafusion-comet/pull/706) (huaxingao)
+- chore: Add criterion benchmark for decimal_div [#743](https://github.com/apache/datafusion-comet/pull/743) (andygrove)
+- build: Re-enable TPCDS q72 for broadcast and hash join configs [#781](https://github.com/apache/datafusion-comet/pull/781) (viirya)
+- chore: bump DataFusion to rev f4e519f [#783](https://github.com/apache/datafusion-comet/pull/783) (huaxingao)
+- chore: Upgrade to DataFusion rev bddb641 and disable "skip partial aggregates" feature [#788](https://github.com/apache/datafusion-comet/pull/788) (andygrove)
+- chore: Remove legacy code for adding a cast to a coalesce [#790](https://github.com/apache/datafusion-comet/pull/790) (andygrove)
+- chore: Use DataFusion 41.0.0-rc1 [#794](https://github.com/apache/datafusion-comet/pull/794) (andygrove)
+- chore: rename `CometRowToColumnar` and fix duplication bug [#785](https://github.com/apache/datafusion-comet/pull/785) (Kimahriman)
+- chore: Enable shuffle in micro benchmarks [#806](https://github.com/apache/datafusion-comet/pull/806) (andygrove)
+- Minor: ScanExec code cleanup and additional documentation [#804](https://github.com/apache/datafusion-comet/pull/804) (andygrove)
+- chore: Make it possible to run 'make benchmark-%' using jvm 17+ [#823](https://github.com/apache/datafusion-comet/pull/823) (eejbyfeldt)
+- chore: Add more unsupported cases to supportedSortType [#825](https://github.com/apache/datafusion-comet/pull/825) (viirya)
+- chore: Enable Comet shuffle with AQE coalesce partitions [#834](https://github.com/apache/datafusion-comet/pull/834) (viirya)
+- chore: Add GitHub workflow to publish Docker image [#847](https://github.com/apache/datafusion-comet/pull/847) (andygrove)
+- chore: Revert "fix: change the not exists base image apache/spark:3.4.3 to 3.4.2" [#854](https://github.com/apache/datafusion-comet/pull/854) (haoxins)
+- chore: fix docker-publish attempt 1 [#851](https://github.com/apache/datafusion-comet/pull/851) (andygrove)
+- minor: stop warning that AQEShuffleRead cannot run natively [#842](https://github.com/apache/datafusion-comet/pull/842) (andygrove)
+- chore: Improve ObjectHashAggregate fallback error message [#849](https://github.com/apache/datafusion-comet/pull/849) (andygrove)
+- chore: Fix docker image publishing (specify ghcr.io in tag) [#856](https://github.com/apache/datafusion-comet/pull/856) (andygrove)
+- chore: Use Git tag as Comet version when publishing Docker images [#857](https://github.com/apache/datafusion-comet/pull/857) (andygrove)
+
+## Credits
+
+Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
+
+```
+    36	Andy Grove
+    16	Liang-Chi Hsieh
+     9	KAZUYUKI TANIMURA
+     5	Emil Ejbyfeldt
+     4	Huaxin Gao
+     3	Adam Binford
+     3	Oleks V
+     3	Xuanwo
+     2	Arttu
+     2	Zhen Wang
+     1	Akhil S S
+     1	Alexander Ocsa
+     1	Parth Chandra
+     1	Xin Hao
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
diff --git a/dev/release/README.md b/dev/release/README.md
@@ -74,7 +74,7 @@ example generates a change log of all changes between the previous version and t
 
 ```shell
 export GITHUB_TOKEN=<your-token-here>
-python3 generate-changelog.py 52241f44315fd1b2fd6cd9031bb05f046fe3a5a3 branch-0.1 0.0.0 > ../changelog/0.1.0.md
+python3 generate-changelog.py 0.0.0 HEAD 0.1.0 > ../changelog/0.1.0.md
 ```
 
 Create a PR against the _main_ branch to add this change log and once this is approved and merged, cherry-pick the

diff --git a/dev/release/generate-changelog.py b/dev/release/generate-changelog.py
@@ -77,10 +77,10 @@ def generate_changelog(repo, repo_name, tag1, tag2, version):
         labels = [label.name for label in pull.labels]
         if 'api change' in labels or cc_breaking:
             breaking.append((pull, commit))
-        elif 'bug' in labels or cc_type == 'fix':
-            bugs.append((pull, commit))
         elif 'performance' in labels or cc_type == 'perf':
             performance.append((pull, commit))
+        elif 'bug' in labels or cc_type == 'fix':
+            bugs.append((pull, commit))
         elif 'enhancement' in labels or cc_type == 'feat':
             enhancements.append((pull, commit))
         elif 'documentation' in labels or cc_type == 'docs' or cc_type == 'doc':
@@ -123,9 +123,9 @@ def generate_changelog(repo, repo_name, tag1, tag2, version):
           f"See credits at the end of this changelog for more information.\n")
 
     print_pulls(repo_name, "Breaking changes", breaking)
+    print_pulls(repo_name, "Fixed bugs", bugs)
     print_pulls(repo_name, "Performance related", performance)
     print_pulls(repo_name, "Implemented enhancements", enhancements)
-    print_pulls(repo_name, "Fixed bugs", bugs)
     print_pulls(repo_name, "Documentation updates", docs)
     print_pulls(repo_name, "Other", other)
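
The reordered `elif` chain in the first hunk changes which section wins when a pull request matches more than one rule. The sketch below reproduces only that precedence, assuming a simplified `classify` function (a hypothetical name) that takes the label names and the conventional-commit type directly; the final `"Other"` fallback is also an assumption, since the hunk does not show the end of the chain.

```python
def classify(labels, cc_type, cc_breaking=False):
    """Return the changelog section for a PR, mirroring the order in the diff."""
    if "api change" in labels or cc_breaking:
        return "Breaking changes"
    elif "performance" in labels or cc_type == "perf":
        # Checked before the bug rule after this commit.
        return "Performance related"
    elif "bug" in labels or cc_type == "fix":
        return "Fixed bugs"
    elif "enhancement" in labels or cc_type == "feat":
        return "Implemented enhancements"
    elif "documentation" in labels or cc_type in ("docs", "doc"):
        return "Documentation updates"
    return "Other"  # assumed fallback, not shown in the hunk


# A "fix:" PR that also carries the performance label now lands under
# "Performance related" rather than "Fixed bugs".
print(classify(["performance"], "fix"))

# A "fix:" PR without that label is still a bug fix.
print(classify([], "fix"))
```

This precedence would explain why several `fix:` entries in the 0.2.0 changelog are listed under "Performance related", presumably because those pull requests carry the `performance` label.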
 

diff --git a/docs/source/_static/images/benchmark-results/2024-07-19/tpch_allqueries.png b/docs/source/_static/images/benchmark-results/2024-07-19/tpch_allqueries.png
diff --git a/docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_compare.png b/docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_compare.png
diff --git a/docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_speedup.png b/docs/source/_static/images/benchmark-results/2024-07-19/tpch_queries_speedup.png
diff --git a/docs/source/_static/images/benchmark-results/2024-08-23/tpcds_allqueries.png b/docs/source/_static/images/benchmark-results/2024-08-23/tpcds_allqueries.png
diff --git a/docs/source/_static/images/benchmark-results/2024-08-23/tpcds_queries_compare.png b/docs/source/_static/images/benchmark-results/2024-08-23/tpcds_queries_compare.png
diff --git a/...ource/_static/images/benchmark-results/2024-08-23/tpcds_queries_speedup_abs.png b/...ource/_static/images/benchmark-results/2024-08-23/tpcds_queries_speedup_abs.png
diff --git a/docs/source/_static/images/benchmark-results/2024-08-23/tpch_allqueries.png b/docs/source/_static/images/benchmark-results/2024-08-23/tpch_allqueries.png
diff --git a/docs/source/_static/images/benchmark-results/2024-08-23/tpch_queries_compare.png b/docs/source/_static/images/benchmark-results/2024-08-23/tpch_queries_compare.png
diff --git a/...source/_static/images/benchmark-results/2024-08-23/tpch_queries_speedup_rel.png b/...source/_static/images/benchmark-results/2024-08-23/tpch_queries_speedup_rel.png
diff --git a/docs/source/contributor-guide/benchmark-results/2024-07-19/datafusion-tpch.json b/docs/source/contributor-guide/benchmark-results/2024-07-19/datafusion-tpch.json