Merged

50 commits
9657b75
feat: support array_append (#1072)
NoeB Nov 13, 2024
c32bf0c
chore: Simplify CometShuffleMemoryAllocator to use Spark unified memo…
viirya Nov 14, 2024
f3da844
docs: Update benchmarking.md (#1085)
rluvaton-flarion Nov 14, 2024
2c832b4
feat: Require offHeap memory to be enabled (always use unified memory…
andygrove Nov 14, 2024
7cec285
test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE…
viirya Nov 15, 2024
10ef62a
Add changelog for 0.4.0 (#1089)
andygrove Nov 15, 2024
0c9a403
chore: Prepare for 0.5.0 development (#1090)
andygrove Nov 15, 2024
406ffef
build: Skip installation of spark-integration and fuzz testing modul…
parthchandra Nov 15, 2024
bfd7054
Add hint for finding the GPG key to use when publishing to maven (#1093)
andygrove Nov 15, 2024
59da6ce
docs: Update documentation for 0.4.0 release (#1096)
andygrove Nov 18, 2024
ca3a529
fix: Unsigned type related bugs (#1095)
kazuyukitanimura Nov 19, 2024
b64c13d
chore: Include first ScanExec batch in metrics (#1105)
andygrove Nov 20, 2024
19dd58d
chore: Improve CometScan metrics (#1100)
andygrove Nov 20, 2024
e602305
chore: Add custom metric for native shuffle fetching batches from JVM…
andygrove Nov 21, 2024
9990b34
feat: support array_insert (#1073)
SemyonSinchenko Nov 22, 2024
500895d
feat: enable decimal to decimal cast of different precision and scale…
himadripal Nov 22, 2024
7b1a290
docs: fix readme FGPA/FPGA typo (#1117)
gstvg Nov 24, 2024
5400fd7
fix: Use RDD partition index (#1112)
viirya Nov 25, 2024
ebdde77
fix: Various metrics bug fixes and improvements (#1111)
andygrove Dec 2, 2024
9b250c4
fix: Don't create CometScanExec for subclasses of ParquetFileFormat (…
Kimahriman Dec 2, 2024
95727aa
fix: Fix metrics regressions (#1132)
andygrove Dec 3, 2024
36a2307
docs: Add more technical detail and new diagram to Comet plugin overv…
andygrove Dec 3, 2024
2671e0c
Stop passing Java config map into native createPlan (#1101)
andygrove Dec 4, 2024
8d7bcb8
feat: Improve ScanExec native metrics (#1133)
andygrove Dec 6, 2024
587c29b
chore: Remove unused StringView struct (#1143)
andygrove Dec 6, 2024
b95dc1d
docs: Add some documentation explaining how shuffle works (#1148)
andygrove Dec 6, 2024
1c6c7a9
test: enable more Spark 4.0 tests (#1145)
kazuyukitanimura Dec 6, 2024
8d83cc1
chore: Refactor cast to use SparkCastOptions param (#1146)
andygrove Dec 6, 2024
21503ca
Enable more scenarios in CometExecBenchmark. (#1151)
mbutrovich Dec 7, 2024
73f1405
chore: Move more expressions from core crate to spark-expr crate (#1152)
andygrove Dec 9, 2024
5c45fdc
remove dead code (#1155)
andygrove Dec 10, 2024
2c1a6b9
fix: Spark 4.0-preview1 SPARK-47120 (#1156)
kazuyukitanimura Dec 11, 2024
49cf0d7
chore: Move string kernels and expressions to spark-expr crate (#1164)
andygrove Dec 12, 2024
7db9aa6
chore: Move remaining expressions to spark-expr crate + some minor re…
andygrove Dec 12, 2024
f1d0879
chore: Add ignored tests for reading complex types from Parquet (#1167)
andygrove Dec 12, 2024
b9ac78b
feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1…
andygrove Dec 17, 2024
46a28db
fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
parthchandra Dec 17, 2024
655081b
test: enabling Spark tests with offHeap requirement (#1177)
kazuyukitanimura Dec 18, 2024
e297d23
feat: Improve shuffle metrics (second attempt) (#1175)
andygrove Dec 18, 2024
0fccb59
Merge branch 'upstream_main' into merge_upstream_main
mbutrovich Dec 18, 2024
3b0bda3
Fix redundancy in Cargo.lock.
mbutrovich Dec 18, 2024
1ea24dd
Format, more post-merge cleanup.
mbutrovich Dec 18, 2024
2f4768d
Compiles
mbutrovich Dec 18, 2024
858f0de
Compiles
mbutrovich Dec 18, 2024
360c16d
Remove empty file.
mbutrovich Dec 18, 2024
f8eee9e
Attempt to fix JNI issue and native test build issues.
mbutrovich Dec 18, 2024
c13d6a0
Test Fix
parthchandra Dec 19, 2024
6814a99
Update planner.rs
mbutrovich Dec 19, 2024
a8355f0
Merge pull request #4 from parthchandra/merge_upstream_main
mbutrovich Dec 19, 2024
1630632
Merge remote-tracking branch 'upstream/main' into merge_upstream_main
mbutrovich Dec 19, 2024
2 changes: 1 addition & 1 deletion .github/actions/setup-spark-builder/action.yaml
@@ -29,7 +29,7 @@ inputs:
comet-version:
description: 'The Comet version to use for Spark'
required: true
- default: '0.4.0-SNAPSHOT'
+ default: '0.5.0-SNAPSHOT'
runs:
using: "composite"
steps:
2 changes: 1 addition & 1 deletion .github/workflows/spark_sql_test.yml
@@ -71,7 +71,7 @@ jobs:
with:
spark-version: ${{ matrix.spark-version.full }}
spark-short-version: ${{ matrix.spark-version.short }}
- comet-version: '0.4.0-SNAPSHOT' # TODO: get this from pom.xml
+ comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
- name: Run Spark tests
run: |
cd apache-spark
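The `comet-version` above is hard-coded, and the `# TODO: get this from pom.xml` comment suggests deriving it instead. A minimal sketch of one way to do that in a workflow step, assuming Maven is available on the runner; the step id and output name are hypothetical and not part of this PR:

```bash
# Hypothetical workflow step: read the project version from the root pom.xml
# instead of hard-coding '0.5.0-SNAPSHOT'. Assumes mvn is on the runner's PATH.
COMET_VERSION=$(mvn -q -DforceStdout help:evaluate -Dexpression=project.version)
echo "comet-version=${COMET_VERSION}" >> "$GITHUB_OUTPUT"
```

A later step could then pass the output (for example `${{ steps.get-version.outputs.comet-version }}`) to the setup action instead of the literal string. The same pattern would apply to the ANSI workflow below.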
2 changes: 1 addition & 1 deletion .github/workflows/spark_sql_test_ansi.yml
@@ -69,7 +69,7 @@ jobs:
with:
spark-version: ${{ matrix.spark-version.full }}
spark-short-version: ${{ matrix.spark-version.short }}
- comet-version: '0.4.0-SNAPSHOT' # TODO: get this from pom.xml
+ comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
- name: Run Spark tests
run: |
cd apache-spark
12 changes: 6 additions & 6 deletions README.md
@@ -46,7 +46,7 @@ The following chart shows the time it takes to run the 22 TPC-H queries against
using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
for details of the environment used for these benchmarks.

- When using Comet, the overall run time is reduced from 616 seconds to 374 seconds, a 1.6x speedup, with query 1
+ When using Comet, the overall run time is reduced from 615 seconds to 364 seconds, a 1.7x speedup, with query 1
running 9x faster than Spark.

Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.6x
@@ -55,21 +55,21 @@ speedup compared to Spark.
Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup
for a broader set of queries.

- ![](docs/source/_static/images/benchmark-results/0.3.0/tpch_allqueries.png)
+ ![](docs/source/_static/images/benchmark-results/0.4.0/tpch_allqueries.png)

Here is a breakdown showing relative performance of Spark, Comet, and DataFusion for each TPC-H query.

- ![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_compare.png)
+ ![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_compare.png)

The following charts shows how much Comet currently accelerates each query from the benchmark.

### Relative speedup

- ![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_speedup_rel.png)
+ ![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_speedup_rel.png)

### Absolute speedup

- ![](docs/source/_static/images/benchmark-results/0.3.0/tpch_queries_speedup_abs.png)
+ ![](docs/source/_static/images/benchmark-results/0.4.0/tpch_queries_speedup_abs.png)

These benchmarks can be reproduced in any environment using the documentation in the
[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
@@ -80,7 +80,7 @@ Results for our benchmark derived from TPC-DS are available in the [benchmarking
## Use Commodity Hardware

Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
- specialized hardware accelerators, such as GPUs or FGPA. By maximizing the utilization of commodity hardware, Comet
+ specialized hardware accelerators, such as GPUs or FPGA. By maximizing the utilization of commodity hardware, Comet
ensures cost-effectiveness and scalability for your Spark deployments.

## Spark Compatibility
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -62,7 +62,7 @@ docker push localhost:32000/apache/datafusion-comet-tpcbench:latest
export SPARK_MASTER=k8s://https://127.0.0.1:16443
export COMET_DOCKER_IMAGE=localhost:32000/apache/datafusion-comet-tpcbench:latest
# Location of Comet JAR within the Docker image
- export COMET_JAR=/opt/spark/jars/comet-spark-spark3.4_2.12-0.2.0-SNAPSHOT.jar
+ export COMET_JAR=/opt/spark/jars/comet-spark-spark3.4_2.12-0.5.0-SNAPSHOT.jar

$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
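The benchmark README exports `COMET_JAR` and begins a `spark-submit` invocation that is truncated in this view. For reference only, and not as the file's actual continuation, a Comet-enabled submission typically looks something like the sketch below; the flags and sizes are assumptions, and the off-heap settings reflect the requirement noted in the 0.4.0 changelog further down:

```bash
# Illustrative sketch only, not the continuation of benchmarks/README.md.
# Assumes SPARK_MASTER, COMET_JAR, and a benchmark entry point are defined.
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=8g \
    <benchmark driver class or script, plus its arguments>
```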
2 changes: 1 addition & 1 deletion common/pom.xml
@@ -26,7 +26,7 @@ under the License.
<parent>
<groupId>org.apache.datafusion</groupId>
<artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
- <version>0.4.0-SNAPSHOT</version>
+ <version>0.5.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>

19 changes: 17 additions & 2 deletions common/src/main/scala/org/apache/comet/CometConf.scala
@@ -342,8 +342,10 @@ object CometConf extends ShimCometConf {

val COMET_COLUMNAR_SHUFFLE_MEMORY_SIZE: OptionalConfigEntry[Long] =
conf("spark.comet.columnar.shuffle.memorySize")
+ .internal()
.doc(
- "The optional maximum size of the memory used for Comet columnar shuffle, in MiB. " +
+ "Test-only config. This is only used to test Comet shuffle with Spark tests. " +
+ "The optional maximum size of the memory used for Comet columnar shuffle, in MiB. " +
"Note that this config is only used when `spark.comet.exec.shuffle.mode` is " +
"`jvm`. Once allocated memory size reaches this config, the current batch will be " +
"flushed to disk immediately. If this is not configured, Comet will use " +
@@ -355,8 +357,10 @@

val COMET_COLUMNAR_SHUFFLE_MEMORY_FACTOR: ConfigEntry[Double] =
conf("spark.comet.columnar.shuffle.memory.factor")
+ .internal()
.doc(
- "Fraction of Comet memory to be allocated per executor process for Comet shuffle. " +
+ "Test-only config. This is only used to test Comet shuffle with Spark tests. " +
+ "Fraction of Comet memory to be allocated per executor process for Comet shuffle. " +
"Comet memory size is specified by `spark.comet.memoryOverhead` or " +
"calculated by `spark.comet.memory.overhead.factor` * `spark.executor.memory`.")
.doubleConf
@@ -365,6 +369,17 @@
"Ensure that Comet shuffle memory overhead factor is a double greater than 0")
.createWithDefault(1.0)

+ val COMET_COLUMNAR_SHUFFLE_UNIFIED_MEMORY_ALLOCATOR_IN_TEST: ConfigEntry[Boolean] =
+ conf("spark.comet.columnar.shuffle.unifiedMemoryAllocatorTest")
+ .doc("Whether to use Spark unified memory allocator for Comet columnar shuffle in tests." +
+ "If not configured, Comet will use a test-only memory allocator for Comet columnar " +
+ "shuffle when Spark test env detected. The test-ony allocator is proposed to run with " +
+ "Spark tests as these tests require on-heap memory configuration. " +
+ "By default, this config is false.")
+ .internal()
+ .booleanConf
+ .createWithDefault(false)

val COMET_COLUMNAR_SHUFFLE_BATCH_SIZE: ConfigEntry[Int] =
conf("spark.comet.columnar.shuffle.batch.size")
.internal()
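The CometConf changes above demote the columnar shuffle memory settings to internal, test-only configs; outside of tests, Comet relies on Spark's unified off-heap memory (see the entry in the 0.4.0 changelog below that requires off-heap memory to be enabled). A minimal sketch of the user-facing settings that matter after this change, with illustrative values and assuming the Comet jar is already on the classpath:

```bash
# Illustrative only; the internal, test-only configs above are deliberately not set.
# The shuffle manager class name follows the Comet installation docs.
$SPARK_HOME/bin/spark-shell \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=jvm \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g
```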
108 changes: 108 additions & 0 deletions dev/changelog/0.4.0.md
@@ -0,0 +1,108 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion Comet 0.4.0 Changelog

This release consists of 51 commits from 10 contributors. See credits at the end of this changelog for more information.

**Fixed bugs:**

- fix: Use the number of rows from underlying arrays instead of logical row count from RecordBatch [#972](https://github.com/apache/datafusion-comet/pull/972) (viirya)
- fix: The spilled_bytes metric of CometSortExec should be size instead of time [#984](https://github.com/apache/datafusion-comet/pull/984) (Kontinuation)
- fix: Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path [#982](https://github.com/apache/datafusion-comet/pull/982) (Kontinuation)
- fix: Fallback to Spark if scan has meta columns [#997](https://github.com/apache/datafusion-comet/pull/997) (viirya)
- fix: Fallback to Spark if named_struct contains duplicate field names [#1016](https://github.com/apache/datafusion-comet/pull/1016) (viirya)
- fix: Make comet-git-info.properties optional [#1027](https://github.com/apache/datafusion-comet/pull/1027) (andygrove)
- fix: TopK operator should return correct results on dictionary column with nulls [#1033](https://github.com/apache/datafusion-comet/pull/1033) (viirya)
- fix: need default value for getSizeAsMb(EXECUTOR_MEMORY.key) [#1046](https://github.com/apache/datafusion-comet/pull/1046) (neyama)

**Performance related:**

- perf: Remove one redundant CopyExec for SMJ [#962](https://github.com/apache/datafusion-comet/pull/962) (andygrove)
- perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin [#1007](https://github.com/apache/datafusion-comet/pull/1007) (andygrove)
- perf: Cache jstrings during metrics collection [#1029](https://github.com/apache/datafusion-comet/pull/1029) (mbutrovich)

**Implemented enhancements:**

- feat: Support `GetArrayStructFields` expression [#993](https://github.com/apache/datafusion-comet/pull/993) (Kimahriman)
- feat: Implement bloom_filter_agg [#987](https://github.com/apache/datafusion-comet/pull/987) (mbutrovich)
- feat: Support more types with BloomFilterAgg [#1039](https://github.com/apache/datafusion-comet/pull/1039) (mbutrovich)
- feat: Implement CAST from struct to string [#1066](https://github.com/apache/datafusion-comet/pull/1066) (andygrove)
- feat: Use official DataFusion 43 release [#1070](https://github.com/apache/datafusion-comet/pull/1070) (andygrove)
- feat: Implement CAST between struct types [#1074](https://github.com/apache/datafusion-comet/pull/1074) (andygrove)
- feat: support array_append [#1072](https://github.com/apache/datafusion-comet/pull/1072) (NoeB)
- feat: Require offHeap memory to be enabled (always use unified memory) [#1062](https://github.com/apache/datafusion-comet/pull/1062) (andygrove)

**Documentation updates:**

- doc: add documentation interlinks [#975](https://github.com/apache/datafusion-comet/pull/975) (comphead)
- docs: Add IntelliJ documentation for generated source code [#985](https://github.com/apache/datafusion-comet/pull/985) (mbutrovich)
- docs: Update tuning guide [#995](https://github.com/apache/datafusion-comet/pull/995) (andygrove)
- docs: Various documentation improvements [#1005](https://github.com/apache/datafusion-comet/pull/1005) (andygrove)
- docs: clarify that Maven central only has jars for Linux [#1009](https://github.com/apache/datafusion-comet/pull/1009) (andygrove)
- doc: fix K8s links and doc [#1058](https://github.com/apache/datafusion-comet/pull/1058) (comphead)
- docs: Update benchmarking.md [#1085](https://github.com/apache/datafusion-comet/pull/1085) (rluvaton-flarion)

**Other:**

- chore: Generate changelog for 0.3.0 release [#964](https://github.com/apache/datafusion-comet/pull/964) (andygrove)
- chore: fix publish-to-maven script [#966](https://github.com/apache/datafusion-comet/pull/966) (andygrove)
- chore: Update benchmarks results based on 0.3.0-rc1 [#969](https://github.com/apache/datafusion-comet/pull/969) (andygrove)
- chore: update rem expression guide [#976](https://github.com/apache/datafusion-comet/pull/976) (kazuyukitanimura)
- chore: Enable additional CreateArray tests [#928](https://github.com/apache/datafusion-comet/pull/928) (Kimahriman)
- chore: fix compatibility guide [#978](https://github.com/apache/datafusion-comet/pull/978) (kazuyukitanimura)
- chore: Update for 0.3.0 release, prepare for 0.4.0 development [#970](https://github.com/apache/datafusion-comet/pull/970) (andygrove)
- chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled [#991](https://github.com/apache/datafusion-comet/pull/991) (viirya)
- chore: Make parquet reader options Comet options instead of Hadoop options [#968](https://github.com/apache/datafusion-comet/pull/968) (parthchandra)
- chore: remove legacy comet-spark-shell [#1013](https://github.com/apache/datafusion-comet/pull/1013) (andygrove)
- chore: Reserve memory for native shuffle writer per partition [#988](https://github.com/apache/datafusion-comet/pull/988) (viirya)
- chore: Bump arrow-rs to 53.1.0 and datafusion [#1001](https://github.com/apache/datafusion-comet/pull/1001) (kazuyukitanimura)
- chore: Revert "chore: Reserve memory for native shuffle writer per partition (#988)" [#1020](https://github.com/apache/datafusion-comet/pull/1020) (viirya)
- minor: Remove hard-coded version number from Dockerfile [#1025](https://github.com/apache/datafusion-comet/pull/1025) (andygrove)
- chore: Reserve memory for native shuffle writer per partition [#1022](https://github.com/apache/datafusion-comet/pull/1022) (viirya)
- chore: Improve error handling when native lib fails to load [#1000](https://github.com/apache/datafusion-comet/pull/1000) (andygrove)
- chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation [#1041](https://github.com/apache/datafusion-comet/pull/1041) (NoeB)
- chore: Refactor Arrow Array and Schema allocation in ColumnReader and MetadataColumnReader [#1047](https://github.com/apache/datafusion-comet/pull/1047) (viirya)
- minor: Refactor binary expr serde to reduce code duplication [#1053](https://github.com/apache/datafusion-comet/pull/1053) (andygrove)
- chore: Upgrade to DataFusion 43.0.0-rc1 [#1057](https://github.com/apache/datafusion-comet/pull/1057) (andygrove)
- chore: Refactor UnaryExpr and MathExpr in protobuf [#1056](https://github.com/apache/datafusion-comet/pull/1056) (andygrove)
- minor: use defaults instead of hard-coding values [#1060](https://github.com/apache/datafusion-comet/pull/1060) (andygrove)
- minor: refactor UnaryExpr handling to make code more concise [#1065](https://github.com/apache/datafusion-comet/pull/1065) (andygrove)
- chore: Refactor binary and math expression serde code [#1069](https://github.com/apache/datafusion-comet/pull/1069) (andygrove)
- chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator [#1063](https://github.com/apache/datafusion-comet/pull/1063) (viirya)
- test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config [#1087](https://github.com/apache/datafusion-comet/pull/1087) (viirya)

## Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

```
19 Andy Grove
13 Matt Butrovich
8 Liang-Chi Hsieh
3 KAZUYUKI TANIMURA
2 Adam Binford
2 Kristin Cowalcijk
1 NoeB
1 Oleks V
1 Parth Chandra
1 neyama
```

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
2 changes: 1 addition & 1 deletion dev/diffs/3.4.3.diff
@@ -7,7 +7,7 @@ index d3544881af1..bf0e2b53c70 100644
<ivy.version>2.5.1</ivy.version>
<oro.version>2.0.8</oro.version>
+ <spark.version.short>3.4</spark.version.short>
+ <comet.version>0.4.0-SNAPSHOT</comet.version>
+ <comet.version>0.5.0-SNAPSHOT</comet.version>
<!--
If you changes codahale.metrics.version, you also need to change
the link to metrics.dropwizard.io in docs/monitoring.md.
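The files under `dev/diffs/` patch a Spark source checkout so that Spark's own SQL test suites run against the Comet jars, which is why the `comet.version` inside the diff has to track the 0.5.0-SNAPSHOT bump. A plausible local sequence for applying the 3.4.3 diff, with repository paths assumed:

```bash
# Assumes the Comet repo lives at ../datafusion-comet and that its
# 0.5.0-SNAPSHOT jars have been installed to the local Maven repository.
git clone --branch v3.4.3 --depth 1 https://github.com/apache/spark.git apache-spark
cd apache-spark
git apply ../datafusion-comet/dev/diffs/3.4.3.diff
```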
2 changes: 1 addition & 1 deletion dev/diffs/3.5.1.diff
@@ -7,7 +7,7 @@ index 0f504dbee85..f6019da888a 100644
<ivy.version>2.5.1</ivy.version>
<oro.version>2.0.8</oro.version>
+ <spark.version.short>3.5</spark.version.short>
+ <comet.version>0.4.0-SNAPSHOT</comet.version>
+ <comet.version>0.5.0-SNAPSHOT</comet.version>
<!--
If you changes codahale.metrics.version, you also need to change
the link to metrics.dropwizard.io in docs/monitoring.md.