chore: Comet parquet exec merge from main(20250114) #1293

parthchandra · 2025-01-15T22:58:19Z

Brings comet-parquet-exec almost up to date with main.
There are three new test failures, which will be addressed in subsequent PRs:

- Broadcast HashJoin without join filter *** FAILED *** (467 milliseconds)
- Broadcast HashJoin with join filter *** FAILED *** (464 milliseconds)
- bucketed table *** FAILED *** (934 milliseconds)

* feat: support array_append
* formatted code
* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde
* remove unwrap
* Fix for Spark 3.3
* refactor array_append binary expression serde code
* Disabled array_append test for spark 4.0+

…ry allocator (apache#1063)

apache#1062) * Require offHeap memory * remove unused import * use off heap memory in stability tests * reorder imports

… config (apache#1087)

* Update version number for build * update docs

apache#1091)

…che#1093)

* update TPC-H results * update Maven links * update benchmarking guide and add TPC-DS results * include q72

## Which issue does this PR close?

Closes apache#1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

- For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
- `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
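As a rough illustration of the type choices described above, the sketch below checks two plain Rust facts: a `u8` only round-trips losslessly through the next wider signed type, and the largest `u64` needs 20 decimal digits, which is why a precision-20, scale-0 decimal can hold it. The helper names `widen_u8` and `decimal_digits` are hypothetical and are not taken from the Comet code base; this does not reproduce `generate_cast_to_signed!` or the `round()` fix.

```rust
/// Hypothetical helper: widen an unsigned 8-bit value into the next
/// wider signed type, which preserves the numeric value.
fn widen_u8(v: u8) -> i16 {
    i16::from(v)
}

/// Hypothetical helper: count the base-10 digits needed to print a u64.
fn decimal_digits(mut v: u64) -> u32 {
    let mut digits = 1;
    while v >= 10 {
        v /= 10;
        digits += 1;
    }
    digits
}

fn main() {
    // Reinterpreting 255u8 at the same width flips it to a negative i8,
    // while widening to i16 keeps the value intact.
    println!("255u8 as i8  = {}", 255u8 as i8);
    println!("255u8 as i16 = {}", widen_u8(255));

    // u64::MAX needs one more decimal digit than i64::MAX.
    println!("digits in u64::MAX = {}", decimal_digits(u64::MAX));
    println!("digits in i64::MAX = {}", decimal_digits(i64::MAX as u64));
}
```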

* include first batch in ScanExec metrics * record row count metric * fix regression

* Add native metrics for plan creation * make messages consistent * Include get_next_batch cost in metrics * formatting * fix double count of rows

…apache#1108)

* Part of the implementation of array_insert
* Missing methods
* Working version
* Reformat code
* Fix code-style
* Add comments about spark's implementation.
* Implement negative indices + fix tests for spark < 3.4
* Fix code-style
* Fix scalastyle
* Fix tests for spark < 3.4
* Fixes & tests
  - added test for the negative index
  - added test for the legacy spark mode
* Use assume(isSpark34Plus) in tests
* Test else-branch & improve coverage
* Update native/spark-expr/src/list.rs

  Co-authored-by: Andy Grove <agrove@apache.org>
* Fix fallback test

  In one case there is a zero in index and test fails due to spark error
* Adjust the behaviour for the NULL case to Spark
* Move the logic of type checking to the method
* Fix code-style

---------

Co-authored-by: Andy Grove <agrove@apache.org>

…apache#1086)

* enable decimal to decimal cast of different precision and scale
* add more test cases for negative scale and higher precision
* add check for compatibility for decimal to decimal
* fix code style
* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

  Co-authored-by: Andy Grove <agrove@apache.org>
* fix the nit in comment

---------

Co-authored-by: himadripal <hpal@apple.com>
Co-authored-by: Andy Grove <agrove@apache.org>

* fix: Use RDD partition index * fix * fix * fix

…pache#1129) * Use exact class comparison for parquet scan * Add test * Add comment

* fix metrics issues * clippy * update tests

…iew (apache#1119) * Add more technical detail and new diagram to Comet plugin overview * update diagram * add info on Arrow IPC * update diagram * update diagram * update docs * address feedback

* save * remove shuffle jvm metric and update tuning guide * docs * add source for all ScanExecs * address feedback * address feedback

* Remove unused StringView struct * remove more dead code

* add some notes on shuffle * reads * improve docs

## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* Refactor cast to use SparkCastOptions param * update tests * update benches * update benches * update benches

…che#1152) * move aggregate expressions to spark-expr crate * move more expressions * move benchmark * normalize_nan * bitwise not * comet scalar funcs * update bench imports

…d add LZ4 & Snappy support (apache#1192)

* Implement native decoding and decompression
* revert some variable renaming for smaller diff
* fix oom issues?
* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator
* fix oom and prep for review
* format
* Add LZ4 support
* clippy, new benchmark
* rename metrics, clean up lz4 code
* update test
* Add support for snappy
* format
* change default back to lz4
* make metrics more accurate
* format
* clippy
* use faster unsafe version of lz4_flex
* Make compression codec configurable for columnar shuffle
* clippy
* fix bench
* fmt
* address feedback
* address feedback
* address feedback
* minor code simplification
* cargo fmt
* overflow check
* rename compression level config
* address feedback
* address feedback
* rename constant

…ng (apache#1224) * extract agg_funcs expressions to folders based on spark grouping * fix rebase

…apache#1222) Co-authored-by: Andy Grove <agrove@apache.org>

…pache#1215)

…ark grouping (apache#1218) * extract predicate_functions expressions to folders based on spark grouping * code review changes --------- Co-authored-by: Andy Grove <agrove@apache.org>

…ache#1220) Co-authored-by: Andy Grove <agrove@apache.org>

## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test

…ing (apache#1221) * extract hash_funcs expressions to folders based on spark grouping * extract hash_funcs expressions to folders based on spark grouping --------- Co-authored-by: Andy Grove <agrove@apache.org>

… in window aggregates (apache#1253)

…huffle when native shuffle is not supported (apache#1209)

* wip: array remove * added comet expression test * updated test cases * fixed array_remove function for null values * removed commented code * remove unnecessary code * updated the test for 'array_remove' * added test for array_remove in case the input array is null * wip: case array is empty * removed test case for empty array

* fall back to Spark for distinct aggregates * update expected plans for 3.4 * update expected plans for 3.5 * force build * add comment

…formance (apache#1190)

* Implement faster encoder for shuffle blocks
* make code more concise
* enable fast encoding for columnar shuffle
* update benches
* test all int types
* test float
* remaining types
* add Snappy and Zstd(6) back to benchmark
* fix regression
* Update native/core/src/execution/shuffle/codec.rs

  Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* address feedback
* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* fix: disable initCap by default * Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala Co-authored-by: Andy Grove <agrove@apache.org> * address review comments --------- Co-authored-by: Andy Grove <agrove@apache.org>

* Add changelog * revert accidental change * move 2 items to performance section

* fix: cast timestamp to decimal is unsupported * fix style * revert test name and mark as ignore * add comment

andygrove · 2025-01-15T23:40:26Z

TPC-H times (seconds):

comet_native 327.14
comet_datafusion 341.61
comet_iceberg_compat 297.71 🔥 (first sub-300 timing I have seen)

Our published time for 0.5.0 is 331 s
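For a quick comparison against the published figure, the relative change can be computed as in the sketch below. It assumes all four numbers are wall-clock seconds from comparable runs, which the comment does not state explicitly; `pct_change` is a hypothetical helper name.

```rust
/// Hypothetical helper: percentage change of `measured` relative to
/// `baseline` (negative means faster than the baseline).
fn pct_change(baseline: f64, measured: f64) -> f64 {
    (measured - baseline) / baseline * 100.0
}

fn main() {
    // Published 0.5.0 time quoted in the comment above, assumed to be seconds.
    let baseline = 331.0;
    // Timings quoted in the comment above.
    let runs = [
        ("comet_native", 327.14),
        ("comet_datafusion", 341.61),
        ("comet_iceberg_compat", 297.71),
    ];
    for (name, time) in runs {
        println!("{name}: {:+.1}% vs 0.5.0", pct_change(baseline, time));
    }
}
```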

andygrove

I see 3 failing tests in CI, as expected

Tests: succeeded 796, failed 3, canceled 0, ignored 50, pending 0
*** 3 TESTS FAILED ***

NoeB and others added 30 commits November 13, 2024 16:57

chore: Simplify CometShuffleMemoryAllocator to use Spark unified memo…

c32bf0c

…ry allocator (apache#1063)

docs: Update benchmarking.md (apache#1085)

f3da844

feat: Require offHeap memory to be enabled (always use unified memory) (

2c832b4

apache#1062) * Require offHeap memory * remove unused import * use off heap memory in stability tests * reorder imports

test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE…

7cec285

… config (apache#1087)

Add changelog for 0.4.0 (apache#1089)

10ef62a

chore: Prepare for 0.5.0 development (apache#1090)

0c9a403

* Update version number for build * update docs

build: Skip installation of spark-integration and fuzz testing modules (

406ffef

apache#1091)

Add hint for finding the GPG key to use when publishing to maven (apa…

bfd7054

…che#1093)

docs: Update documentation for 0.4.0 release (apache#1096)

59da6ce

* update TPC-H results * update Maven links * update benchmarking guide and add TPC-DS results * include q72

chore: Include first ScanExec batch in metrics (apache#1105)

b64c13d

* include first batch in ScanExec metrics * record row count metric * fix regression

chore: Improve CometScan metrics (apache#1100)

19dd58d

* Add native metrics for plan creation * make messages consistent * Include get_next_batch cost in metrics * formatting * fix double count of rows

chore: Add custom metric for native shuffle fetching batches from JVM (…

e602305

…apache#1108)

docs: fix readme FGPA/FPGA typo (apache#1117)

7b1a290

fix: Use RDD partition index (apache#1112)

5400fd7

* fix: Use RDD partition index * fix * fix * fix

fix: Various metrics bug fixes and improvements (apache#1111)

ebdde77

fix: Don't create CometScanExec for subclasses of ParquetFileFormat (a…

9b250c4

…pache#1129) * Use exact class comparison for parquet scan * Add test * Add comment

fix: Fix metrics regressions (apache#1132)

95727aa

* fix metrics issues * clippy * update tests

docs: Add more technical detail and new diagram to Comet plugin overv…

36a2307

…iew (apache#1119) * Add more technical detail and new diagram to Comet plugin overview * update diagram * add info on Arrow IPC * update diagram * update diagram * update docs * address feedback

Stop passing Java config map into native createPlan (apache#1101)

2671e0c

feat: Improve ScanExec native metrics (apache#1133)

8d7bcb8

* save * remove shuffle jvm metric and update tuning guide * docs * add source for all ScanExecs * address feedback * address feedback

chore: Remove unused StringView struct (apache#1143)

587c29b

* Remove unused StringView struct * remove more dead code

docs: Add some documentation explaining how shuffle works (apache#1148)

b95dc1d

* add some notes on shuffle * reads * improve docs

chore: Refactor cast to use SparkCastOptions param (apache#1146)

8d83cc1

* Refactor cast to use SparkCastOptions param * update tests * update benches * update benches * update benches

Enable more scenarios in CometExecBenchmark. (apache#1151)

21503ca

chore: Move more expressions from core crate to spark-expr crate (apa…

73f1405

…che#1152) * move aggregate expressions to spark-expr crate * move more expressions * move benchmark * normalize_nan * bitwise not * comet scalar funcs * update bench imports

andygrove and others added 26 commits January 6, 2025 17:47

chore: extract agg_funcs expressions to folders based on spark groupi…

3f0d442

…ng (apache#1224) * extract agg_funcs expressions to folders based on spark grouping * fix rebase

extract datetime_funcs expressions to folders based on spark grouping (…

4cf840f

…apache#1222) Co-authored-by: Andy Grove <agrove@apache.org>

chore: use datafusion from crates.io (apache#1232)

508db06

chore: extract strings file to strings_func like in spark grouping (a…

c19202c

…pache#1215)

chore: extract predicate_functions expressions to folders based on sp…

fbcf025

…ark grouping (apache#1218) * extract predicate_functions expressions to folders based on spark grouping * code review changes --------- Co-authored-by: Andy Grove <agrove@apache.org>

build(deps): bump protobuf version to 3.21.12 (apache#1234)

ca7b4a8

extract json_funcs expressions to folders based on spark grouping (ap…

c6acc9d

…ache#1220) Co-authored-by: Andy Grove <agrove@apache.org>

chore: extract hash_funcs expressions to folders based on spark group…

e731b6e

…ing (apache#1221) * extract hash_funcs expressions to folders based on spark grouping * extract hash_funcs expressions to folders based on spark grouping --------- Co-authored-by: Andy Grove <agrove@apache.org>

fix: Fall back to Spark for unsupported partition or sort expressions…

be48839

… in window aggregates (apache#1253)

perf: Improve query planning to more reliably fall back to columnar s…

d15d051

…huffle when native shuffle is not supported (apache#1209)

fix regression (apache#1259)

d52038e

fix: Fall back to Spark for distinct aggregates (apache#1262)

e8261fb

* fall back to Spark for distinct aggregates * update expected plans for 3.4 * update expected plans for 3.5 * force build * add comment

docs: Update TPC-H benchmark results (apache#1257)

1eb932a

fix: disable initCap by default (apache#1276)

9fe5420

* fix: disable initCap by default * Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala Co-authored-by: Andy Grove <agrove@apache.org> * address review comments --------- Co-authored-by: Andy Grove <agrove@apache.org>

chore: Add changelog for 0.5.0 (apache#1278)

cbe50e1

* Add changelog * revert accidental change * move 2 items to performance section

update TPC-DS results for 0.5.0 (apache#1277)

08d892a

fix: cast timestamp to decimal is unsupported (apache#1281)

9c1f0ee

* fix: cast timestamp to decimal is unsupported * fix style * revert test name and mark as ignore * add comment

Merge branch 'main' into comet-parquet-exec

017963a

Fix build after merge

285396c

Fix tests after merge

b3703f5

Fix plans after merge

2c83bdd

fix partition id in execute plan after merge (from Andy Grove)

79717b8

andygrove approved these changes Jan 16, 2025

parthchandra changed the title ~~Comet parquet exec merge from main(20250114)~~ chore: Comet parquet exec merge from main(20250114) Jan 16, 2025

andygrove merged commit 32b9338 into apache:comet-parquet-exec Jan 16, 2025
68 of 150 checks passed

17 participants
