Description
This epic is for improving shuffle / ScanExec performance.
Issues
- Implement faster single batch encoding/decoding for use in shuffle #1189
- Allow native shuffle batch size to be configured separately from comet default batch size #1187
- Add support for lz4 compression in shuffle #1178
- Comet native shuffle reader #1125
- Can we stop copying the Arrow schema over FFI for every batch? #1115
- Possible native shuffle optimization #977
- Optimize repartitioning logic in ShuffleWriterExec using interleave_record_batch #1235
- Native shuffle double allocates memory #1448
- Native shuffle inaccurate estimate of builder memory allocation #1449
- Re-implement memory management in native shuffle writer #1446
- Columnar shuffle uses wrong memory allocator in unified memory mode #1438
- Optimize native shuffle for single partition case #1453
Context
I have been comparing Comet and Ballista performance for TPC-H q3. Both execute similar native plans. I am using the comet-parquet-exec branch, which uses DataFusion's ParquetExec.
Ballista is approximately 3x faster than Comet. Given that they are executing similar DataFusion native plans, I would expect performance to be similar.
The main difference between Comet and Ballista is that Comet transfers batches between JVM and native code during shuffle operations.
Most of the native execution time in Comet is spent in ScanExec, which reads Arrow batches from the JVM using Arrow FFI. This time was not included in our metrics prior to #1128 and #1111.
