Skip to content

[EPIC] Improve shuffle performance #1123

@andygrove

Description

@andygrove

This epic is for improving shuffle / ScanExec performance.

Issues

Context

I have been comparing Comet and Ballista performance for TPC-H q3. Both execute similar native plans. I am using the comet-parquet-exec branch which uses DataFusion's ParquetExec.

Ballista is approximately 3x faster than Comet. Given that they are executing similar DataFusion native plans, I would expect performance to be similar.

The main difference between Comet and Ballista is that Comet transfers batches between JVM and native code during shuffle operations.

Most of the native execution time in Comet is spent in ScanExec which is reading Arrow batches from the JVM using Arrow FFI. This time was not included in our metrics prior to #1128 and #1111.

Screenshot from 2024-11-26 12-30-01

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions