Description
This epic is for improving shuffle / ScanExec performance.
Issues
- Implement faster single batch encoding/decoding for use in shuffle #1189
- Allow native shuffle batch size to be configured separately from comet default batch size #1187
- Add support for lz4 compression in shuffle #1178
- Comet native shuffle reader #1125
- Can we stop copying the Arrow schema over FFI for every batch? #1115
- Possible native shuffle optimization #977
- Optimize repartitioning logic in ShuffleWriterExec using interleave_record_batch #1235
- Native shuffle double allocates memory #1448
- Native shuffle inaccurate estimate of builder memory allocation #1449
- Re-implement memory management in native shuffle writer #1446
- Columnar shuffle uses wrong memory allocator in unified memory mode #1438
- Optimize native shuffle for single partition case #1453
Context
I have been comparing Comet and Ballista performance for TPC-H q3. Both execute similar native plans. I am using the comet-parquet-exec branch, which uses DataFusion's ParquetExec.
Ballista is approximately 3x faster than Comet. Given that they are executing similar DataFusion native plans, I would expect performance to be similar.
The main difference between Comet and Ballista is that Comet transfers batches between JVM and native code during shuffle operations.
Most of the native execution time in Comet is spent in ScanExec, which reads Arrow batches from the JVM using Arrow FFI. This time was not included in our metrics prior to #1128 and #1111.
