
Investigate performance tradeoff in compressing spill files #16367

Open
@ding-young

Description

Is your feature request related to a problem or challenge?

Part of #16065, related to #14078.

Background

PR #16268 introduces a compression option for writing spill files to disk. The options are limited to what the Arrow IPC Stream Writer supports: zstd and lz4_frame (the compression level is not configurable).
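
For reference, a minimal sketch of how such a writer can be configured with arrow-rs (assuming the `IpcWriteOptions::try_with_compression` API; the corresponding `zstd` / `lz4` features of `arrow-ipc` must be enabled):

```rust
use std::fs::File;

use arrow::error::Result;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

/// Write a single batch to a spill file with zstd-compressed IPC buffers;
/// `CompressionType::LZ4_FRAME` is configured the same way.
fn write_compressed_spill(path: &str, batch: &RecordBatch) -> Result<()> {
    let options = IpcWriteOptions::default()
        .try_with_compression(Some(CompressionType::ZSTD))?;
    let file = File::create(path)?;
    let mut writer =
        StreamWriter::try_new_with_options(file, batch.schema().as_ref(), options)?;
    writer.write(batch)?; // buffers inside each encoded message are compressed
    writer.finish()
}
```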

To further enhance spilling execution, we need to investigate:

  • the CPU / I/O tradeoff when zstd or lz4_frame compression is enabled, i.e. the compression ratio achieved and the extra latency spent on compression (see the measurement sketch after this list)
  • how much a single batch benefits from compression: the current Arrow IPC stream writer always writes one batch at a time in append_batch, and it is not yet clear how well individual batches compress
  • whether we need a separate Writer or Reader implementation instead of the IPC Stream Writer
  • how to introduce some form of adaptiveness, e.g. enabling compression only when it pays off
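
A rough sketch of the first two measurements, again assuming arrow-rs's `IpcWriteOptions::try_with_compression`: encode the same batch with each setting into an in-memory buffer and compare encoded size and encode time. The toy single-column batch here is only for illustration; real spilled batches will compress differently.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

/// Encode `batch` into memory and return (encoded size, encode time).
fn encode(batch: &RecordBatch, compression: Option<CompressionType>) -> Result<(usize, Duration)> {
    let options = IpcWriteOptions::default().try_with_compression(compression)?;
    let start = Instant::now();
    let mut buf = Vec::new();
    let mut writer =
        StreamWriter::try_new_with_options(&mut buf, batch.schema().as_ref(), options)?;
    writer.write(batch)?;
    writer.finish()?;
    drop(writer); // release the borrow of `buf`
    Ok((buf.len(), start.elapsed()))
}

fn main() -> Result<()> {
    // Toy batch: 8192 repetitive i64 values, so it compresses well.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let values = Int64Array::from_iter_values((0..8192).map(|i| i % 100));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(values) as ArrayRef])?;
    for (name, c) in [
        ("none", None),
        ("zstd", Some(CompressionType::ZSTD)),
        ("lz4_frame", Some(CompressionType::LZ4_FRAME)),
    ] {
        let (bytes, took) = encode(&batch, c)?;
        println!("{name:>9}: {bytes} bytes, {took:?}");
    }
    Ok(())
}
```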

Describe the solution you'd like

First, we need to track (or update) how many bytes are written to spill files. DataFusion currently tracks spilled_bytes as part of SpillMetrics, but it is computed from the in-memory array size, which can differ from the actual spill file size, especially when spill files are compressed. One way to capture the on-disk size is sketched below.
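
The sketch uses a hypothetical `CountingWriter` (not DataFusion's actual metrics plumbing): wrap the spill file in a writer that counts the bytes that actually reach it, so the count reflects the post-compression size. Since `StreamWriter` is generic over `std::io::Write`, such a wrapper can sit between it and the `File` without touching the IPC code.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Wraps the spill file and counts bytes actually written to it,
/// so the metric reflects on-disk (possibly compressed) size.
struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: Arc<AtomicUsize>, // could be shared with e.g. SpillMetrics
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.bytes_written.fetch_add(n, Ordering::Relaxed);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```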

Second, update the existing benchmarks or write separate ones to characterize performance. One option is to write spill-related metrics to output.json when running benchmarks such as tpch with the debug option. Another is to generate some spill files for a microbenchmark that exercises only the spill write-read path; a sketch of timing the read side follows.
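
For the read half of such a microbenchmark, a sketch that times decoding an already-encoded stream (e.g. the buffer produced by the encoding sketch above) with arrow-rs's `StreamReader`:

```rust
use std::io::Cursor;
use std::time::{Duration, Instant};

use arrow::error::Result;
use arrow::ipc::reader::StreamReader;

/// Time decoding an encoded spill stream back into record batches.
fn time_read_back(encoded: &[u8]) -> Result<Duration> {
    let start = Instant::now();
    let reader = StreamReader::try_new(Cursor::new(encoded), None)?;
    for batch in reader {
        // `black_box` keeps the decoded batch from being optimized away.
        std::hint::black_box(batch?);
    }
    Ok(start.elapsed())
}
```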

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement (New feature or request)