Improve Spill Performance: mmap the spill files #15321

@alamb

Is your feature request related to a problem or challenge?

Today, when DataFusion spills files to disk, it uses the Arrow IPC format.

Here is the code:

pub(crate) fn spill_record_batches(
    batches: &[RecordBatch],
    path: PathBuf,
    schema: SchemaRef,
) -> Result<(usize, usize)> {
    let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.finish()?;
    debug!(
        "Spilled {} batches of total {} rows to disk, memory released {}",
        writer.num_batches,
        writer.num_rows,
        human_readable_size(writer.num_bytes),
    );
    Ok((writer.num_rows, writer.num_bytes))
}

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}

The IPC reader currently reads the spill files using regular file I/O, which copies their contents into memory.

It is possible to use mmap to map the contents of the files into memory without copying them (zero copy). Here is an example of how to do so:

https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs
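To make the difference concrete, here is a minimal sketch of the two read paths at the byte level, not DataFusion's or arrow-rs's actual code. `MappedFile` and `read_copy` are hypothetical names introduced for illustration. To stay dependency-free, the sketch declares the POSIX `mmap`/`munmap` functions by hand and assumes a 64-bit Linux or macOS target (and a toolchain that accepts `unsafe extern` blocks); real code would more likely use an mmap crate and hand the mapped bytes to the Arrow IPC decoder, as the linked example does.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::ops::Deref;
use std::os::raw::{c_int, c_void};
use std::os::unix::io::AsRawFd;
use std::path::Path;

// Hand-written POSIX bindings so the sketch needs no external crates.
// Assumption: 64-bit Linux or macOS, where `off_t` is `i64` and the
// constant values below hold.
unsafe extern "C" {
    fn mmap(
        addr: *mut c_void,
        len: usize,
        prot: c_int,
        flags: c_int,
        fd: c_int,
        offset: i64,
    ) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}
const PROT_READ: c_int = 1;
const MAP_PRIVATE: c_int = 2;

/// Hypothetical helper: a read-only memory mapping of an entire file.
pub struct MappedFile {
    ptr: *mut c_void,
    len: usize,
}

impl MappedFile {
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = File::open(path)?;
        let len = file.metadata()?.len() as usize;
        if len == 0 {
            // mmap rejects zero-length mappings.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "cannot map an empty file",
            ));
        }
        // SAFETY: the fd is valid and `len` is non-zero; the result is
        // checked against MAP_FAILED (all bits set) before use.
        let ptr = unsafe {
            mmap(std::ptr::null_mut(), len, PROT_READ, MAP_PRIVATE, file.as_raw_fd(), 0)
        };
        if ptr as isize == -1 {
            return Err(io::Error::last_os_error());
        }
        // The mapping remains valid after `file` is closed on return.
        Ok(Self { ptr, len })
    }
}

impl Deref for MappedFile {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        // SAFETY: `ptr` points to `len` readable bytes until `drop` runs.
        // The caller must not truncate the file while it is mapped.
        unsafe { std::slice::from_raw_parts(self.ptr as *const u8, self.len) }
    }
}

impl Drop for MappedFile {
    fn drop(&mut self) {
        // SAFETY: `ptr` and `len` come from a successful `mmap` call.
        unsafe {
            munmap(self.ptr, self.len);
        }
    }
}

/// The copying path: every byte of the file is read into a heap buffer.
pub fn read_copy(path: &Path) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}
```

With `read_copy`, the whole file is copied into a `Vec<u8>` before anything can be decoded. With `MappedFile`, the returned `&[u8]` points directly at the pages the OS maps from the file, which are loaded lazily on first access. Whether that is actually faster for spill files is exactly what this issue proposes to measure.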

Describe the solution you'd like

I would like to see whether using mmap to read the spill files back in is faster than the current file I/O approach.

Describe alternatives you've considered

  1. Use mmap to read spill files
  2. Add / use a benchmark showing the performance benefit of doing this

Additional context

Metadata

Labels

enhancement: New feature or request
