Databricks Delta Lake provides an `OPTIMIZE` command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.
I think a first pass could ignore the bin-packing and Z-order features provided by Databricks and simply combine small files into larger files, setting `dataChange: false` in the delta log entry. An MVP might look like:
- Query the delta log for file stats relevant to the current optimize run (based on a predicate)
- Group files that may be compacted by common partition membership and size smaller than the optimal target (1 GB)
- Combine files by looping through each input file and writing its record batches into the new output file (a rough sketch of this step follows below).
- Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag to false to prevent re-processing by stream consumers (see the example entry after this list).
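To make the `dataChange` handling concrete, a single compaction commit might contain JSON actions along these lines (paths, partition values, sizes, and timestamps are purely illustrative):

```json
{"add":{"path":"date=2021-04-01/part-00000-compacted.parquet","partitionValues":{"date":"2021-04-01"},"size":1073741824,"modificationTime":1617235200000,"dataChange":false}}
{"remove":{"path":"date=2021-04-01/part-00000-small.parquet","deletionTimestamp":1617235200000,"dataChange":false}}
{"remove":{"path":"date=2021-04-01/part-00001-small.parquet","deletionTimestamp":1617235200000,"dataChange":false}}
```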
Files no longer relevant to the log may be cleaned up later by vacuum (see #97).
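As a very rough sketch of the combine step, here is what the rewrite loop could look like using the `parquet`/`arrow` crates directly (this is not an existing delta-rs API; the function name and the local file paths are hypothetical, it assumes all inputs share a schema, and it ignores object store access and target file sizing):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

/// Rewrite a group of small Parquet files (from the same partition) into one larger file.
/// `inputs` and `output` are plain local paths for illustration only; a real
/// implementation would read and write through the table's storage backend.
fn compact_files(inputs: &[&str], output: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut writer: Option<ArrowWriter<File>> = None;

    for path in inputs {
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            let batch = batch?;
            // Lazily create the writer from the first batch's schema,
            // assuming every input file shares that schema.
            if writer.is_none() {
                writer = Some(ArrowWriter::try_new(File::create(output)?, batch.schema(), None)?);
            }
            writer.as_mut().unwrap().write(&batch)?;
        }
    }

    // Finalize the Parquet footer. The caller would then commit one add action
    // for `output` and a remove action per input, all with dataChange = false.
    if let Some(w) = writer {
        w.close()?;
    }
    Ok(())
}
```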