Labels: binding/rust (Issues for the Rust crate), enhancement (New feature or request)
Description
Databricks Delta Lake provides an `OPTIMIZE` command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.
I think a first pass could ignore the bin-packing and Z-order features provided by Databricks and simply combine small files into larger files, while also setting `dataChange: false` in the delta log entry. An MVP might look like:
- Query the delta log for file stats relevant to the current optimize run (based on predicate)
- Group files that may be compacted based on common partition membership and a size smaller than the optimal target (1 GB)
- Combine files by looping through each input file and writing its record batches into the new output file.
- Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the `dataChange` flag to false to prevent re-processing by stream consumers.
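For illustration, the committed log entry might contain actions like the following (one JSON action per line, following the Delta transaction log format; the paths, partition values, sizes, and timestamps here are hypothetical):

```json
{"add":{"path":"part-00000-compacted.snappy.parquet","partitionValues":{"date":"2021-01-01"},"size":1073741824,"modificationTime":1612345678000,"dataChange":false}}
{"remove":{"path":"part-00000-small.snappy.parquet","deletionTimestamp":1612345678000,"dataChange":false}}
{"remove":{"path":"part-00001-small.snappy.parquet","deletionTimestamp":1612345678000,"dataChange":false}}
```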
Files no longer relevant to the log may be cleaned up later by vacuum (see #97)
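The grouping step above (partition membership plus a size cutoff) could be sketched roughly like this. Note this is a minimal sketch using only the standard library; `FileStats`, `plan_compaction`, and `TARGET_SIZE` are illustrative names, not the actual delta-rs API:

```rust
use std::collections::HashMap;

/// Hypothetical per-file stats pulled from the delta log
/// (illustrative struct, not a real delta-rs type).
#[derive(Debug, Clone)]
struct FileStats {
    path: String,
    partition: String, // serialized partition values, e.g. "date=2021-01-01"
    size: u64,         // bytes
}

/// 1 GB "optimal" file size from the issue description.
const TARGET_SIZE: u64 = 1024 * 1024 * 1024;

/// Group undersized files by partition, then greedily pack each
/// partition's files into compaction groups near TARGET_SIZE.
fn plan_compaction(files: &[FileStats]) -> Vec<Vec<FileStats>> {
    let mut by_partition: HashMap<&str, Vec<&FileStats>> = HashMap::new();
    for f in files.iter().filter(|f| f.size < TARGET_SIZE) {
        by_partition.entry(&f.partition).or_default().push(f);
    }

    let mut groups = Vec::new();
    for (_, mut part_files) in by_partition {
        // Largest-first makes the simple greedy packing a bit tighter.
        part_files.sort_by(|a, b| b.size.cmp(&a.size));
        let mut current: Vec<FileStats> = Vec::new();
        let mut current_size = 0u64;
        for f in part_files {
            if current_size + f.size > TARGET_SIZE && !current.is_empty() {
                groups.push(std::mem::take(&mut current));
                current_size = 0;
            }
            current.push(f.clone());
            current_size += f.size;
        }
        // Compacting a single file gains nothing; skip singleton groups.
        if current.len() > 1 {
            groups.push(current);
        }
    }
    groups
}
```

Each returned group would then feed the combine step: read every input file's record batches and stream them into one new output file.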