
Implement optimize command #98

Closed
@xianwill

Description

Databricks Delta Lake provides an OPTIMIZE command that is extremely useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.

I think a first pass could ignore the bin-packing and Z-order features provided by Databricks and simply combine small files into larger files, while also setting dataChange: false in the delta log entry. An MVP might look like:

  • Query the delta log for file stats relevant to the current optimize run (based on a predicate).
  • Group files that may be compacted based on common partition membership and a size smaller than the optimal target (1 GB).
  • Combine files by looping through each input file and writing its record batches into the new output file.
  • Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag to false to prevent re-processing by stream consumers.
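The planning part of the steps above (query stats, group by partition, keep only under-sized files) can be sketched in plain Rust. This is a minimal illustration, not delta-rs API: `AddFile`, `plan_compaction`, and the flat `partition` string are hypothetical stand-ins for the real log stats and multi-column partition values.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the file metadata stored in the Delta log.
#[derive(Debug, Clone)]
struct AddFile {
    path: String,
    partition: String, // e.g. "date=2021-01-01"; real tables key on multiple columns
    size_bytes: u64,
}

/// Optimal file size per the issue description: 1 GB.
const TARGET_SIZE: u64 = 1024 * 1024 * 1024;

/// Group files eligible for compaction: same partition, smaller than the
/// target size, and more than one file in the group (a lone small file
/// gains nothing from being rewritten).
fn plan_compaction(files: &[AddFile]) -> Vec<Vec<AddFile>> {
    let mut by_partition: HashMap<&str, Vec<AddFile>> = HashMap::new();
    for f in files {
        if f.size_bytes < TARGET_SIZE {
            by_partition.entry(f.partition.as_str()).or_default().push(f.clone());
        }
    }
    by_partition
        .into_values()
        .filter(|group| group.len() > 1)
        .collect()
}

fn main() {
    let files = vec![
        AddFile { path: "a.parquet".into(), partition: "date=2021-01-01".into(), size_bytes: 10_000 },
        AddFile { path: "b.parquet".into(), partition: "date=2021-01-01".into(), size_bytes: 20_000 },
        AddFile { path: "c.parquet".into(), partition: "date=2021-01-02".into(), size_bytes: 2 * TARGET_SIZE },
    ];
    let plan = plan_compaction(&files);
    println!("{} group(s) to compact", plan.len()); // prints: 1 group(s) to compact
}
```

Each resulting group would then be streamed into a single output file and committed as one add plus N removes, as described above.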

Files no longer relevant to the log may be cleaned up later by vacuum (see #97).
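For illustration, the commit in the final step might look like the following log entry (per the Delta transaction protocol, each action is a JSON object on its own line; the paths, partition values, and timestamps here are hypothetical):

```json
{"add":{"path":"date=2021-01-01/part-00000-compacted.parquet","partitionValues":{"date":"2021-01-01"},"size":1073741824,"modificationTime":1611111111000,"dataChange":false}}
{"remove":{"path":"date=2021-01-01/part-00000-small.parquet","deletionTimestamp":1611111111000,"dataChange":false}}
{"remove":{"path":"date=2021-01-01/part-00001-small.parquet","deletionTimestamp":1611111111000,"dataChange":false}}
```

Because dataChange is false on every action, stream consumers can skip this version entirely rather than re-reading the rewritten data.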

Labels

binding/rust (Issues for the Rust crate), enhancement (New feature or request)
