Databricks Delta Lake provides an `OPTIMIZE` command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.
I think a first pass could ignore the bin-packing and Z-order features provided by Databricks and simply combine small files into larger files, setting `dataChange: false` in the delta log entry. An MVP might look like:
- Query the delta log for file stats relevant to the current optimize run (based on a predicate)
- Group files that may be compacted by common partition membership and size smaller than the optimal target (1 GB)
- Combine files by looping through each input file and writing its record batches into the new output file (a rough sketch of this step follows below).
- Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag to false to prevent re-processing by stream consumers (see the example entry after this list).
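To make the `dataChange` handling concrete, a single compaction commit might contain JSON actions along these lines (paths, partition values, sizes, and timestamps are purely illustrative):

```json
{"add":{"path":"date=2021-04-01/part-00000-compacted.parquet","partitionValues":{"date":"2021-04-01"},"size":1073741824,"modificationTime":1617235200000,"dataChange":false}}
{"remove":{"path":"date=2021-04-01/part-00000-small.parquet","deletionTimestamp":1617235200000,"dataChange":false}}
{"remove":{"path":"date=2021-04-01/part-00001-small.parquet","deletionTimestamp":1617235200000,"dataChange":false}}
```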
Files no longer relevant to the log may be cleaned up later by vacuum (see #97).
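As a very rough sketch of the combine step, here is what the rewrite loop could look like using the `parquet`/`arrow` crates directly (this is not an existing delta-rs API; the function name and the local file paths are hypothetical, it assumes all inputs share a schema, and it ignores object store access and target file sizing):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

/// Rewrite a group of small Parquet files (from the same partition) into one larger file.
/// `inputs` and `output` are plain local paths for illustration only; a real
/// implementation would read and write through the table's storage backend.
fn compact_files(inputs: &[&str], output: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut writer: Option<ArrowWriter<File>> = None;

    for path in inputs {
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            let batch = batch?;
            // Lazily create the writer from the first batch's schema,
            // assuming every input file shares that schema.
            if writer.is_none() {
                writer = Some(ArrowWriter::try_new(File::create(output)?, batch.schema(), None)?);
            }
            writer.as_mut().unwrap().write(&batch)?;
        }
    }

    // Finalize the Parquet footer. The caller would then commit one add action
    // for `output` and a remove action per input, all with dataChange = false.
    if let Some(w) = writer {
        w.close()?;
    }
    Ok(())
}
```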