Implement write.metadata.delete-after-commit.enabled to clean up old metadata files#1607
Conversation
Fokko
left a comment
There was a problem hiding this comment.
Thanks for adding this @kaushiksrini 🙌 This looks great.
I have one minor style comment. Could you also add this to the docs? Thanks!
tests/catalog/test_sql.py
Outdated
| transaction = table.transaction() | ||
| update = transaction.update_schema() | ||
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) | ||
| update.commit() | ||
| transaction.commit_transaction() |
There was a problem hiding this comment.
Could you use the context managers? I think that's easier to read
| transaction = table.transaction() | |
| update = transaction.update_schema() | |
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) | |
| update.commit() | |
| transaction.commit_transaction() | |
| with table.transaction() as transaction: | |
| with transaction.update_schema() as update: | |
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) |
|
@Fokko thanks! used context managers and added to the documentation |
| | `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.metadata.delete-after-commit.enabled` | Boolean | False | Whether to automatically delete old *tracked* metadata files after each table commit. It will retain a number of the most recent metadata files, which can be set using property `write.metadata.previous-versions-max`. | |
There was a problem hiding this comment.
pyiceberg/catalog/__init__.py
Outdated
| if current_table is not None: | ||
| self._delete_old_metadata(io, current_table.metadata, updated_metadata) |
There was a problem hiding this comment.
is this the right place for this operation?
looking at the java implementation, changes are committed to the table before old metadata files are deleted
https://github.com/apache/iceberg/blob/d935460bbed5da0ac205a67912b70fa682ee84df/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L125-L126
At this point, changes are only applied to the table metadata but is not yet committed
There was a problem hiding this comment.
the java implementation uses deleteAfterCommit
kevinjqliu
left a comment
There was a problem hiding this comment.
_delete_old_metadata should be after table is committed
|
@kevinjqliu thanks for the review! I moved the logic after the table is committed |
kevinjqliu
left a comment
There was a problem hiding this comment.
Generally LGTM! Thanks for the PR. I think _do_commit is the right place to add this.
pyiceberg/table/__init__.py
Outdated
|
|
||
| # https://github.com/apache/iceberg/blob/f6faa58/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L527 | ||
| # delete old metadata if METADATA_DELETE_AFTER_COMMIT_ENABLED is set to true | ||
| self.catalog._delete_old_metadata(self.io, self.metadata, response.metadata) |
There was a problem hiding this comment.
can we add a comment here explaining how METADATA_PREVIOUS_VERSIONS_MAX is taken into account?
There was a problem hiding this comment.
also maybe we want to wrap this in try/catch and throw a warning as to not block the commit process
There was a problem hiding this comment.
added a try-catch, but the deleteFiles already has a warning for actually deleting the files. I could only see an exception arise from _delete_old_metadata's operations.
iceberg-python/pyiceberg/catalog/__init__.py
Lines 254 to 269 in dd175aa
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for following up
Fokko
left a comment
There was a problem hiding this comment.
I've tested this locally, and it works great. Thanks @kaushiksrini for working on this, and thanks @kevinjqliu for the review 👍
|
@Fokko thanks for the review! and @kevinjqliu thanks for making the changes + review! |
|
Thanks @kaushiksrini I applied the simple test changes via github. Thanks @Fokko for the review |
write.metadata.delete-after-commit.enabled to clean up old metadata files
write.metadata.delete-after-commit.enabled to clean up old metadata fileswrite.metadata.delete-after-commit.enabled to clean up old metadata files
Implements property
write.metadata.delete-after-commit.enabledfrom https://iceberg.apache.org/docs/1.5.1/maintenance/#remove-old-metadata-files.Closes #1199