
Add cleanup support for partition-level statistics files when DROP TABLE PURGE #1508


Merged
merged 3 commits into apache:main on May 9, 2025

Conversation

danielhumanmod
Contributor

Motivation

This is a follow-up PR to #312.

Previously, when DROP TABLE PURGE was issued, Polaris cleaned up data files, manifest files, and metadata files, but did not clean up partition-level statistics files.

Current Behavior

Partition statistics files (partition_stats) remain in storage after the table is dropped. These files are listed in the TableMetadata but are not included in the batch deletion task, resulting in orphaned files.

Changes Introduced

  • Added support for including partitionStatisticsFiles from TableMetadata in the batch cleanup task (BatchFileCleanupTaskHandler).
  • Updated getMetadataFileBatches() to collect and batch partition statistics files for deletion (a sketch follows this list).
  • Added test coverage in TableCleanupTaskHandlerTest and BatchFileCleanupTaskHandlerTest to verify:
    • partitionStats files are scheduled for deletion
    • they are correctly deleted by the task handler
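A minimal sketch of this batching approach (illustrative only; it assumes Iceberg's TableMetadata accessors and a caller-supplied batchSize, and is not the exact Polaris implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.TableMetadata;

// Collect every non-data file referenced by the table metadata (previous metadata
// files, table-level stats, partition-level stats) and split the resulting paths
// into fixed-size batches for the cleanup task.
static List<List<String>> metadataFileBatches(TableMetadata tableMetadata, int batchSize) {
  List<String> files =
      Stream.of(
              tableMetadata.previousFiles().stream().map(TableMetadata.MetadataLogEntry::file),
              tableMetadata.statisticsFiles().stream().map(StatisticsFile::path),
              tableMetadata.partitionStatisticsFiles().stream()
                  .map(PartitionStatisticsFile::path))
          .flatMap(s -> s)
          .toList();

  List<List<String>> batches = new ArrayList<>();
  for (int i = 0; i < files.size(); i += batchSize) {
    batches.add(files.subList(i, Math.min(i + batchSize, files.size())));
  }
  return batches;
}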

Desired Outcome

After a DROP TABLE PURGE, all Iceberg table metadata files, including partition-level statistics files, are cleaned up as expected.

tableMetadata.statisticsFiles().stream().map(StatisticsFile::path),
tableMetadata.partitionStatisticsFiles().stream()
.map(PartitionStatisticsFile::path))
.flatMap(s -> s)
Collaborator

What does this line achieve? Does tableMetadata.partitionStatisticsFiles() return back a Stream of Streams?

Contributor Author

@danielhumanmod May 3, 2025

Good question.

  1. tableMetadata.partitionStatisticsFiles() returns a List<PartitionStatisticsFile>, and mapping with PartitionStatisticsFile::path turns it into a Stream<String>.
  2. The reason for this line: each call to .stream().map(...) produces a Stream<String>, so Stream.of(...) gives us a Stream<Stream<String>>. The .flatMap(s -> s) is there to flatten it all into one Stream before filtering.
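To illustrate the pattern in isolation, a standalone sketch with made-up paths (not the actual Polaris code):

import java.util.List;
import java.util.stream.Stream;

class FlattenExample {
  public static void main(String[] args) {
    // Each .stream() below is a Stream<String>; Stream.of(...) wraps them into a
    // Stream<Stream<String>>, and flatMap(s -> s) flattens that back into one Stream<String>.
    List<String> statsPaths = List.of("s3://bucket/metadata/stats-1.puffin");
    List<String> partitionStatsPaths = List.of("s3://bucket/metadata/partition-stats-1.parquet");

    List<String> allPaths =
        Stream.of(statsPaths.stream(), partitionStatsPaths.stream())
            .flatMap(s -> s)
            .toList();

    System.out.println(allPaths); // both paths end up in a single flat list
  }
}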

Collaborator

Ahh that makes sense - I missed the change to Stream.of().


@@ -243,12 +243,13 @@ private Stream<TaskEntity> getMetadataTaskStream(
private List<List<String>> getMetadataFileBatches(TableMetadata tableMetadata, int batchSize) {
List<List<String>> result = new ArrayList<>();
List<String> metadataFiles =
Contributor

@eric-maynard May 3, 2025

Is it really accurate to call puffin files metadata files, and is it necessarily correct to group all of these together? I guess the intent here is to collect all of the not-data files?

Contributor Author

@danielhumanmod May 3, 2025

Is it really accurate to call puffin files metadata files, and is it necessarily correct to group all of these together? I guess the intent here is to collect all of the not-data files?

Good point — the original intention was to group files under the metadata/ directory to reduce the overhead of scheduling tasks. As more file types like stats and partition stats were added later, nonDataFiles (or something similar) might now better reflect what’s being collected.

Curious to hear your thoughts — would it be clearer to separate them, or is the performance benefit of grouping still preferred?

Contributor

@eric-maynard May 3, 2025

I see... continuing to separate out the data files makes sense to me.

I think this came up on one of the previous PRs, but the real solution here needs to eventually involve moving this purge work out of the catalog server and into the maintenance service where we handle compaction etc. That's the only way to really achieve scalability.

@@ -112,7 +113,6 @@ public boolean handleTask(TaskEntity cleanupTask, CallContext callContext) {
metaStoreManager,
polarisCallContext);

// TODO: handle partition statistics files
Contributor

It looks like these weren't left out due to an oversight but rather they were intentionally excluded. I'm curious if there is any background on why that is -- is there some specific pitfall related to cleaning up the partition stats?

Contributor Author

@danielhumanmod May 3, 2025

It looks like these weren't left out due to an oversight but rather they were intentionally excluded. I'm curious if there is any background on why that is -- is there some specific pitfall related to cleaning up the partition stats?

Good catch — to the best of my knowledge, Polaris's DROP TABLE PURGE currently has a gap here compared to Iceberg's implementation, for reasons I don't know either (curious too).

Iceberg deletes all file types under the metadata/ directory, including manifests, manifest lists, metadata files, previous metadata, and .stats files (both table- and partition-level). Iceberg code pointer: CatalogUtil.java#L124 for reference.

This gap was also discussed earlier in this issue comment.

Happy to learn more if there’s additional context I missed.
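For reference, a minimal sketch of how Iceberg's own purge entry point in the linked CatalogUtil is typically invoked (illustrative; the wrapper method name here is made up):

import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.io.FileIO;

// Iceberg's built-in purge walks the given TableMetadata and deletes data files,
// manifests, manifest lists, previous metadata, statistics and partition statistics
// files, and finally the current metadata file.
void purgeLikeIceberg(FileIO io, TableMetadata lastMetadata) {
  CatalogUtil.dropTableData(io, lastMetadata);
}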

Contributor

I see, interesting. Thanks for sharing that link and that context.

eric-maynard
eric-maynard previously approved these changes May 3, 2025
Contributor

@eric-maynard left a comment

LGTM, it makes sense to include these in the result of getMetadataFileBatches

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 3, 2025
Stream.of(
secondMetadata.previousFiles().stream().map(TableMetadata.MetadataLogEntry::file),
secondMetadata.statisticsFiles().stream().map(StatisticsFile::path),
firstMetadata.partitionStatisticsFiles().stream()
Member

Isn't secondMetadata.partitionStatisticsFiles() alone enough here, since it contains the entries for all the snapshots?

similar to the existing statisticsFiles() usage.

long snapshotId, long snapshotSequenceNumber, String statsLocation, FileIO fileIO)
throws IOException {

try (PuffinWriter puffinWriter = Puffin.write(fileIO.newOutputFile(statsLocation)).build()) {
Member

A partition stats file is not exactly a Puffin file.

It can be a file in the table's default format, like Parquet, Avro, or ORC.
Maybe we can write a dummy Avro or Parquet file.

more context from Iceberg repo: https://github.com/apache/iceberg/blob/696a72c0f88c3af1096e716b196f1609da34e50d/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestExpireSnapshotsProcedure.java#L500

Contributor

From reading the test, it looks like anything can work so long as it's picked up by TableMetadata::partitionStatisticsFiles, is that right?

Member

Yes, anything can work. I just don't want to leave the impression that it is a Puffin file. We can create a dummy file with a .parquet extension, as I shared above. It is very minimal code.

Contributor Author

Yes, anything can work. I just don't want to leave the impression that it is a Puffin file. We can create a dummy file with a .parquet extension, as I shared above. It is very minimal code.

Thanks for catching it. I have adjusted the write impl to use Parquet as the file format for the partition stats file, following the impl in https://github.com/apache/iceberg/blob/main/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ProcedureUtil.java#L33
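For reference, a minimal sketch of that test-helper pattern (adapted from the linked ProcedureUtil; the helper name and the use of ImmutableGenericPartitionStatisticsFile are assumptions, not the exact Polaris test code):

import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.iceberg.ImmutableGenericPartitionStatisticsFile;
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.PositionOutputStream;

// Write an empty placeholder file at a .parquet location and register it as the
// partition statistics file for the given snapshot; for purge tests only the path
// recorded in TableMetadata matters, not the file content.
static PartitionStatisticsFile writeDummyPartitionStatsFile(
    long snapshotId, String statsLocation, FileIO fileIO) {
  try (PositionOutputStream out = fileIO.newOutputFile(statsLocation).create()) {
    // intentionally left empty: a zero-byte dummy file is enough for cleanup tests
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }

  return ImmutableGenericPartitionStatisticsFile.builder()
      .snapshotId(snapshotId)
      .path(statsLocation)
      .fileSizeInBytes(0L)
      .build();
}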

Member

@ajantha-bhat left a comment

LGTM

@eric-maynard eric-maynard merged commit e276791 into apache:main May 9, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 9, 2025
snazy added a commit to snazy/polaris that referenced this pull request May 22, 2025
* Add cleanup support for partition-level statistics files when `DROP TABLE PURGE` (apache#1508)

* cleaning up partition stats

* update partition stat file extension

* update test partition stat write impl
