Skip to content

Core: Keep track of data files to be removed for orphaned DV detection #13222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Jun 3, 2025

This keeps track of all data files to be removed/rewritten in MergingSnapshotProducer and passes those to the ManifestFilterManager for deletes. Once ManifestFilterManager goes through delete manifests, it also checks whether a DV references any of the data files to be removed.
This is needed in addition to #13245 so that we can properly remove orphaned DVs when e.g. a metadata-only delete is performed.

The below benchmark shows that tracking data files to be removed and then detecting orphaned DVs when delete manifests are looked at is only adding a small fraction to the throughput.

without tracking data files to be removed

Benchmark                                   (numFiles)  (percentDataFilesRewritten)  Mode  Cnt   Score   Error  Units
RewriteDataFilesBenchmark.rewriteDataFiles       50000                            5    ss    5   0.497 ± 0.061   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           25    ss    5   0.649 ± 0.080   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           50    ss    5   1.889 ± 0.096   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                          100    ss    5   2.093 ± 0.125   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                            5    ss    5   0.503 ± 0.040   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           25    ss    5   1.941 ± 0.154   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           50    ss    5   2.139 ± 0.165   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                          100    ss    5   2.474 ± 0.149   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                            5    ss    5   1.054 ± 0.067   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           25    ss    5   2.577 ± 0.247   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           50    ss    5   3.318 ± 1.121   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                          100    ss    5   5.792 ± 1.725   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                            5    ss    5   1.352 ± 0.122   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           25    ss    5   3.252 ± 0.325   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           50    ss    5   4.887 ± 0.548   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                          100    ss    5   8.297 ± 1.991   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                            5    ss    5   2.536 ± 0.232   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           25    ss    5   5.227 ± 1.042   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           50    ss    5   7.545 ± 2.052   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                          100    ss    5  18.058 ± 4.773   s/op

with tracking data files to be removed

RewriteDataFilesBenchmark.rewriteDataFiles       50000                            5    ss    5   0.626 ± 0.080   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           25    ss    5   0.813 ± 0.081   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                           50    ss    5   2.013 ± 0.037   s/op
RewriteDataFilesBenchmark.rewriteDataFiles       50000                          100    ss    5   2.316 ± 0.137   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                            5    ss    5   0.676 ± 0.036   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           25    ss    5   2.036 ± 0.105   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                           50    ss    5   2.125 ± 0.107   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      100000                          100    ss    5   2.856 ± 0.138   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                            5    ss    5   1.319 ± 0.049   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           25    ss    5   3.314 ± 0.265   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                           50    ss    5   4.219 ± 0.498   s/op
RewriteDataFilesBenchmark.rewriteDataFiles      500000                          100    ss    5   6.164 ± 0.736   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                            5    ss    5   2.010 ± 0.224   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           25    ss    5   3.961 ± 0.137   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                           50    ss    5   5.771 ± 0.839   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     1000000                          100    ss    5   9.082 ± 0.869   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                            5    ss    5   3.772 ± 0.602   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           25    ss    5   6.558 ± 0.939   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                           50    ss    5   9.179 ± 1.330   s/op
RewriteDataFilesBenchmark.rewriteDataFiles     2000000                          100    ss    5  23.623 ± 8.387   s/op

@nastra nastra marked this pull request as draft June 3, 2025 07:39
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from a41fe6e to 13b0f91 Compare June 3, 2025 09:04
@github-actions github-actions bot added the flink label Jun 3, 2025
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch 2 times, most recently from 5dbb131 to f82cbdc Compare June 3, 2025 13:45
@nastra nastra changed the title Core: Remove DV from manifest when removing data file Core: Keep track of data files to be removed for orphaned DV detection Jul 3, 2025
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from f82cbdc to 4dd4a3d Compare July 3, 2025 06:13
@github-actions github-actions bot added the API label Jul 3, 2025
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from 4dd4a3d to 2d8ea9a Compare July 3, 2025 06:19
@nastra nastra marked this pull request as ready for review July 3, 2025 07:11
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from 2d8ea9a to 0338bfe Compare July 3, 2025 07:44
@github-actions github-actions bot removed the API label Jul 3, 2025
@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from 0338bfe to 4d8fd4c Compare July 3, 2025 07:47
@nastra nastra closed this Jul 3, 2025
@nastra nastra reopened this Jul 3, 2025
@@ -224,7 +235,9 @@ List<ManifestFile> filterManifests(Schema tableSchema, List<ManifestFile> manife
private boolean canTrustManifestReferences(List<ManifestFile> manifests) {
Set<String> manifestLocations =
manifests.stream().map(ManifestFile::path).collect(Collectors.toSet());
return allDeletesReferenceManifests && manifestLocations.containsAll(manifestsWithDeletes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amogh-jahagirdar manifestLocations.containsAll(manifestsWithDeletes) was previously always true even though manifestsWithDeletes was empty, hence I've added a non-empty check here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra I think the additional check is fine but is there a situation where allDeletesReferenceManifests is true and the manifestsWithDeletes is empty? Every delete operation would either invalidate allDeletesReferenceManifests or if it's a delte(File f) where the file has defined a manifest location we'd add to the manifestLocations set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while debugging things locally I've actually ran into such a situation. I actually forgot which test it was exactly but that was one of the reason why I added this additional non-empty check

Copy link
Contributor

@stevenzwu stevenzwu Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manifestLocations.containsAll(manifestsWithDeletes) was previously always true even though manifestsWithDeletes was empty

Why is this a problem?

is there a situation where allDeletesReferenceManifests is true and the manifestsWithDeletes is empty?

To Amogh's question, this would be the case when there is no delete operation. In this case, should canTrustManifestReferences return true or false?

Copy link
Contributor

@stevenzwu stevenzwu Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boolean trustManifestReferences is used by canContainDeletedFiles method.

When trustManifestReferences is true (old behavior), this code would return false since manifestsWithDeletes is empty.

    if (trustManifestReferences) {
      return manifestsWithDeletes.contains(manifest.path());
    }

When trustManifestReferences is false (new behavior), the above return will be skipped. So it went on to execute this block.

    return canContainDroppedFiles(manifest)
        || canContainExpressionDeletes(manifest)
        || canContainDroppedPartitions(manifest);

The above block would always be false when allDeletesReferenceManifests is true, as any of these conditions would render allDeletesReferenceManifests false.

It seems that both old and new behaviors are essentially the same when allDeletesReferenceManifests is true and the manifestsWithDeletes is empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good example that makes it clearer is when running TestDeleteFiles.removingDataFileByPathAlsoRemovesDV (or any other new tests that I added). When we hit canTrustManifestReferences() the allDeletesReferenceManifests flag is set to true, the manifestLocations contain one entry but manifestsWithDeletes is actually empty.
This results in canTrustManifestReferences being evaluated to true, which will then exit early in

if (!canContainDeletedFiles(manifest, trustManifestReferences)) {
filteredManifests.put(manifest, manifest);
return manifest;
}

which is because it also exited early in
if (trustManifestReferences) {
return manifestsWithDeletes.contains(manifest.path());
}

The issue there is that manifestsWithDeletes in that line is still empty and so this condition there can never be true. IMO trustManifestReferences should be false in that case because manifestsWithDeletes was empty.

The above block would always be false when allDeletesReferenceManifests is true, as any of these conditions would render allDeletesReferenceManifests false.

Not necessarily. While this might be true for the DataFileFilterManager it's slightly different for the DeleteFileFilterManager. Just assume you're deleting a DataFile, which renders allDeletesReferenceManifests to false in DataFileFilterManager. In the DeleteFileFilterManager allDeletesReferenceManifests is true by default.
I haven't explored this, but maybe we would need to set allDeletesReferenceManifests = false as well when removeDanglingDeletesFor() is being called but I think adding this additional non-empty check here still makes sense to me.

I'll go ahead and merge the PR to unblock the release. If necessary we can follow up here afterwards.

@stevenzwu stevenzwu added this to the Iceberg 1.10.0 milestone Jul 3, 2025
Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some early comments

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something but I think we should double check path based/partition drops/removals by filters, I'm not sure we're capturing those data file entries for passing to the logic to remove orphan DVs for.

@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch 3 times, most recently from a07ae28 to 1233d04 Compare July 4, 2025 12:04
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just some nits, thank @nastra !

F copyWithoutStats = file.copyWithoutStats();
// add the file here in case it was deleted using an expression. The
// DeleteManifestFilterManager will then remove its matching DV
deleteFiles.add(copyWithoutStats);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. If an entry is marked for delete or passes strict evaluation, then we're marking it for delete and we're adding to the tracking set of files for which we have to hunt down and cleanup orphans for.

@nastra nastra force-pushed the remove-dv-from-manifest-when-removing-data-file branch from 1233d04 to 3e6decc Compare July 7, 2025 17:23
Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just have a question on the canTrustManifestReferences condition change. please check if my understanding is correct.

@nastra
Copy link
Contributor Author

nastra commented Jul 8, 2025

thanks for the reviews @amogh-jahagirdar and @stevenzwu. I'll go ahead and merge this to unblock the release and we can follow up on #13222 (comment) if necessary

@nastra nastra merged commit 061ae58 into apache:main Jul 8, 2025
42 checks passed
@nastra nastra deleted the remove-dv-from-manifest-when-removing-data-file branch July 8, 2025 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants