-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Core: Keep track of data files to be removed for orphaned DV detection #13222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Keep track of data files to be removed for orphaned DV detection #13222
Conversation
a41fe6e
to
13b0f91
Compare
5dbb131
to
f82cbdc
Compare
f82cbdc
to
4dd4a3d
Compare
4dd4a3d
to
2d8ea9a
Compare
2d8ea9a
to
0338bfe
Compare
0338bfe
to
4d8fd4c
Compare
@@ -224,7 +235,9 @@ List<ManifestFile> filterManifests(Schema tableSchema, List<ManifestFile> manife | |||
private boolean canTrustManifestReferences(List<ManifestFile> manifests) { | |||
Set<String> manifestLocations = | |||
manifests.stream().map(ManifestFile::path).collect(Collectors.toSet()); | |||
return allDeletesReferenceManifests && manifestLocations.containsAll(manifestsWithDeletes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amogh-jahagirdar manifestLocations.containsAll(manifestsWithDeletes)
was previously always true even though manifestsWithDeletes
was empty, hence I've added a non-empty check here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra I think the additional check is fine but is there a situation where allDeletesReferenceManifests
is true and the manifestsWithDeletes is empty? Every delete operation would either invalidate allDeletesReferenceManifests
or if it's a delte(File f) where the file has defined a manifest location we'd add to the manifestLocations set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while debugging things locally I've actually ran into such a situation. I actually forgot which test it was exactly but that was one of the reason why I added this additional non-empty check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manifestLocations.containsAll(manifestsWithDeletes) was previously always true even though manifestsWithDeletes was empty
Why is this a problem?
is there a situation where allDeletesReferenceManifests is true and the manifestsWithDeletes is empty?
To Amogh's question, this would be the case when there is no delete operation. In this case, should canTrustManifestReferences
return true or false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The boolean trustManifestReferences
is used by canContainDeletedFiles
method.
When trustManifestReferences
is true (old behavior), this code would return false since manifestsWithDeletes
is empty.
if (trustManifestReferences) {
return manifestsWithDeletes.contains(manifest.path());
}
When trustManifestReferences
is false (new behavior), the above return will be skipped. So it went on to execute this block.
return canContainDroppedFiles(manifest)
|| canContainExpressionDeletes(manifest)
|| canContainDroppedPartitions(manifest);
The above block would always be false when allDeletesReferenceManifests
is true, as any of these conditions would render allDeletesReferenceManifests
false.
It seems that both old and new behaviors are essentially the same when allDeletesReferenceManifests is true and the manifestsWithDeletes is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good example that makes it clearer is when running TestDeleteFiles.removingDataFileByPathAlsoRemovesDV
(or any other new tests that I added). When we hit canTrustManifestReferences()
the allDeletesReferenceManifests
flag is set to true
, the manifestLocations
contain one entry but manifestsWithDeletes
is actually empty.
This results in canTrustManifestReferences
being evaluated to true
, which will then exit early in
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 357 to 360 in 3e6decc
if (!canContainDeletedFiles(manifest, trustManifestReferences)) { | |
filteredManifests.put(manifest, manifest); | |
return manifest; | |
} |
which is because it also exited early in
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 385 to 387 in 3e6decc
if (trustManifestReferences) { | |
return manifestsWithDeletes.contains(manifest.path()); | |
} |
The issue there is that
manifestsWithDeletes
in that line is still empty and so this condition there can never be true
. IMO trustManifestReferences
should be false
in that case because manifestsWithDeletes
was empty.
The above block would always be false when allDeletesReferenceManifests is true, as any of these conditions would render allDeletesReferenceManifests false.
Not necessarily. While this might be true for the DataFileFilterManager
it's slightly different for the DeleteFileFilterManager
. Just assume you're deleting a DataFile, which renders allDeletesReferenceManifests
to false
in DataFileFilterManager
. In the DeleteFileFilterManager
allDeletesReferenceManifests
is true
by default.
I haven't explored this, but maybe we would need to set allDeletesReferenceManifests = false
as well when removeDanglingDeletesFor()
is being called but I think adding this additional non-empty check here still makes sense to me.
I'll go ahead and merge the PR to unblock the release. If necessary we can follow up here afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some early comments
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing something but I think we should double check path based/partition drops/removals by filters, I'm not sure we're capturing those data file entries for passing to the logic to remove orphan DVs for.
a07ae28
to
1233d04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just some nits, thank @nastra !
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Outdated
Show resolved
Hide resolved
F copyWithoutStats = file.copyWithoutStats(); | ||
// add the file here in case it was deleted using an expression. The | ||
// DeleteManifestFilterManager will then remove its matching DV | ||
deleteFiles.add(copyWithoutStats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. If an entry is marked for delete or passes strict evaluation, then we're marking it for delete and we're adding to the tracking set of files for which we have to hunt down and cleanup orphans for.
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Outdated
Show resolved
Hide resolved
core/src/jmh/java/org/apache/iceberg/RewriteDataFilesBenchmark.java
Outdated
Show resolved
Hide resolved
1233d04
to
3e6decc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Just have a question on the canTrustManifestReferences
condition change. please check if my understanding is correct.
thanks for the reviews @amogh-jahagirdar and @stevenzwu. I'll go ahead and merge this to unblock the release and we can follow up on #13222 (comment) if necessary |
This keeps track of all data files to be removed/rewritten in MergingSnapshotProducer and passes those to the ManifestFilterManager for deletes. Once ManifestFilterManager goes through delete manifests, it also checks whether a DV references any of the data files to be removed.
This is needed in addition to #13245 so that we can properly remove orphaned DVs when e.g. a metadata-only delete is performed.
The below benchmark shows that tracking data files to be removed and then detecting orphaned DVs when delete manifests are looked at is only adding a small fraction to the throughput.
without tracking data files to be removed
with tracking data files to be removed