Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Make PositionDeleteIndex serializable #11463

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Nov 4, 2024

This PR makes PositionDeleteIndex serializable as per the V3 spec.

This work is part of #11122.

@github-actions github-actions bot added the core label Nov 4, 2024
* magic bytes and bitmap for compatibility with Delta.
*/
@Override
public ByteBuffer serialize() {
Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Puffin writers currently assume we will create and pass ByteBuffer. We also compute a checksum on serialized magic bytes and bitmap, so I am allocating a byte array upfront.

@@ -43,6 +53,11 @@ class BitmapPositionDeleteIndex implements PositionDeleteIndex {
this.deleteFiles = deleteFile != null ? Lists.newArrayList(deleteFile) : Lists.newArrayList();
}

BitmapPositionDeleteIndex(RoaringPositionBitmap bitmap, DeleteFile deleteFile) {
this.bitmap = bitmap;
this.deleteFiles = deleteFile != null ? Lists.newArrayList(deleteFile) : Lists.newArrayList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where we can have a deserialized bitmap, but no delete file? Just wondering if we should have a separate constructor or if it makes sense to allow passing null for those cases.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. I have this check here to match the constructor above. There are some legacy code paths that create an index and pass null as DeleteFile. Given that it is a new constructor, we can probably require the file to be non-null, but it won't be consistent with the constructor above.

Copy link
Contributor

@danielcweeks danielcweeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

*/
@Override
public ByteBuffer serialize() {
bitmap.runLengthEncode(); // run-length encode the bitmap before serializing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we more eagerly run length encode the bitmap in the constructor instead of at serialization time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM, I don't think it'll make any difference and probably more readable the way it's written at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually want to accumulate all values prior to applying run-length encoding, so doing this right before the serialization seems like a good place. We don't anticipate the bitmap to change after serialization.

Comment on lines +203 to +204
long cardinality = bitmap.cardinality();
long expectedCardinality = deleteFile.recordCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIt: I feel like we could inline bitMap.cardinality() == deleteFile.recordCount() in the check below and the intent of the check would still be clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this for consistency with the method above. It shouldn't be expensive to call recordCount() twice, so I am OK changing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me actually keep it to avoid one more round of CI.

@aokolnychyi aokolnychyi merged commit 43b2f7d into apache:main Nov 5, 2024
49 checks passed
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @danielcweeks @amogh-jahagirdar!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants