Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(manifests): Adding implementation of manifest files #3

Merged
merged 7 commits into from
Sep 23, 2023

Conversation

zeroshade
Copy link
Member

Adds an implementation for Manifest Lists, Manifest Entries, and Data Files along with interfaces for wrapping file system IO handling.

@zeroshade
Copy link
Member Author

@zeroshade
Copy link
Member Author

@coded9 @Fokko @nastra @rdblue any chance for a review soon to get this in?

@rdblue
Copy link

rdblue commented Sep 7, 2023

@zeroshade, sorry for the delay! I'll find some time for reviews over here.

io/io.go Outdated Show resolved Hide resolved
io/io.go Outdated Show resolved Hide resolved
io/io.go Outdated Show resolved Hide resolved
@zeroshade
Copy link
Member Author

Rebased with the merge for the partitioning stuff, so should be all good here

io/s3.go Outdated
}

if defaultRegion, ok := props[S3Region]; ok {
opts = append(opts, config.WithDefaultRegion(defaultRegion))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure, but I think we might rather want to configure the region here instead of the default region using config.WithRegion(). Also we might want to rename defaultRegion to region to make a clear distinction that S3Region stands for the region (not the default region)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in this situation since we're not really exposing the ability for consumers to set their own options explicitly, I think it ends up being basically identical if we use WithDefaultRegion vs WithRegion as the only real difference between them is that any WithDefaultRegion will always be superceded by any WithRegion options in the config.

By the same token since it wouldn't have any real difference, I agree that it makes sense from a code perspective for me to switch these to use WithRegion rather than WithDefaultRegion.

io/s3.go Outdated Show resolved Hide resolved
@@ -0,0 +1,108 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall this file LGTM. what would be great is to have some tests that use S3 against a minio container, which can be done in a follow-up PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I was planning on doing that in a follow-up PR explicitly. I have the test stuff worked out already and, in conjunction with the iceberg REST API catalog impl I was able to replicate a similar integration test to what pyiceberg uses 😄

manifest.go Outdated
Path string `avro:"manifest_path"`
Len int64 `avro:"manifest_length"`
PartitionSpecID int32 `avro:"partition_spec_id"`
Content ManifestContent `avro:"content"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i had a question here: according to https://iceberg.apache.org/spec/#manifests the added_snapshot_id is required in both V1 and V2, but pyiceberg's definition of the manifest file schema (here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/manifest.py#L275) marks it as optional. Which should I follow here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Fokko

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it seems that the Spec itself for v1 should have added_snapshot_id as optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened apache/iceberg#8600 to fix the Spec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zeroshade just FYI that added_snapshot_id is mandatory after all for V1 and V2, so you might want to follow-up and make it required

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing! i'll make a PR for that, thanks for the heads up

manifest.go Outdated Show resolved Hide resolved
manifest.go Outdated Show resolved Hide resolved
manifest.go Outdated Show resolved Hide resolved
manifest.go Outdated Show resolved Hide resolved
manifest.go Show resolved Hide resolved
PartitionData map[string]any `avro:"partition"`
RecordCount int64 `avro:"record_count"`
FileSize int64 `avro:"file_size_in_bytes"`
BlockSizeInBytes int64 `avro:"block_size_in_bytes"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this shouldn't be written for V2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the fields don't exist in the defined schema for V2 they shouldn't get written when writing the file since we write according to the schema

}

type dataFile struct {
Content ManifestEntryContent `avro:"content"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only exists for V2 but not for V1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bad thing for it to get written for a v1 file? (readers would just ignore the field, right?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readers would/should ignore it. Would it cause too much trouble to not write this field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, it potentially would avoid writing this field for V1 since it isn't in the schema, but I'll double check

Copy link
Member Author

@zeroshade zeroshade Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that if i start with a datafile containing some non-default value for Content the round trip test comes back with the default being what the struct has. so it doesn't get written and won't be read back out since the V1 schemas don't have the field. updated the unit test accordingly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, thanks for confirming

manifest.go Outdated Show resolved Hide resolved
manifest.go Outdated Show resolved Hide resolved
manifest.go Outdated Show resolved Hide resolved
// SortOrderID returns the id representing the sort order for this
// file, or nil if there is no sort order.
SortOrderID() *int
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, i'm reading through the Java code and the best I can see is that the ContentFile interface wraps around the entire ManifestEntry objects and essentially surfaces the individual fields of the data file to the same level as the manifest entry information (which includes the DataSequenceNumber and FileSequenceNumber. But if you look at the spec (https://iceberg.apache.org/spec/#manifests) those values exist in the manifest_entry struct, not the data_file struct, which is why they aren't exposed here at this level.

Looking at the Builder for data files in the Java code (https://github.com/apache/iceberg/blob/d6bc248adb67de74e31dcb9c0af43fef68853d59/core/src/main/java/org/apache/iceberg/DataFiles.java) it also doesn't allow setting the sequence number values down there. You can see both the Data Sequence number and File Sequence Number in the ManifestEntry interface in the Go code.

@nastra
Copy link
Contributor

nastra commented Sep 15, 2023

@zeroshade I did a first pass around the manifest schema and code and left a few comments. I still need to review the APIs for IO and the tests

@zeroshade
Copy link
Member Author

@nastra I've implemented the suggested fixes but I had a few questions that I posed when you get a chance. Thanks again!

internal/avro_schemas.go Outdated Show resolved Hide resolved
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @zeroshade

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants